SlideShare a Scribd company logo
1 of 21
Download to read offline
March 10–11, 2015
San Jose
ENGINEERING WORKSHOP
Health monitoring & predictive analytics
To lower the TCO in a datacenter
Christian B. Madsen & Andrei Khurshudov
Engineering Manager & Sr. Director
Seagate Technology
christian.b.madsen@seagate.com
ENGINEERING WORKSHOP
1. The opportunity
2. Our vision and implementation
3. Use cases
4. Summary
Outline
The opportunity
ENGINEERING WORKSHOP
Optimize system management
Reduce potential cost of operation
Improve datacenter efficiency
What if…
Seagate offered you a technology that could help you
ENGINEERING WORKSHOP
▪  1 billion hard drives will be used in cloud datacenters by
2020, highlighting the need to manage drive health at scale
▪  One total outage per datacenter is statistically expected
every year
▪  80% of those outages are not completely explained (or
linked to root causes)
▪  $700,000 is the average cost per incident
▪  $8,000 is the average cost per minute of an unplanned outage
▪  Up to 10% of datacenter accidents are related to
storage
56%
56% of 13ZB
7.3ZB,
>1billion
drives
in cloud
Source: Seagate Strategic Marketing and Research 2013
2020
The problem
Failures in storage lead to costly outages
ENGINEERING WORKSHOP
Top 4 challenges in drive management
Better drive management will lower the TCO
1.  Drive health monitoring
–  Need reliable key performance indicators to track drive health
status
2.  Drive failure prediction
–  “Ultimately, we want to know when our drives will fail so we
can take actions before that happens”
3.  Drive failure diagnostics and management automation
–  Need to correctly identify and quickly resolve issues
–  Need to prevent false alerts to reduce cost of failure handling
4.  Drive lifespan extension
–  Need to know how to optimize operating environment for
better reliability
–  Need to reuse partially good drives (should be possible with in-
drive diagnostic, IDD)
Our vision and implementation
ENGINEERING WORKSHOP
Our vision
•  Drive-centric health monitoring
•  Analytics and predictive models
•  Closed-loop automation
Data
Center
Global Access
• Report storage health
• Run drive self-test
• Shut-down systems
• Repair drives
• Run auto-FA
• Point at an issue
• Highlight inefficiency
• Predict reliability
• Detect anomalies
Actionable Decisions
ANALYTICS
PREDICTIONS
CONTROL
Data Aggregation
Early Warning System
Quick Issue Resolution
Cloud GazerTM
MONITORING
Monitoring, analytics, prediction and control – “The internet of things”2
ENGINEERING WORKSHOP
Functional diagram
Monitor
Exception (alert)
Compliance (threshold)
Recommended action
Resolution
Closed-loop automation
Yes
No
Choose action
Monitor
Drive predicted to fail
Drive health
Reset or turn off drive
Turn off drive
Example
Passes
Not passing
Choose action
Automation
Choosing action from recommended
action can be automated by tying it to the
specific application or saving choices
Monitoring, intelligent decisions and automation
ENGINEERING WORKSHOP
Real-time metric
aggregation Cloud
GazerTM
Dashboard
ReST Platform
•  Query data
•  Check thresholds
•  Manage drives
Data pool
(10,000s of drives) Server
Storage
Software
REST API
Calls
•  Cluster
•  Server
•  Drive
Agents
Storage
Server
Storage
Server
Storage
Server
Storage
Server
ReSTful API
Drives Drives
Drives Drives
Storage
Server
Storage
Server
Cloud GazerTM
Elements
Storage eco
system
Analytics
Engine
Cloud GazerTM
Drive Monitoring and Analytics
Drives
Drives
Storage Software
noSQL
ReSTful
API
Cloud GazerTM Dashboard
Architecture overview
Implementation
Use cases
ENGINEERING WORKSHOP
Danger zone
Datacenter failure rate
Failure detection
Warning about expected drive failures.
Relies on the proprietary failure prediction
algorithms that use unsupervised machine
learning techniques. Expected average
failure prediction time window is from 9 days
to 12 days.
HDD population failure rate
Measuring stress and estimating
failure acceleration of the disk
drive population in real time. Relies
on the proprietary failure prediction
algorithmsOverload detection
Detecting and reporting when
drive load exceeds design limits
Compliance (thresholds)
Recommended action
How to increase drive reliability
Degradation and performance warnings
ENGINEERING WORKSHOP
Workload optimization
Workload
predominantly
hitting one
server
Before Load balancing issues After Workload distributed over servers and time
WeekDay Month
Workload
peaked on
Sunday
Drive visibility tools to improve workload balancing
ENGINEERING WORKSHOP
Unsupervised machine learning and failure prediction
…Drives in field
Multivariate
time-series
monitoring
Apply failure
prediction
algorithm in
parallel in real-
time
Real-time
status
prediction of
drive – Fine or
going to fail
For now, an average failure prediction window is on the order of 9 to 12 days
Failure prediction accuracy ranges from 55% to 90%
No interaction between drive set, no prior knowledge
ENGINEERING WORKSHOP
Prediction and follow up actions
Systematic failure predicted:
3 out of 5 drives predicted to fail sit in
end location of servers
Heat map indicates drives at risk and you can issue drive tests (DST, IDD,…) to resolve or corroborate
ENGINEERING WORKSHOP
Find failure triggers
Systematic failure predicted:
3 out of 5 drives predicted to fail sit in
end location of servers
Common factors for drives in the end
position is a cooler temperature.
Therefore increasing the server
temperature may reduce the
(dominant) failure mechanism and
increase drive reliability
Root cause tools including a temperature heat map can help you triage the cause of your drive issues
ENGINEERING WORKSHOP
Failure prediction lead time
Currently catch 55-90% of
failures ahead of time
Case study 2, we predicted 5 drives to failed 23 days prior to
failure, 2 drives 22 days prior to failure,… 2 drives just one
day in advance
Case study 1, we predicted most drives (118 drives) to fail 12
days prior to failure
We can predict drives will fail on average 9-10 days before the failure
Summary
ENGINEERING WORKSHOP
•  Truly drive-centric management tool for the cloud
•  Most efficient tool for extracting drive health information using Seagate IP
▪  Nobody knows drives better than us
▪  Freeware utilities are frequently wrong
•  Runs on any Linux system with little overhead (<1%)
Windows is next
•  Data can be collected, monitored and analyzed locally or in the Cloud
•  ReSTful API to interact with other software
•  New analytics, prediction, AI, and control capabilities are added continually
•  Drive repair will be possible with in-drive diagnostic
•  Enclosure control will be possible by summer 2015
Simply SMARTer
Competition
Why Cloud GazerTM?
*Seagate drives
Questions?

More Related Content

Viewers also liked

生命教育--4塊糖的省思
生命教育--4塊糖的省思生命教育--4塊糖的省思
生命教育--4塊糖的省思chen bowei
 
The business of consulting (handout)
The business of consulting (handout)The business of consulting (handout)
The business of consulting (handout)Ian Gotts
 
Geekmeet Iasi Intro
Geekmeet Iasi IntroGeekmeet Iasi Intro
Geekmeet Iasi IntroGeekMeet
 
What Is Windows Azure
What Is Windows AzureWhat Is Windows Azure
What Is Windows AzureDominic Green
 
New Libertarian Manifesto
New Libertarian ManifestoNew Libertarian Manifesto
New Libertarian Manifestoguest12a2146
 
#IgNiteTH Keynote by iannnnn
#IgNiteTH Keynote by iannnnn#IgNiteTH Keynote by iannnnn
#IgNiteTH Keynote by iannnnniannnnn
 
Generell presentasjon
Generell presentasjonGenerell presentasjon
Generell presentasjonGlenn Melby
 
English Translation Of Go Forward Plan, Harvard Bus Review
English Translation Of Go Forward Plan, Harvard Bus ReviewEnglish Translation Of Go Forward Plan, Harvard Bus Review
English Translation Of Go Forward Plan, Harvard Bus ReviewThomas Nastas
 
2011 Be A Superhero - 'Why, How and What' Event!
2011 Be A Superhero - 'Why, How and What' Event!2011 Be A Superhero - 'Why, How and What' Event!
2011 Be A Superhero - 'Why, How and What' Event!Jason Dunstone
 
Referansegruppe 200209
Referansegruppe 200209Referansegruppe 200209
Referansegruppe 200209Glenn Melby
 
20100812 Comfi Web Presentation
20100812 Comfi Web Presentation20100812 Comfi Web Presentation
20100812 Comfi Web PresentationGunther De Backer
 
D4 I Framework Web
D4 I Framework WebD4 I Framework Web
D4 I Framework WebIain Sanders
 
How to Master UX Testing in an Agile Design Process
How to Master UX Testing in an Agile Design ProcessHow to Master UX Testing in an Agile Design Process
How to Master UX Testing in an Agile Design ProcessUserZoom
 
Outsourcing to India Publication IJAS 2010
Outsourcing to India Publication IJAS 2010Outsourcing to India Publication IJAS 2010
Outsourcing to India Publication IJAS 2010Debopriyo Roy
 
PresentacióN1
PresentacióN1PresentacióN1
PresentacióN1Alex_27
 
Designing a Customer Feedback Program to Measure and Improve User Satisfaction
Designing a Customer Feedback Program to Measure and Improve User SatisfactionDesigning a Customer Feedback Program to Measure and Improve User Satisfaction
Designing a Customer Feedback Program to Measure and Improve User SatisfactionUserZoom
 

Viewers also liked (20)

生命教育--4塊糖的省思
生命教育--4塊糖的省思生命教育--4塊糖的省思
生命教育--4塊糖的省思
 
Generation Myth
Generation MythGeneration Myth
Generation Myth
 
The business of consulting (handout)
The business of consulting (handout)The business of consulting (handout)
The business of consulting (handout)
 
Geekmeet Iasi Intro
Geekmeet Iasi IntroGeekmeet Iasi Intro
Geekmeet Iasi Intro
 
What Is Windows Azure
What Is Windows AzureWhat Is Windows Azure
What Is Windows Azure
 
New Libertarian Manifesto
New Libertarian ManifestoNew Libertarian Manifesto
New Libertarian Manifesto
 
Tolerability and Decision Making Discussion
Tolerability and Decision Making DiscussionTolerability and Decision Making Discussion
Tolerability and Decision Making Discussion
 
#IgNiteTH Keynote by iannnnn
#IgNiteTH Keynote by iannnnn#IgNiteTH Keynote by iannnnn
#IgNiteTH Keynote by iannnnn
 
Generell presentasjon
Generell presentasjonGenerell presentasjon
Generell presentasjon
 
English Translation Of Go Forward Plan, Harvard Bus Review
English Translation Of Go Forward Plan, Harvard Bus ReviewEnglish Translation Of Go Forward Plan, Harvard Bus Review
English Translation Of Go Forward Plan, Harvard Bus Review
 
2011 Be A Superhero - 'Why, How and What' Event!
2011 Be A Superhero - 'Why, How and What' Event!2011 Be A Superhero - 'Why, How and What' Event!
2011 Be A Superhero - 'Why, How and What' Event!
 
Referansegruppe 200209
Referansegruppe 200209Referansegruppe 200209
Referansegruppe 200209
 
20100812 Comfi Web Presentation
20100812 Comfi Web Presentation20100812 Comfi Web Presentation
20100812 Comfi Web Presentation
 
Dfi Portfolio
Dfi PortfolioDfi Portfolio
Dfi Portfolio
 
D4 I Framework Web
D4 I Framework WebD4 I Framework Web
D4 I Framework Web
 
How to Master UX Testing in an Agile Design Process
How to Master UX Testing in an Agile Design ProcessHow to Master UX Testing in an Agile Design Process
How to Master UX Testing in an Agile Design Process
 
Collaborate
CollaborateCollaborate
Collaborate
 
Outsourcing to India Publication IJAS 2010
Outsourcing to India Publication IJAS 2010Outsourcing to India Publication IJAS 2010
Outsourcing to India Publication IJAS 2010
 
PresentacióN1
PresentacióN1PresentacióN1
PresentacióN1
 
Designing a Customer Feedback Program to Measure and Improve User Satisfaction
Designing a Customer Feedback Program to Measure and Improve User SatisfactionDesigning a Customer Feedback Program to Measure and Improve User Satisfaction
Designing a Customer Feedback Program to Measure and Improve User Satisfaction
 

Similar to Lower Your Datacenter TCO with Drive Health Monitoring and Predictive Analytics

Automating IT Analytics to Optimize Service Delivery and Cost at Safeway - A ...
Automating IT Analytics to Optimize Service Delivery and Cost at Safeway - A ...Automating IT Analytics to Optimize Service Delivery and Cost at Safeway - A ...
Automating IT Analytics to Optimize Service Delivery and Cost at Safeway - A ...TeamQuest Corporation
 
Business Continuity Presentation[1]
Business Continuity Presentation[1]Business Continuity Presentation[1]
Business Continuity Presentation[1]jrm1224
 
Business Continuity Presentation
Business Continuity PresentationBusiness Continuity Presentation
Business Continuity Presentationperry57123
 
Web Speed And Scalability
Web Speed And ScalabilityWeb Speed And Scalability
Web Speed And ScalabilityJason Ragsdale
 
Designing apps for resiliency
Designing apps for resiliencyDesigning apps for resiliency
Designing apps for resiliencyMasashi Narumoto
 
Performance tuning intro
Performance tuning introPerformance tuning intro
Performance tuning introaioughydchapter
 
Strata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesStrata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesJun Liu
 
NCET Tech Bite - Cloud Storage and Data Backup - June 2015
NCET Tech Bite - Cloud Storage and Data Backup - June 2015NCET Tech Bite - Cloud Storage and Data Backup - June 2015
NCET Tech Bite - Cloud Storage and Data Backup - June 2015Archersan
 
Disk health prediction for Ceph
Disk health prediction for CephDisk health prediction for Ceph
Disk health prediction for CephCeph Community
 
Data Center Monitoring and Management Best Practices: How You Can Benefit fro...
Data Center Monitoring and Management Best Practices: How You Can Benefit fro...Data Center Monitoring and Management Best Practices: How You Can Benefit fro...
Data Center Monitoring and Management Best Practices: How You Can Benefit fro...Upsite Technologies
 
IoT Evolution Expo- Machine Learning and the Cloud
IoT Evolution Expo- Machine Learning and the CloudIoT Evolution Expo- Machine Learning and the Cloud
IoT Evolution Expo- Machine Learning and the CloudValue Amplify Consulting
 
Stay clear of the bugs: Troubleshooting Applications in Microsoft Azure
Stay clear of the bugs: Troubleshooting Applications in Microsoft AzureStay clear of the bugs: Troubleshooting Applications in Microsoft Azure
Stay clear of the bugs: Troubleshooting Applications in Microsoft AzureHARMAN Services
 
Server monitoring basics every sysadmin should know
Server monitoring basics every sysadmin should knowServer monitoring basics every sysadmin should know
Server monitoring basics every sysadmin should knowserver-finder.com
 
HPE Infosight: introduction & demo
HPE Infosight: introduction & demoHPE Infosight: introduction & demo
HPE Infosight: introduction & demoXylos
 
Analysis of Database Issues using AHF and Machine Learning v2 - SOUG
Analysis of Database Issues using AHF and Machine Learning v2 -  SOUGAnalysis of Database Issues using AHF and Machine Learning v2 -  SOUG
Analysis of Database Issues using AHF and Machine Learning v2 - SOUGSandesh Rao
 
How to manage and monitor large sql server estates
How to manage and monitor large sql server estatesHow to manage and monitor large sql server estates
How to manage and monitor large sql server estatesRed Gate Software
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks
 

Similar to Lower Your Datacenter TCO with Drive Health Monitoring and Predictive Analytics (20)

Automating IT Analytics to Optimize Service Delivery and Cost at Safeway - A ...
Automating IT Analytics to Optimize Service Delivery and Cost at Safeway - A ...Automating IT Analytics to Optimize Service Delivery and Cost at Safeway - A ...
Automating IT Analytics to Optimize Service Delivery and Cost at Safeway - A ...
 
Business Continuity Presentation[1]
Business Continuity Presentation[1]Business Continuity Presentation[1]
Business Continuity Presentation[1]
 
Business Continuity Presentation
Business Continuity PresentationBusiness Continuity Presentation
Business Continuity Presentation
 
Web Speed And Scalability
Web Speed And ScalabilityWeb Speed And Scalability
Web Speed And Scalability
 
Designing apps for resiliency
Designing apps for resiliencyDesigning apps for resiliency
Designing apps for resiliency
 
Performance tuning intro
Performance tuning introPerformance tuning intro
Performance tuning intro
 
Strata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesStrata + Hadoop 2015 Slides
Strata + Hadoop 2015 Slides
 
Performance Tuning intro
Performance Tuning introPerformance Tuning intro
Performance Tuning intro
 
NCET Tech Bite - Cloud Storage and Data Backup - June 2015
NCET Tech Bite - Cloud Storage and Data Backup - June 2015NCET Tech Bite - Cloud Storage and Data Backup - June 2015
NCET Tech Bite - Cloud Storage and Data Backup - June 2015
 
NCET Tech
NCET Tech NCET Tech
NCET Tech
 
Disk health prediction for Ceph
Disk health prediction for CephDisk health prediction for Ceph
Disk health prediction for Ceph
 
Data Center Monitoring and Management Best Practices: How You Can Benefit fro...
Data Center Monitoring and Management Best Practices: How You Can Benefit fro...Data Center Monitoring and Management Best Practices: How You Can Benefit fro...
Data Center Monitoring and Management Best Practices: How You Can Benefit fro...
 
IoT Evolution Expo- Machine Learning and the Cloud
IoT Evolution Expo- Machine Learning and the CloudIoT Evolution Expo- Machine Learning and the Cloud
IoT Evolution Expo- Machine Learning and the Cloud
 
Stay clear of the bugs: Troubleshooting Applications in Microsoft Azure
Stay clear of the bugs: Troubleshooting Applications in Microsoft AzureStay clear of the bugs: Troubleshooting Applications in Microsoft Azure
Stay clear of the bugs: Troubleshooting Applications in Microsoft Azure
 
Server monitoring basics every sysadmin should know
Server monitoring basics every sysadmin should knowServer monitoring basics every sysadmin should know
Server monitoring basics every sysadmin should know
 
Machine Learning Impact on IoT - Part 2
Machine Learning Impact on IoT - Part 2Machine Learning Impact on IoT - Part 2
Machine Learning Impact on IoT - Part 2
 
HPE Infosight: introduction & demo
HPE Infosight: introduction & demoHPE Infosight: introduction & demo
HPE Infosight: introduction & demo
 
Analysis of Database Issues using AHF and Machine Learning v2 - SOUG
Analysis of Database Issues using AHF and Machine Learning v2 -  SOUGAnalysis of Database Issues using AHF and Machine Learning v2 -  SOUG
Analysis of Database Issues using AHF and Machine Learning v2 - SOUG
 
How to manage and monitor large sql server estates
How to manage and monitor large sql server estatesHow to manage and monitor large sql server estates
How to manage and monitor large sql server estates
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and Delta
 

More from Andrei Khurshudov

Hyper-Converged Infrastructure: Big Data and IoT opportunities and challenges...
Hyper-Converged Infrastructure: Big Data and IoT opportunities and challenges...Hyper-Converged Infrastructure: Big Data and IoT opportunities and challenges...
Hyper-Converged Infrastructure: Big Data and IoT opportunities and challenges...Andrei Khurshudov
 
Short introduction to Big Data Analytics, the Internet of Things, and their s...
Short introduction to Big Data Analytics, the Internet of Things, and their s...Short introduction to Big Data Analytics, the Internet of Things, and their s...
Short introduction to Big Data Analytics, the Internet of Things, and their s...Andrei Khurshudov
 
clusterstor-hadoop-data-sheet
clusterstor-hadoop-data-sheetclusterstor-hadoop-data-sheet
clusterstor-hadoop-data-sheetAndrei Khurshudov
 
Future Information Growth And Storage Device Reliability 2007
Future Information Growth And Storage Device Reliability 2007Future Information Growth And Storage Device Reliability 2007
Future Information Growth And Storage Device Reliability 2007Andrei Khurshudov
 
Reliability Of Solid State Drives 2008
Reliability Of Solid State Drives 2008Reliability Of Solid State Drives 2008
Reliability Of Solid State Drives 2008Andrei Khurshudov
 

More from Andrei Khurshudov (9)

Hyper-Converged Infrastructure: Big Data and IoT opportunities and challenges...
Hyper-Converged Infrastructure: Big Data and IoT opportunities and challenges...Hyper-Converged Infrastructure: Big Data and IoT opportunities and challenges...
Hyper-Converged Infrastructure: Big Data and IoT opportunities and challenges...
 
Short introduction to Big Data Analytics, the Internet of Things, and their s...
Short introduction to Big Data Analytics, the Internet of Things, and their s...Short introduction to Big Data Analytics, the Internet of Things, and their s...
Short introduction to Big Data Analytics, the Internet of Things, and their s...
 
Seagate_1
Seagate_1Seagate_1
Seagate_1
 
Using Big Data Analytics
Using Big Data AnalyticsUsing Big Data Analytics
Using Big Data Analytics
 
Presentation_Final
Presentation_FinalPresentation_Final
Presentation_Final
 
clusterstor-hadoop-data-sheet
clusterstor-hadoop-data-sheetclusterstor-hadoop-data-sheet
clusterstor-hadoop-data-sheet
 
Long Term Data Storage 2007
Long Term Data Storage 2007Long Term Data Storage 2007
Long Term Data Storage 2007
 
Future Information Growth And Storage Device Reliability 2007
Future Information Growth And Storage Device Reliability 2007Future Information Growth And Storage Device Reliability 2007
Future Information Growth And Storage Device Reliability 2007
 
Reliability Of Solid State Drives 2008
Reliability Of Solid State Drives 2008Reliability Of Solid State Drives 2008
Reliability Of Solid State Drives 2008
 

Lower Your Datacenter TCO with Drive Health Monitoring and Predictive Analytics

  • 2. ENGINEERING WORKSHOP Health monitoring & predictive analytics To lower the TCO in a datacenter Christian B. Madsen & Andrei Khurshudov Engineering Manager & Sr. Director Seagate Technology christian.b.madsen@seagate.com
  • 3. ENGINEERING WORKSHOP 1. The opportunity 2. Our vision and implementation 3. Use cases 4. Summary Outline
  • 5. ENGINEERING WORKSHOP Optimize system management Reduce potential cost of operation Improve datacenter efficiency What if… Seagate offered you a technology that could help you
  • 6. ENGINEERING WORKSHOP ▪  1 billion hard drives will be used in cloud datacenters by 2020, highlighting the need to manage drive health at scale ▪  One total outage per datacenter is statistically expected every year ▪  80% of those outages are not completely explained (or linked to root causes) ▪  $700,000 is the average cost per incident ▪  $8,000 is the average cost per minute of an unplanned outage ▪  Up to 10% of datacenter accidents are related to storage 56% 56% of 13ZB 7.3ZB, >1billion drives in cloud Source: Seagate Strategic Marketing and Research 2013 2020 The problem Failures in storage lead to costly outages
  • 7. ENGINEERING WORKSHOP Top 4 challenges in drive management Better drive management will lower the TCO 1.  Drive health monitoring –  Need reliable key performance indicators to track drive health status 2.  Drive failure prediction –  “Ultimately, we want to know when our drives will fail so we can take actions before that happens” 3.  Drive failure diagnostics and management automation –  Need to correctly identify and quickly resolve issues –  Need to prevent false alerts to reduce cost of failure handling 4.  Drive lifespan extension –  Need to know how to optimize operating environment for better reliability –  Need to reuse partially good drives (should be possible with in- drive diagnostic, IDD)
  • 8. Our vision and implementation
  • 9. ENGINEERING WORKSHOP Our vision •  Drive-centric health monitoring •  Analytics and predictive models •  Closed-loop automation Data Center Global Access • Report storage health • Run drive self-test • Shut-down systems • Repair drives • Run auto-FA • Point at an issue • Highlight inefficiency • Predict reliability • Detect anomalies Actionable Decisions ANALYTICS PREDICTIONS CONTROL Data Aggregation Early Warning System Quick Issue Resolution Cloud GazerTM MONITORING Monitoring, analytics, prediction and control – “The internet of things”2
  • 10. ENGINEERING WORKSHOP Functional diagram Monitor Exception (alert) Compliance (threshold) Recommended action Resolution Closed-loop automation Yes No Choose action Monitor Drive predicted to fail Drive health Reset or turn off drive Turn off drive Example Passes Not passing Choose action Automation Choosing action from recommended action can be automated by tying it to the specific application or saving choices Monitoring, intelligent decisions and automation
  • 11. ENGINEERING WORKSHOP Real-time metric aggregation Cloud GazerTM Dashboard ReST Platform •  Query data •  Check thresholds •  Manage drives Data pool (10,000s of drives) Server Storage Software REST API Calls •  Cluster •  Server •  Drive Agents Storage Server Storage Server Storage Server Storage Server ReSTful API Drives Drives Drives Drives Storage Server Storage Server Cloud GazerTM Elements Storage eco system Analytics Engine Cloud GazerTM Drive Monitoring and Analytics Drives Drives Storage Software noSQL ReSTful API Cloud GazerTM Dashboard Architecture overview Implementation
  • 13. ENGINEERING WORKSHOP Danger zone Datacenter failure rate Failure detection Warning about expected drive failures. Relies on the proprietary failure prediction algorithms that use unsupervised machine learning techniques. Expected average failure prediction time window is from 9 days to 12 days. HDD population failure rate Measuring stress and estimating failure acceleration of the disk drive population in real time. Relies on the proprietary failure prediction algorithmsOverload detection Detecting and reporting when drive load exceeds design limits Compliance (thresholds) Recommended action How to increase drive reliability Degradation and performance warnings
  • 14. ENGINEERING WORKSHOP Workload optimization Workload predominantly hitting one server Before Load balancing issues After Workload distributed over servers and time WeekDay Month Workload peaked on Sunday Drive visibility tools to improve workload balancing
  • 15. ENGINEERING WORKSHOP Unsupervised machine learning and failure prediction …Drives in field Multivariate time-series monitoring Apply failure prediction algorithm in parallel in real- time Real-time status prediction of drive – Fine or going to fail For now, an average failure prediction window is on the order of 9 to 12 days Failure prediction accuracy ranges from 55% to 90% No interaction between drive set, no prior knowledge
  • 16. ENGINEERING WORKSHOP Prediction and follow up actions Systematic failure predicted: 3 out of 5 drives predicted to fail sit in end location of servers Heat map indicates drives at risk and you can issue drive tests (DST, IDD,…) to resolve or corroborate
  • 17. ENGINEERING WORKSHOP Find failure triggers Systematic failure predicted: 3 out of 5 drives predicted to fail sit in end location of servers Common factors for drives in the end position is a cooler temperature. Therefore increasing the server temperature may reduce the (dominant) failure mechanism and increase drive reliability Root cause tools including a temperature heat map can help you triage the cause of your drive issues
  • 18. ENGINEERING WORKSHOP Failure prediction lead time Currently catch 55-90% of failures ahead of time Case study 2, we predicted 5 drives to failed 23 days prior to failure, 2 drives 22 days prior to failure,… 2 drives just one day in advance Case study 1, we predicted most drives (118 drives) to fail 12 days prior to failure We can predict drives will fail on average 9-10 days before the failure
  • 20. ENGINEERING WORKSHOP •  Truly drive-centric management tool for the cloud •  Most efficient tool for extracting drive health information using Seagate IP ▪  Nobody knows drives better than us ▪  Freeware utilities are frequently wrong •  Runs on any Linux system with little overhead (<1%) Windows is next •  Data can be collected, monitored and analyzed locally or in the Cloud •  ReSTful API to interact with other software •  New analytics, prediction, AI, and control capabilities are added continually •  Drive repair will be possible with in-drive diagnostic •  Enclosure control will be possible by summer 2015 Simply SMARTer Competition Why Cloud GazerTM? *Seagate drives