Seagate relies heavily on big data analytics to ensure high quality in data storage. As data storage needs grow exponentially, predictive analytics are crucial to avoid costly failures. Seagate collects terabytes of manufacturing, testing, component, and field data daily. This data is analyzed using machine learning algorithms to predict and prevent drive failures, helping ensure the reliability of over 1 billion drives expected in cloud datacenters by 2020. Seagate's big data analytics infrastructure combines comprehensive data collection, large-scale analytics capabilities, and data-driven decision making to advance quality control in high-volume data storage manufacturing.
1. Big Data Analytics for High-Quality Big Data Storage
Andrei Khurshudov, PhD
Chief Technologist
Analytics and Insights
Seagate
2015
Andrei.Khurshudov@seagate.com
2. 2
You May Know Seagate as a Hard Drive Manufacturer…
§ $14B in revenue
§ 50K+ employees worldwide
§ 1st to ship over 2 billion drives
§ Stores more than 40% of the world’s data
§ 43,000 Cloud services clients worldwide
But We’re Also a Company That:
Relies heavily on PREDICTIVE ANALYTICS
4. 4
Sources: Reinsel, David. “Where in the World Is Storage: A Look at Byte Density Across the Globe” IDC October 2013, IDC/EMC Digital Universe, April 2014 + Seagate Estimates
Relevant Numbers
MAKING SO MANY HIGHLY RELIABLE STORAGE DEVICES BECOMES IMPOSSIBLE WITHOUT BIG DATA ANALYTICS
• By 2020, 1 billion hard drives will be used in cloud datacenters, highlighting the need for high-quality data storage
• Statistically, 1 total outage per datacenter (DC) is expected every year
• $700,000 is the average cost per incident
• $8,000 is the average cost per minute of an unplanned outage
• Up to 10% of DC incidents are related to storage
[Chart: 56%; >1 billion drives in cloud by 2020. Source: Seagate Strategic Marketing and Research 2013]
5. 5
Evolution of Data in Quality
BIG DATA-DRIVEN QUALITY IS THE LATEST EVOLUTIONARY STEP
[Timeline figure, earliest stage first]
All 5 units produced today work fine!
No data available
Let’s track a few parameters that seem important ...
Few charts and tables per week
1924... Let’s impose some control limits...
Things are getting too complex!
KBs of quality data per week
Automated production
SPC + Excel, Minitab, JMP, SQL DB...
MB/week – GB/week
E2E data collection, Field Telemetry + Machine Learning, Hadoop, Spark, ...
TBs/day
What is next?
TB/hour and beyond
1001011...
6. 6
Data Universe: 2020
• 44 ZB: amount of data that will be created
• 13 ZB: amount of data that will be useful if stored
• 6.5 ZB: total amount of data that installed capacity will be able to hold worldwide
BUT THIS PRESENTATION IS ABOUT DIFFERENT DATA
Sources: Reinsel, David. “Where in the World Is Storage: A Look at Byte Density Across the Globe” IDC October 2013, IDC/EMC Digital Universe, April 2014 + Seagate Estimates
1 ZB = 10^3 EB = 10^6 PB = 10^9 TB = 10^12 GB = 10^21 bytes
Largest available drive today is 10 TB
6.5 ZB ~ 6.5 x 10^8 of the largest drives available today, or 650,000,000 drives
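The drive-count estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, as in the conversion line; the variable names are illustrative and not from the slides.

```python
# Decimal (SI) storage units expressed in bytes, as in the conversion line above.
TB = 10**12
ZB = 10**21

installed_capacity_bytes = 65 * ZB // 10   # 6.5 ZB, kept as an exact integer
largest_drive_bytes = 10 * TB              # largest drive cited on the slide

drives_needed = installed_capacity_bytes // largest_drive_bytes
print(f"{drives_needed:,} drives")  # 650,000,000 drives
```

Integer arithmetic is used so the result is exact rather than subject to floating-point rounding.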
7. 7
Seagate Universe: 2015
• 175,000,000: number of drives produced per year
• 100-200: number of hours drive spends in manufacturing tests
• 6: drives produced every second
• 1,000+: variables collected for each drive produced; also, number of drives per product in extended test at any given time
• 100,000+: variables collected for the incoming parts
• 30+: SMART variables collected for each drive in the field over time
• 10+: MBs of health logs collected by each drive itself in the field
• 1,000,000+: data points collected from drive field telemetry
WE HAVE ALL THE DATA WE NEED TO ENABLE BIG DATA ANALYTICS
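As a rough consistency check on the production figures, the yearly volume can be converted to a per-second rate. The sketch below assumes round-the-clock production over a 365-day year, which the slides do not state, and treats "1,000+" variables per drive as a lower bound.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, ignoring leap years

drives_per_year = 175_000_000
variables_per_drive = 1_000  # lower bound: the slide says "1,000+"

# Assumption: production is spread evenly over every second of the year.
drives_per_second = drives_per_year / SECONDS_PER_YEAR
values_per_year = drives_per_year * variables_per_drive

print(f"{drives_per_second:.2f} drives/s")
print(f"at least {values_per_year:.2e} per-drive values/year")
```

Under these assumptions the rate comes out at roughly 5.5 drives per second, in line with the rounded figure of 6 on the slide.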
8. 8
Big Data-Driven Quality Concept: What is it?
All Elements Working Together As One System:
End-to-End Coherent, Scalable Data Collection and Retention
• Drive Quality Engineering and Assurance Data
• Drive Assembly and Manufacturing Test Data
• Incoming Components Data
• Ongoing Quality and Reliability Test Data
• Returned Drives Test and Diagnostics Data
• Customer Integration and Field Data (including Field Telemetry)
Big Data Analytics Infrastructure (H/W + S/W) and Algorithms
• Predictive Life Models
• Test auto-Diagnostics and Alerts
• Predictive Financial Models
• Robust Excursion Detection Algos
• Ad-hoc Big Data Analytics Projects
• In-situ Failure Prediction
Big Data-Driven Quality Decision Layer
9. 9
Seagate Enterprise Analytics Infrastructure
Data sources:
• Business Systems (Sales, Logistics, Finance, etc.)
• Factory, Quality Systems and Testers, Component Suppliers
• Field Data (including telemetry)
EDW:
• 140 TB (usable)
• Loads 450 GB new data daily
• Most factory data 100%, some sampled
Hadoop (Enterprise Hadoop, Local Research Hadoop):
• 1.5 PB, 3.5 PB
• Loads 1.5 TB new data daily
• Much longer retention of Factory Data
• 100% of factory data
Tools:
• Dashboards & Visualization (Tableau)
• Advanced Statistical Analytics Tool Suite
• Standard Reporting (Business Objects)
10. 10
What Are We Predicting/Explaining/Detecting? Examples
THESE TASKS REQUIRE ADVANCED ALGORITHMS, END-TO-END DATA COLLECTION, AND POWERFUL ANALYTICS INFRASTRUCTURE
EXAMPLE 1
Populations: P (Passers), F (Failures); 1 drive ~ 1,000 attributes
Given: 5M drives in total, 5K Fail (0.1%)
Task: Explain the difference between P and F and build a predictive model
CONSTRAINT: Model mis-registration rate << Detection rate
EXAMPLE 2
Populations: P (Passers), F (Failures) @ t0, F (Failures) @ t1; 1 drive ~ 1,000 attributes
Given: 5M drives in total, 5K Fail @ t0
Task: Predict future drive failures at t1
1) “weak model” ~ Predict % of the population failed at t1
2) “strong model” ~ Predict individual drives to fail at t1
CONSTRAINT: Model mis-registration rate << Detection rate
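The constraint matters because failures are only 0.1% of the population, so even a small false rate on healthy drives can swamp the true detections. The sketch below works through the counts for Example 1. It assumes that "mis-registration rate" means the fraction of healthy drives flagged as failing, which the slides do not define. The 90% detection rate and 2.5% false rate are borrowed from the customer A row later in this deck purely as illustrative inputs, and the two smaller false rates are made up.

```python
def flagged_counts(total, failures, detection_rate, false_rate):
    """Return (true alarms, false alarms) for a population of drives.

    detection_rate: fraction of failing drives that the model flags.
    false_rate: fraction of healthy drives that the model flags
                (an assumed reading of "mis-registration rate").
    """
    healthy = total - failures
    true_alarms = failures * detection_rate
    false_alarms = healthy * false_rate
    return true_alarms, false_alarms


def precision(true_alarms, false_alarms):
    """Share of flagged drives that really fail."""
    return true_alarms / (true_alarms + false_alarms)


total, failures = 5_000_000, 5_000  # "5M drives in total, 5K Fail (0.1%)"

for false_rate in (0.025, 0.001, 0.0001):  # illustrative values only
    tp, fp = flagged_counts(total, failures, detection_rate=0.90, false_rate=false_rate)
    print(f"false rate {false_rate:.4%}: {tp:,.0f} true alarms, "
          f"{fp:,.0f} false alarms, precision {precision(tp, fp):.1%}")
```

With these inputs, a 2.5% false rate produces about 125,000 false alarms against 4,500 true ones, so only around 3.5% of flagged drives would really fail. The false rate has to fall far below the detection rate before most alarms are genuine.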
11. 11
An Example: Random Forest
EXAMPLE 1
Advantages
1) “Test data set” is available for model verification
2) “Confusion matrix” is available to check the goodness of the model
3) High robustness for low-quality and incomplete data sets
Disadvantages
1) “Black box” model – difficult to understand its predictions
Example of Model Self-Test: Failure Prediction
                  Passed   Failed
Pred. Correct     2070     391
Pred. Wrong       3        20
False Rate %      0.1%     4.9%
Correct rate %    99.9%    95.1%
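The percentages in the self-test table follow directly from the counts. A minimal sketch, assuming each column holds the drives of one actual class (Passed or Failed) and the rows count correct and wrong predictions for that class; the function and variable names are illustrative.

```python
def rates(correct, wrong):
    """Return (correct %, false %) for one class of drives."""
    total = correct + wrong
    return 100 * correct / total, 100 * wrong / total


# Counts copied from the self-test table: {actual class: (pred. correct, pred. wrong)}
self_test = {"Passed": (2070, 3), "Failed": (391, 20)}

for label, (correct, wrong) in self_test.items():
    correct_pct, false_pct = rates(correct, wrong)
    print(f"{label}: correct {correct_pct:.1f}%, false {false_pct:.1f}%")
```

Rounded to one decimal place, this reproduces the 99.9% / 0.1% and 95.1% / 4.9% figures shown in the table.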
[Diagram: the original data is randomized N times; each round splits it into a 70% subset (D1 ... DN) and 30% test data (T1 ... TN), and each test set yields a decision and an accuracy.]
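The randomize-and-split procedure in the diagram can be sketched as a repeated random holdout. This is an illustration under stated assumptions, not the author's implementation: the single-threshold model stands in for the random forest, the data is synthetic, and every function name and parameter is hypothetical.

```python
import random
from statistics import mean


def repeated_holdout(rows, fit, predict, n_rounds=10, train_frac=0.7, seed=0):
    """Shuffle, split 70/30, fit on the 70%, score on the 30%; repeat n_rounds times.

    rows: list of (feature, label) pairs.
    Returns the list of per-round test accuracies.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_rounds):
        shuffled = rows[:]
        rng.shuffle(shuffled)                         # "Randomize"
        cut = int(len(shuffled) * train_frac)
        train, test = shuffled[:cut], shuffled[cut:]  # D_i (70%), T_i (30%)
        model = fit(train)
        hits = sum(predict(model, x) == y for x, y in test)
        accuracies.append(hits / len(test))
    return accuracies


# Hypothetical stand-in model: one threshold halfway between the class means.
def fit_threshold(train):
    passed = [x for x, y in train if y == 0]
    failed = [x for x, y in train if y == 1]
    return (mean(passed) + mean(failed)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Synthetic data for illustration only: one made-up attribute per drive.
data_rng = random.Random(42)
rows = [(data_rng.gauss(0.0, 1.0), 0) for _ in range(900)]
rows += [(data_rng.gauss(4.0, 1.0), 1) for _ in range(100)]

scores = repeated_holdout(rows, fit_threshold, predict_threshold)
print(f"mean accuracy over {len(scores)} rounds: {mean(scores):.3f}")
```

Looking at the spread of the per-round accuracies, rather than a single split, gives a sense of how much the result depends on which drives happened to land in the test set.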
12. 12
An Example: Drive Failure Prediction
EXAMPLE 3
Making “by-drive” Predictions in real time
[Chart: Failure Prediction Example: drive parametric data vs. time, for healthy and failed drives]
FAILURE PREDICTION IS BASED ON AN ENSEMBLE OF ML ALGORITHMS
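The slides do not say how the ensemble combines its member models. One common approach is majority voting, sketched below as an assumption rather than a description of Seagate's system; the detectors, attribute names, and thresholds are all made up for illustration.

```python
from typing import Callable, Dict, List

# A detector maps one drive's attribute readings to True (at risk) or False.
Detector = Callable[[Dict[str, float]], bool]


def majority_vote(detectors: List[Detector], drive: Dict[str, float]) -> bool:
    """Flag the drive when more than half of the detectors flag it."""
    votes = sum(1 for detect in detectors if detect(drive))
    return votes * 2 > len(detectors)


# Hypothetical detectors over made-up attribute names and thresholds.
detectors: List[Detector] = [
    lambda d: d["reallocated_sectors"] > 50,
    lambda d: d["read_error_rate"] > 0.01,
    lambda d: d["temperature_c"] > 55,
]

drive = {"reallocated_sectors": 120, "read_error_rate": 0.02, "temperature_c": 40}
print(majority_vote(detectors, drive))  # True: two of three detectors flag it
```

Because each drive is scored independently from its own readings, a scheme like this fits the per-drive, real-time prediction the slide describes.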
13. 13
An Example: Real-Time Drive Failure Prediction
Failure prediction in data center production environment
[Figure: cluster “heat map” indicates drives at risk; near-term failure predicted]
Customer   Drives    Detection Rate, %   False Detection Rate, %
A          8,000     90                  <2.5
B          10,000    80                  <1.5
MODEL WORKS AND CAN BE TUNED TO SPECIFIC NEEDS
14. 14
Summary
• BIG DATA-DRIVEN QUALITY IS A REQUIREMENT FOR ANY LEADING HIGH-VOLUME TECHNOLOGY COMPANY
• SEAGATE’S BIG DATA-DRIVEN QUALITY COMBINES:
  • END-TO-END COHERENT, SCALABLE DATA COLLECTION AND RETENTION
  • BIG DATA ANALYTICS INFRASTRUCTURE (H/W + S/W) AND ALGORITHMS
  • DATA-DRIVEN QUALITY DECISION-MAKING PROCESS
  • ADVANCED PHYSICAL AND STATISTICAL MODELS
15. 15
Evolution of Data in Quality
THANK YOU!