Big Data Analytics for High-Quality
Big Data Storage
Andrei Khurshudov, PhD
Chief Technologist
Analytics and Insights
Seagate
2015
Andrei.Khurshudov@seagate.com
You May Know Seagate as a Hard Drive Manufacturer…
•  $14B in revenue
•  50K+ employees worldwide
•  1st to ship over 2 billion drives
•  Stores more than 40% of the world’s data
•  43,000 Cloud services clients worldwide
But We’re Also a Company That:
Relies heavily on PREDICTIVE ANALYTICS
Seagate Confidential
Presentation Objectives
•  Present our Big Data-driven Quality vision
•  Explain the need for Predictive Analytics in Data Storage
Design, Production, and Quality
Sources: Reinsel, David. “Where in the World Is Storage: A Look at Byte Density Across the Globe” IDC October 2013, IDC/EMC Digital Universe, April 2014 + Seagate Estimates
Relevant Numbers
MAKING SO MANY HIGHLY RELIABLE STORAGE DEVICES BECOMES
IMPOSSIBLE WITHOUT BIG DATA ANALYTICS
•  By 2020, 1 billion hard drives will be used in
cloud datacenters, highlighting the need for high-
quality data storage
•  Statistically, 1 total outage per DC is expected
every year
•  $700,000 is the average cost per incident
•  $8,000 is the average cost per minute of an unplanned
outage
•  Up to 10% of DC incidents are related to storage
[Chart labels: 56%; >1 billion drives in cloud; 2020. Source: Seagate Strategic Marketing and Research 2013]
Evolution of Data in Quality
BIG DATA-DRIVEN QUALITY IS THE LATEST EVOLUTIONARY STEP
All 5 units produced today work fine!
No data available
Let’s track a few parameters that seem important ...
Few charts and tables per week
1924... Let’s impose some control limits...
Things are getting too complex!
KBs of quality data per week
Automated production: SPC + Excel, Minitab, JMP, SQL DB...
MB/week – GB/week
E2E data collection, Field Telemetry + Machine Learning, Hadoop, Spark, ...
What is next?
TBs/day
TB/hour and beyond
1001011...
Data Universe: 2020
BUT THIS PRESENTATION IS ABOUT DIFFERENT DATA
•  44 ZB: Amount of data that will be created
•  13 ZB: Amount of data that will be useful if stored
•  6.5 ZB: Total amount of data that installed capacity will be able to hold worldwide
1 ZB = 10^3 EB = 10^6 PB = 10^9 TB = 10^12 GB = 10^21 bytes
Largest available drive today is 10 TB
6.5 ZB ~ 6.5 x 10^8 of the largest drives available today, or 650,000,000 drives
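The drive-count estimate can be checked with a few lines of integer arithmetic. This is a minimal sketch, assuming decimal (SI) units as in the conversion line; the function and variable names are illustrative and not from the presentation.

```python
# Decimal (SI) storage units, matching the slide's conversion line.
TB = 10**12   # bytes
ZB = 10**21   # bytes

def drives_needed(capacity_bytes: int, drive_bytes: int) -> int:
    """Number of drives needed to hold capacity_bytes, rounded up."""
    return -(-capacity_bytes // drive_bytes)  # ceiling division on integers

installed_capacity = 65 * ZB // 10   # 6.5 ZB, kept as an exact integer
largest_drive = 10 * TB              # largest drive cited on the slide

print(drives_needed(installed_capacity, largest_drive))
```

Working in integer bytes avoids floating-point rounding when the exponents are this large.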
Seagate Universe: 2015
WE HAVE ALL THE DATA WE NEED TO ENABLE BIG DATA ANALYTICS
•  175,000,000: Number of drives produced per year
•  100-200: Number of hours a drive spends in manufacturing tests
•  6: Drives produced every second
•  1,000+: Variables collected for each drive produced. Also, number of drives per product in extended test at any given time
•  100,000+: Variables collected for the incoming parts
•  30+: SMART variables collected for each drive in the field over time
•  10+: MBs of health logs collected by each drive itself in the field
•  1,000,000+: Data points collected from drive field telemetry
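The per-second production figure follows from the annual one. The sketch below assumes continuous production over a 365-day year, which the slide does not state, and uses the lower bound of 1,000 variables per drive; all names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year of continuous production

def drives_per_second(drives_per_year: int) -> float:
    """Average production rate implied by an annual drive count."""
    return drives_per_year / SECONDS_PER_YEAR

DRIVES_PER_YEAR = 175_000_000
VARIABLES_PER_DRIVE = 1_000            # lower bound; the slide says 1,000+

rate = drives_per_second(DRIVES_PER_YEAR)
values_per_year = DRIVES_PER_YEAR * VARIABLES_PER_DRIVE

print(f"{rate:.2f} drives per second")
print(f"at least {values_per_year:,} manufacturing values per year")
```

Under these assumptions the average rate is roughly 5.5 drives per second, consistent with the rounded figure of 6.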
Big Data-Driven Quality Concept: What is it?
All Elements Working Together As One System:
End-to-End Coherent, Scalable Data Collection and Retention
•  Drive Quality Engineering and Assurance Data
•  Drive Assembly and Manufacturing Test Data
•  Incoming Components Data
•  Ongoing Quality and Reliability Test Data
•  Returned Drives Test and Diagnostics Data
•  Customer Integration and Field Data (including Field Telemetry)
Big Data Analytics Infrastructure (H/W + S/W) and Algorithms
Big Data-Driven Quality Decision Layer
•  Predictive Life Models
•  Test auto-Diagnostics and Alerts
•  Predictive Financial Models
•  Robust Excursion Detection Algos
•  Ad-hoc Big Data Analytics Projects
•  In-situ Failure Prediction
Seagate Enterprise Analytics Infrastructure
Data sources
•  Business Systems (Sales, Logistics, Finance, etc.)
•  Factory, Quality Systems and Testers, Component Suppliers
•  Field Data (including telemetry)
EDW
•  140 TB (usable)
•  Loads 450 GB new data daily
•  Most factory data 100%, some sampled
Hadoop
•  Clusters: Enterprise Hadoop, Local Research Hadoop
•  Capacities shown: 1.5 PB, 3.5 PB
•  Loads 1.5 TB new data daily
•  Much longer retention of Factory Data
•  100% of factory data
Analytics and reporting
•  Dashboards & Visualization (Tableau)
•  Advanced Statistical Analytics Tool Suite
•  Standard Reporting (Business Objects)
What Are We Predicting/Explaining/Detecting? Examples
THESE TASKS REQUIRE ADVANCED ALGORITHMS, END-TO-END DATA COLLECTION,
AND POWERFUL ANALYTICS INFRASTRUCTURE
EXAMPLE 1
Populations: P (Passers), F (Failures); 1 drive ~ 1,000 attributes
Given: 5M drives in total, 5K Fail (0.1%)
Task: Explain the difference between P and F and build a predictive model
CONSTRAINT: Model misregistration rate << Detection rate
EXAMPLE 2
Populations: P (Passers), F (Failures) @ t0, F (Failures) @ t1; 1 drive ~ 1,000 attributes
Given: 5M drives in total, 5K Fail @ t0
Task: Predict future drive failures at t1
1)  “weak model” ~ Predict % of the population failed at t1
2)  “strong model” ~ Predict individual drives to fail at t1
CONSTRAINT: Model misregistration rate << Detection rate
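Base-rate arithmetic on the slide's population shows why the constraint matters. This is a sketch under one assumption: that the "miss-registration rate" is the fraction of passers the model wrongly flags as failures, which the slide does not define. The detection and false-alarm rates below are hypothetical operating points, not results from the presentation.

```python
def flagged_counts(total, failures, detection_rate, false_alarm_rate):
    """Return (true detections, false alarms) for a drive population.

    detection_rate: fraction of real failures the model flags.
    false_alarm_rate: fraction of passers the model wrongly flags
    (assumed reading of the slide's "miss-registration rate").
    """
    passers = total - failures
    return failures * detection_rate, passers * false_alarm_rate

def precision(true_hits, false_alarms):
    """Share of flagged drives that are real failures."""
    return true_hits / (true_hits + false_alarms)

TOTAL, FAILURES = 5_000_000, 5_000   # population from the slide (0.1% fail)

# Hypothetical operating points: same detection rate, two false-alarm rates.
for far in (0.01, 0.0001):
    hits, alarms = flagged_counts(TOTAL, FAILURES, 0.90, far)
    print(f"false-alarm rate {far:.2%}: {hits:.0f} hits, "
          f"{alarms:.0f} false alarms, precision {precision(hits, alarms):.1%}")
```

Because passers outnumber failures by about 1,000 to 1, even a 1% false-alarm rate flags far more healthy drives than failing ones, so the false-alarm rate has to sit well below the detection rate for flagged drives to be mostly real failures.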
An Example: Random Forest
Advantages
1) “Test data set” is available for model verification
2) “Confusion matrix” is available to check the goodness of the model
3) High robustness for low-quality and incomplete data sets
Disadvantages
1) “Black box” model – difficult to understand its predictions
Example of Model Self-Test: Failure Prediction
                  Passed    Failed
Pred. Correct     2070      391
Pred. Wrong       3         20
False Rate %      0.1%      4.9%
Correct rate %    99.9%     95.1%
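The percentage rows of the self-test table can be recomputed from the counts. The sketch assumes each column is the true class of the drive and that rates are taken within a column; this reading reproduces the slide's figures, but the names are illustrative.

```python
# Counts from the self-test table; each key is the true class of the drive.
table = {
    "Passed": {"correct": 2070, "wrong": 3},
    "Failed": {"correct": 391, "wrong": 20},
}

def rates(correct: int, wrong: int) -> tuple[float, float]:
    """Return (false rate %, correct rate %) within one true class."""
    total = correct + wrong
    return 100 * wrong / total, 100 * correct / total

for label, counts in table.items():
    false_rate, correct_rate = rates(counts["correct"], counts["wrong"])
    print(f"{label}: false {false_rate:.1f}%, correct {correct_rate:.1f}%")
```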
EXAMPLE 1
[Diagram: Original Data is randomized into N splits. Split i provides a subset Di (70%) for tree Ti and Test Data (30%) for Decision and Accuracy; shown for T1, T2, ..., TN.]
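The repeated 70/30 randomization in the diagram can be sketched with the standard library. This is an illustration only: the single-threshold "model" is a stand-in for training one tree, the data is synthetic, and a full random forest would also resample attributes and combine the trees' votes. All names are hypothetical.

```python
import random

def repeated_splits(records, n_splits, train_fraction=0.7, seed=0):
    """Yield (train, test) lists from n_splits independent shuffles."""
    rng = random.Random(seed)
    cut = int(len(records) * train_fraction)
    for _ in range(n_splits):
        shuffled = records[:]      # copy so the original order is untouched
        rng.shuffle(shuffled)
        yield shuffled[:cut], shuffled[cut:]

def accuracy(predict, test):
    """Fraction of (features, label) pairs that predict() labels correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)

def fit_threshold(train):
    """Stand-in for training one tree: learn a single cut-off value."""
    failed = [x[0] for x, y in train if y]
    cutoff = min(failed) if failed else float("inf")
    return lambda x: x[0] >= cutoff

# Synthetic data: one attribute per drive; values above 0.8 are labelled failed.
data = [((i / 100,), i / 100 > 0.8) for i in range(100)]

scores = [accuracy(fit_threshold(train), test)
          for train, test in repeated_splits(data, n_splits=5)]
print(scores)
```

Holding out 30% on every split is what makes a test data set and a confusion matrix available for each model, as the advantages list notes.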
An Example: Drive Failure Prediction
[Plot: Failure Prediction Example, drive parametric data vs. time; legend: Healthy, Failed]
FAILURE PREDICTION IS BASED ON AN ENSEMBLE OF ML ALGORITHMS
Making “by-drive” Predictions in real time
EXAMPLE 3
An Example: Real-Time Drive Failure Prediction
Near-term failure predicted
Cluster “heat map” indicates drives at risk
Customer    Drives    Detection Rate, %    False Detection Rate, %
A           8,000     90                   <2.5
B           10,000    80                   <1.5
Failure prediction in data center production
environment
MODEL WORKS AND CAN BE TUNED TO SPECIFIC NEEDS
Summary
•  BIG DATA-DRIVEN QUALITY IS A REQUIREMENT FOR ANY LEADING
HIGH-VOLUME TECHNOLOGY COMPANY
•  SEAGATE’S BIG DATA-DRIVEN QUALITY COMBINES:
•  END-TO-END COHERENT, SCALABLE DATA COLLECTION AND RETENTION
•  BIG DATA ANALYTICS INFRASTRUCTURE (H/W + S/W) AND ALGORITHMS
•  DATA-DRIVEN QUALITY DECISION-MAKING PROCESS
•  ADVANCED PHYSICAL AND STATISTICAL MODELS
THANK YOU!
Presentation_Final

  • 1. Big Data Analytics for High-Quality Big Data Storage
      Andrei Khurshudov, PhD, Chief Technologist, Analytics and Insights, Seagate, 2015
      Andrei.Khurshudov@seagate.com
  • 2. You May Know Seagate as a Hard Drive Manufacturer…
      • $14B in revenue
      • 50K+ employees worldwide
      • 1st to ship over 2 billion drives
      • Stores more than 40% of the world’s data
      • 43,000 cloud services clients worldwide
      But We’re Also a Company That Relies Heavily on PREDICTIVE ANALYTICS
  • 3. Presentation Objectives (Seagate Confidential)
      • Present our Big Data-driven Quality vision
      • Explain the need for Predictive Analytics in Data Storage Design, Production, and Quality
  • 4. Relevant Numbers
      MAKING SO MANY HIGHLY RELIABLE STORAGE DEVICES BECOMES IMPOSSIBLE WITHOUT BIG DATA ANALYTICS
      • By 2020, 1 billion hard drives will be used in cloud datacenters, highlighting the need for high-quality data storage
      • Statistically, 1 total outage per DC is expected every year
      • $700,000 is the average cost per incident
      • $8,000 is the average cost per minute of an unplanned outage
      • Up to 10% of DC incidents are related to storage
      Chart callouts: 56%; >1 billion drives in cloud, 2020 (Source: Seagate Strategic Marketing and Research 2013)
      Sources: Reinsel, David. “Where in the World Is Storage: A Look at Byte Density Across the Globe” IDC October 2013, IDC/EMC Digital Universe, April 2014 + Seagate Estimates
  • 5. Evolution of Data in Quality
      BIG DATA-DRIVEN QUALITY IS THE LATEST EVOLUTIONARY STEP
      Timeline labels, in slide order:
      • “All 5 units produced today work fine!” No data available
      • “Let’s track a few parameters that seem important ...” Few charts and tables per week
      • 1924... “Let’s impose some control limits...”
      • “Things are getting too complex!” KBs of quality data per week
      • Automated production: SPC + Excel, Minitab, JMP, SQL DB... MB/week – GB/week
      • E2E data collection, Field Telemetry + Machine Learning, Hadoop, Spark, ... TBs/day
      • “What is next?” TBs/hour and beyond
  • 6. Data Universe: 2020
      • 44 ZB: amount of data that will be created
      • 13 ZB: amount of data that will be useful if stored
      • 6.5 ZB: total amount of data that installed capacity will be able to hold worldwide
      1 ZB = 10^3 EB = 10^6 PB = 10^9 TB = 10^12 GB = 10^21 bytes
      Largest available drive today is 10 TB; 6.5 ZB ~ 6.5 x 10^8 of the largest drives available today, or 650,000,000 drives
      BUT THIS PRESENTATION IS ABOUT DIFFERENT DATA
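The drive-count estimate on slide 6 can be checked with a few lines of arithmetic. This is only a sketch of that calculation; the variable names are illustrative and the figures are the ones quoted on the slide.

```python
# Decimal (SI) storage units, as defined on the slide: 1 ZB = 10^21 bytes.
TB = 10**12
EB = 10**18

# 6.5 ZB written as 6500 EB so the arithmetic stays in exact integers.
installed_capacity_bytes = 6500 * EB
largest_drive_bytes = 10 * TB  # "largest available drive today", per the slide

drives_needed = installed_capacity_bytes // largest_drive_bytes
print(f"{drives_needed:,} drives of 10 TB each")

# Share of the "useful if stored" data (13 ZB) that installed capacity could hold.
useful_bytes = 13000 * EB
capacity_share = installed_capacity_bytes / useful_bytes
print(f"capacity covers {capacity_share:.0%} of useful data")
```

The result matches the slide's 650,000,000 drives, and it also shows that the projected capacity covers about half of the data the slide calls useful.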
  • 7. Seagate Universe: 2015
      • 175,000,000: number of drives produced per year
      • 6: drives produced every second
      • 100-200: number of hours a drive spends in manufacturing tests
      • 1,000+: variables collected for each drive produced; also, number of drives per product in extended test at any given time
      • 100,000+: variables collected for the incoming parts
      • 30+: SMART variables collected for each drive in the field over time
      • 10+ MBs: health logs collected by each drive itself in the field
      • 1,000,000+: data points collected from drive field telemetry
      WE HAVE ALL THE DATA WE NEED TO ENABLE BIG DATA ANALYTICS
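As a rough cross-check of the slide 7 figures, the yearly production number can be converted to a per-second rate. The sketch below assumes production is spread evenly over a 365-day year, which the slide does not state. It treats "1,000+" as exactly 1,000 to get a lower bound.

```python
DRIVES_PER_YEAR = 175_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes continuous, even production

drives_per_second = DRIVES_PER_YEAR / SECONDS_PER_YEAR
print(f"{drives_per_second:.2f} drives per second")

# Lower bound on manufacturing variables recorded per year,
# taking "1,000+ variables per drive" as exactly 1,000.
VARIABLES_PER_DRIVE = 1_000
values_per_year = DRIVES_PER_YEAR * VARIABLES_PER_DRIVE
print(f"{values_per_year:,} per-drive values per year (lower bound)")
```

Under that assumption the rate comes out between 5 and 6 drives per second, consistent with the slide's rounded figure of 6.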
  • 8. Big Data-Driven Quality Concept: What is it?
      All Elements Working Together As One System:
      • End-to-End Coherent, Scalable Data Collection and Retention
      • Big Data Analytics Infrastructure (H/W + S/W) and Algorithms
      • Big Data-Driven Quality Decision Layer
      Data collected:
      • Drive Quality Engineering and Assurance Data
      • Drive Assembly and Manufacturing Test Data
      • Incoming Components Data
      • Ongoing Quality and Reliability Test Data
      • Returned Drives Test and Diagnostics Data
      • Customer Integration and Field Data (including Field Telemetry)
      Models and algorithms:
      • Predictive Life Models
      • Test auto-Diagnostics and Alerts
      • Predictive Financial Models
      • Robust Excursion Detection Algos
      • Ad-hoc Big Data Analytics Projects
      • In-situ Failure Prediction
  • 9. Seagate Enterprise Analytics Infrastructure
      Data sources:
      • Business Systems (Sales, Logistics, Finance, etc.)
      • Factory, Quality Systems and Testers, Component Suppliers
      • Field Data (including telemetry)
      EDW: 140 TB (usable); loads 450 GB new data daily; most factory data 100%, some sampled
      Hadoop: Enterprise Hadoop and Local Research Hadoop (1.5 PB, 3.5 PB); loads 1.5 TB new data daily; much longer retention of factory data; 100% of factory data
      Tools:
      • Dashboards & Visualization (Tableau)
      • Advanced Statistical Analytics Tool Suite
      • Standard Reporting (Business Objects)
  • 10. What Are We Predicting/Explaining/Detecting? Examples
      EXAMPLE 1: P (Passers) vs. F (Failures); 1 drive ~ 1,000 attributes
      • Given: 5M drives in total, 5K fail (0.1%)
      • Task: Explain the difference between P and F and build a predictive model
      • CONSTRAINT: Model mis-registration rate << Detection rate
      EXAMPLE 2: P (Passers) and F (Failures) @ t0, F (Failures) @ t1; 1 drive ~ 1,000 attributes
      • Given: 5M drives in total, 5K fail @ t0
      • Task: Predict future drive failures at t1
          1) “weak model” ~ Predict % of the population failed at t1
          2) “strong model” ~ Predict individual drives to fail at t1
      • CONSTRAINT: Model mis-registration rate << Detection rate
      THESE TASKS REQUIRE ADVANCED ALGORITHMS, END-TO-END DATA COLLECTION, AND POWERFUL ANALYTICS INFRASTRUCTURE
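The constraint on slide 10 matters because failures are so rare (5K out of 5M). The sketch below assumes that the slide's "mis-registration rate" means the fraction of healthy drives wrongly flagged as failing; that reading is an interpretation, not something the slide defines. The detection and false-alarm rates are made-up values chosen only to show the effect of class imbalance.

```python
TOTAL_DRIVES = 5_000_000
FAILED_DRIVES = 5_000
failure_fraction = FAILED_DRIVES / TOTAL_DRIVES  # 0.1%, as on the slide


def flagged_breakdown(total, failures, detection_rate, false_alarm_rate):
    """Return (true alarms, false alarms, share of flagged drives that really fail)."""
    passers = total - failures
    true_alarms = failures * detection_rate
    false_alarms = passers * false_alarm_rate
    precision = true_alarms / (true_alarms + false_alarms)
    return true_alarms, false_alarms, precision


# Hypothetical operating points: 90% detection with two different false-alarm rates.
for false_alarm_rate in (0.01, 0.0001):
    tp, fp, precision = flagged_breakdown(
        TOTAL_DRIVES, FAILED_DRIVES, detection_rate=0.9,
        false_alarm_rate=false_alarm_rate,
    )
    print(f"false-alarm rate {false_alarm_rate:.2%}: "
          f"{tp:,.0f} true alarms, {fp:,.0f} false alarms, "
          f"precision {precision:.1%}")
```

With a 1% false-alarm rate, false alarms outnumber real detections roughly eleven to one, so fewer than one in ten flagged drives is actually failing. With a rate a hundred times smaller, about nine in ten flagged drives are real failures. That is why the false-alarm rate has to be far below the detection rate.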
  • 11. An Example: Random Forest (EXAMPLE 1)
      Advantages
          1) “Test data set” is available for model verification
          2) “Confusion matrix” is available to check the goodness of the model
          3) High robustness for low-quality and incomplete data sets
      Disadvantages
          1) “Black box” model: difficult to understand its predictions
      Example of Model Self-Test: Failure Prediction
                          Passed    Failed
          Pred. Correct   2070      391
          Pred. Wrong     3         20
          False Rate %    0.1%      4.9%
          Correct rate %  99.9%     95.1%
      Diagram labels: Original Data; Randomize; D1 (70%), D2 (70%), ..., DN (70%); T1, T2, ..., TN; Test Data (30%); Decision and Accuracy
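The percentages in the slide 11 self-test table follow directly from its counts, as the short check below shows. The `repeated_splits` helper is a hypothetical sketch of the randomized 70%/30% partitioning suggested by the diagram labels. It does not train a random forest and is not the author's implementation.

```python
import random

# Counts copied from the slide's self-test table.
counts = {
    "Passed": {"correct": 2070, "wrong": 3},
    "Failed": {"correct": 391, "wrong": 20},
}


def rates(correct, wrong):
    """Return (correct rate %, false rate %) for one class."""
    total = correct + wrong
    return 100 * correct / total, 100 * wrong / total


for label, c in counts.items():
    correct_pct, false_pct = rates(c["correct"], c["wrong"])
    print(f"{label}: correct {correct_pct:.1f}%, false {false_pct:.1f}%")


def repeated_splits(records, n_splits, train_fraction=0.7, seed=0):
    """Yield n_splits independent random (train, test) partitions of records."""
    rng = random.Random(seed)
    for _ in range(n_splits):
        shuffled = list(records)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        yield shuffled[:cut], shuffled[cut:]
```

Each `(train, test)` pair could feed one model fit and one accuracy check, with results compared across splits.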
  • 12. An Example: Drive Failure Prediction (EXAMPLE 3)
      Failure Prediction Example: Drive parametric data vs. Time (chart legend: Healthy, Failed)
      Making “by-drive” predictions in real time
      FAILURE PREDICTION IS BASED ON AN ENSEMBLE OF ML ALGORITHMS
  • 13. An Example: Real-Time Drive Failure Prediction
      Failure prediction in data center production environment
          Customer   Drives    Detection Rate, %   False Detection Rate, %
          A          8,000     90                  <2.5
          B          10,000    80                  <1.5
      Figure labels: Near-term failure predicted; Cluster “heat map” indicates drives at risk
      MODEL WORKS AND CAN BE TUNED TO SPECIFIC NEEDS
  • 14. Summary
      • BIG DATA-DRIVEN QUALITY IS A REQUIREMENT FOR ANY LEADING HIGH-VOLUME TECHNOLOGY COMPANY
      • SEAGATE’S BIG DATA-DRIVEN QUALITY COMBINES:
          • END-TO-END COHERENT, SCALABLE DATA COLLECTION AND RETENTION
          • BIG DATA ANALYTICS INFRASTRUCTURE (H/W + S/W) AND ALGORITHMS
          • DATA-DRIVEN QUALITY DECISION-MAKING PROCESS
          • ADVANCED PHYSICAL AND STATISTICAL MODELS
  • 15. THANK YOU! Evolution of Data in Quality (timeline repeated from slide 5)