Disk Health Prediction for Ceph
Brian Jeng, brian.jeng@prophetstor.com
Copyright © 2018 by ProphetStor Data Services, Inc. All Rights Reserved.
Ceph Pain Points
• Rebalancing drops performance of a cluster for an extended period of time
• Limited visibility of OSD health before failure
• Limited predictive analytics
• Utilizes programmatic intelligence rather than machine learning
[Figure: slow drives and failed drives]
Major Stability Problems: Cluster

Problem: Backfill & recovery impacting client IO
Impact: OSD map changes due to loss of a disk, resulting in PG peering and backfilling.
Result: Clients encounter impeded and slow IO.

Problem: Unbalanced data distribution
Impact: Data on physical disks isn’t evenly distributed. The cluster may be 50% full while some disks are at 90%.
Result: Backfill isn’t always able to complete.

Problem: Slow disk impacting client IO
Impact: A single slow (impaired, not dead) OSD can severely impact many clients until it’s ejected from the cluster.
Result: Clients have slow or blocked IO.

Stabilizing Petabyte Ceph Cluster in OpenStack Cloud, Yuming Ma, Cisco, 10/18/2016
https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
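The "50% full, but some disks are at 90%" situation can be illustrated with a minimal Python sketch. The function name, the four-disk cluster, and the 0.90 threshold are hypothetical values chosen for illustration; they are not taken from the Cisco case or from Ceph itself.

```python
def utilization_report(used, capacity, full_ratio=0.90):
    """Return (cluster fill ratio, per-disk fill ratios, disks at or above full_ratio).

    `used` and `capacity` map a disk name to an amount in the same unit.
    `full_ratio` is an illustrative threshold, not a Ceph default.
    """
    if used.keys() != capacity.keys():
        raise ValueError("used and capacity must describe the same disks")
    per_disk = {disk: used[disk] / capacity[disk] for disk in used}
    cluster = sum(used.values()) / sum(capacity.values())
    hot = sorted(disk for disk, ratio in per_disk.items() if ratio >= full_ratio)
    return cluster, per_disk, hot


# Hypothetical four-disk cluster; amounts are in arbitrary equal units.
used = {"osd.0": 90, "osd.1": 40, "osd.2": 35, "osd.3": 35}
capacity = {"osd.0": 100, "osd.1": 100, "osd.2": 100, "osd.3": 100}

cluster, per_disk, hot = utilization_report(used, capacity)
print(f"cluster {cluster:.0%}, hot disks: {hot}")
```

The cluster-wide ratio hides the one nearly full disk, which is why backfill onto that disk may not be able to complete even though the cluster as a whole has room.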
Evolve Ceph Intelligence

Programmatic Intelligence (Past)
• Pre-defined rules
• Reactive solution

Machine Learning (Future)
• Self-learning & self-growing knowledge
• Proactive solution
Preemptive Detection of Disk Failure
• Problem
  • RBD image data is distributed to all disks, but a single disk failure can impact critical data IO
• Solution
  • Predict future disk failure (proactive)
• DiskProphet Solution
  • Disk near-failure likelihood prediction
  • Disk life-expectancy prediction
  • Actions to optimize Ceph

[Figure: IOPS over time for three cases: normal workload; 1 OSD failed, Ceph’s rebalancing; 1 OSD failure predicted, No-Impact Recovery by DiskProphet]
Stabilizing Petabyte Ceph Cluster in OpenStack Cloud, Yuming Ma, Cisco, 10/18/2016
https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
Benefits and Resolutions

              Without Disk Prediction          With Disk Prediction
Impact Time   Days (subject to cluster size)   90% less performance degradation
Real World Prediction Results
Useful Reference: Stabilizing PB Ceph Cluster (a Cisco case)
https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
Duration: 2016/01/01 – 2016/03/31 (90 days)
Number of Drives: over 20,000
Average Accuracy: 96.1%, Recall: 97%
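The slide does not say how accuracy and recall were computed. As a sketch, assuming the standard confusion-matrix definitions, the two figures relate to prediction outcomes as shown below. The counts are hypothetical and are not the data behind the reported 96.1% and 97%.

```python
def accuracy_and_recall(tp, fp, tn, fn):
    """Standard definitions over a confusion matrix of disk-failure predictions.

    tp: failing disks predicted to fail      fp: healthy disks predicted to fail
    tn: healthy disks predicted healthy      fn: failing disks predicted healthy
    """
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0:
        raise ValueError("need at least one sample and one actual failure")
    accuracy = (tp + tn) / total   # share of all predictions that were right
    recall = tp / (tp + fn)        # share of real failures that were caught
    return accuracy, recall


# Hypothetical counts for 100 drives, for illustration only.
acc, rec = accuracy_and_recall(tp=8, fp=2, tn=88, fn=2)
print(f"accuracy {acc:.1%}, recall {rec:.1%}")
```

Recall matters most for this use case: a missed failure (fn) means the cluster still goes through an unplanned rebalance, while a false alarm (fp) only costs an early replacement.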
Disk Prediction Plugin for Ceph
• Add Disk Prediction Plugin Service - https://github.com/ceph/ceph/pull/22239
• Add OSD Device Health Prediction - https://github.com/ceph/ceph/pull/22785
• DeviceHealth Module - https://github.com/ceph/ceph/pull/22479
Architecture
[Diagram components: DiskPrediction Plugin (local prediction module), Cloud Prediction Server, DeviceHealth Plugin, mgr, OSD. Labeled flows: device health, Ceph metrics, prediction result.]
Health monitor actions:
• OSD health warning
• Mark OSD out
• …
Collected data
• Ceph cluster info/health state
• Ceph mon/osd/mds metadata
• Ceph osd/pool performance dump
• Ceph mon/osd/mds/osd/pool correlation
• OSD physical device health data (collected by the devicehealth plugin)
Disk Prediction Modes
• Cloud: predicted by the cloud prediction server.
  • All device health data
  • Ceph performance counter data
• Local: predicted by the plugin prediction module.
  • Some attributes of the device health data.
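The difference between the two modes is mainly which data feeds the predictor. A minimal Python sketch of that selection follows. The function, the dictionary layout, and the attribute names are hypothetical and do not reflect the plugin's actual interface.

```python
def build_prediction_input(mode, device_health, perf_counters, local_attributes):
    """Pick the data handed to the predictor for a given mode.

    cloud: all device health data plus Ceph performance counter data.
    local: only a subset of device health attributes.
    """
    if mode == "cloud":
        return {
            "device_health": dict(device_health),
            "perf_counters": dict(perf_counters),
        }
    if mode == "local":
        subset = {k: v for k, v in device_health.items() if k in local_attributes}
        return {"device_health": subset}
    raise ValueError(f"unknown prediction mode: {mode!r}")


# Hypothetical sample values and attribute names.
health = {"reallocated_sector_ct": 0, "power_on_hours": 12000, "temperature_celsius": 34}
perf = {"op_latency_ms": 3.2}
local_only = {"reallocated_sector_ct", "power_on_hours"}

local_input = build_prediction_input("local", health, perf, local_only)
cloud_input = build_prediction_input("cloud", health, perf, local_only)
print(sorted(local_input["device_health"]))
print(sorted(cloud_input))
```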
DiskPrediction feedback
• DiskPrediction plugin writes the prediction result into the device info
• Ceph device set life expectancy command
  # ceph device set-life-expectancy <devid> <from> {<to>}
• Ceph device show life expectancy
  # ceph device ls
  DEVICE                        HOST:DEV       DAEMONS  LIFE EXPECTANCY
  TOSHIBA_DT01ACA050_44T2E4LAS  devcnode1:sdb  osd.0    >6w
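As a rough sketch of how a prediction could be turned into the command shown above, the Python below builds the argument list for `ceph device set-life-expectancy` and picks one of the health monitor actions. The ISO date format for `<from>` and `<to>`, the helper names, and the week thresholds are assumptions made for illustration; the slides do not specify them.

```python
from datetime import date, timedelta


def life_expectancy_argv(devid, start, end=None):
    """Build argv for: ceph device set-life-expectancy <devid> <from> {<to>}.

    Dates are rendered as YYYY-MM-DD, which is an assumed format.
    """
    if end is not None and end < start:
        raise ValueError("end of the life-expectancy window precedes its start")
    argv = ["ceph", "device", "set-life-expectancy", devid, start.isoformat()]
    if end is not None:
        argv.append(end.isoformat())
    return argv


def choose_action(weeks_left, warn_weeks=6, out_weeks=2):
    """Map predicted remaining life to an action; thresholds are hypothetical."""
    if weeks_left < out_weeks:
        return "mark_out"
    if weeks_left < warn_weeks:
        return "health_warning"
    return "none"


# A fixed "today" keeps the example reproducible.
today = date(2018, 7, 1)
argv = life_expectancy_argv("TOSHIBA_DT01ACA050_44T2E4LAS", today + timedelta(weeks=6))
print(" ".join(argv))
print(choose_action(weeks_left=1))
```

The argument list could be passed to `subprocess.run` on a node with Ceph admin credentials; the sketch only prints it so that it runs without a cluster.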
Beyond Disk Failure Prediction
DiskProphet Features
01 Disk Health and Failure Prediction
02 Performance Baseline, Degradation & Capacity Prediction
03 Bad Sector, Slow Drive & Anomaly Detection
04 Correlation Map with Ceph
DiskProphet Versions for Ceph

Feature                          Community On-Prem  Community Cloud  Commercial Edition
Disk failure prediction          Yes                Yes              Yes
Accuracy                         ★★★★               ★★★★★            ★★★★★
Confidence                       ★★★                ★★★★★            ★★★★★
Performance prediction           No                 No               Yes
Anomaly & Bad Sector detection   No                 No               Yes
Replacement time                 No                 No               Yes
Correlation analysis with Ceph   No                 No               Yes
Availability                     Trial              Trial            Now
Services

DiskProphet Cloud
• Annual subscription fee
• Charges by nodes or disks
• Starts from 10 nodes or 50 disks

On-Premise
• Professional service charge for deployment
• Annual subscription fee

Integration Services
Thank You

Disk health prediction for Ceph

  • 1. Disk Health Prediction for Ceph. Brian Jeng, brian.jeng@prophetstor.com. Copyright © 2018 by ProphetStor Data Services, Inc. All Rights Reserved.
  • 2. Ceph Pain Points (slow drives, failed drives)
      • Rebalancing degrades cluster performance for an extended period of time
      • Limited visibility of OSD health before failure
      • Limited predictive analytics
      • Utilizes programmatic intelligence rather than machine learning
  • 3. Major Stability Problems: Cluster
      • Backfill & recovery impacting client IO: OSD map changes due to loss of a disk, resulting in PG peering and backfilling. Result: clients encounter impeded and slow IO.
      • Unbalanced data distribution: data on physical disks isn’t evenly distributed. The cluster may be 50% full while some disks are at 90%. Result: backfill isn’t always able to complete.
      • Slow disk impacting client IO: a single slow (impaired, not dead) OSD can severely impact many clients until it’s ejected from the cluster. Result: clients have slow or blocked IO.
      • Source: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud, Yuming Ma, Cisco, 10/18/2016, https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
  • 4. Evolve Ceph Intelligence
      • Past, programmatic intelligence: pre-defined rules; reactive solution
      • Future, machine learning: self-learning & self-growing knowledge; proactive solution
  • 5. Preemptive Detection of Disk Failure
      • Problem: RBD image data is distributed to all disks, but a single disk failure can impact critical data IO
      • Solution: predict future disk failure (proactive)
      • DiskProphet solution: disk near-failure likelihood prediction; disk life-expectancy prediction; actions to optimize Ceph
      • Chart (IOPS over time): normal workload; 1 OSD failed, Ceph’s rebalancing; 1 OSD failure predicted, no-impact recovery by DiskProphet
      • Source: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud, Yuming Ma, Cisco, 10/18/2016, https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
  • 6. Benefits and Resolutions
      • Impact time without disk prediction: days (subject to cluster size)
      • With disk prediction: 90% less performance degradation
  • 7. Real World Prediction Results
      • Useful reference: Stabilizing PB Ceph Cluster (a Cisco case), https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
      • Duration: 2016/01/01 – 2016/03/31 (90 days)
      • Number of drives: over 20,000
      • Average accuracy 96.1%, recall 97%
  • 8. Disk Prediction Plugin for Ceph
      • Add Disk Prediction Plugin Service: https://github.com/ceph/ceph/pull/22239
      • Add OSD Device Health Prediction: https://github.com/ceph/ceph/pull/22785
      • DeviceHealth Module: https://github.com/ceph/ceph/pull/22479
  • 9. Architecture (diagram)
      • Components: mgr, OSD, DeviceHealth plugin, DiskPrediction plugin (local prediction module), cloud prediction server
      • Data flows: device health, Ceph metrics, prediction result
      • Health monitor actions: OSD health warning, mark OSD out, …
  • 10. Collected data
      • Ceph cluster info/health state
      • Ceph mon/osd/mds metadata
      • Ceph osd/pool performance dump
      • Ceph mon/osd/mds/pool correlation
      • OSD physical device health data (collected by the devicehealth plugin)
  • 11. Disk Prediction Modes
      • Cloud: predicted by the cloud prediction server, using all device health data and Ceph performance counter data
      • Local: predicted by the plugin prediction module, using some attributes of the device health data
  • 12. DiskPrediction feedback
      • The DiskPrediction plugin writes the prediction result into the device info
      • Set a device’s life expectancy:
          # ceph device set-life-expectancy <devid> <from> {<to>}
      • Show life expectancy:
          # ceph device ls
          DEVICE                        HOST:DEV       DAEMONS  LIFE EXPECTANCY
          TOSHIBA_DT01ACA050_44T2E4LAS  devcnode1:sdb  osd.0    >6w
  • 13. Beyond Disk Failure Prediction: DiskProphet Features
      • 01 Disk Health and Failure Prediction
      • 02 Performance Baseline, Degradation & Capacity Prediction
      • 03 Bad Sector, Slow Drive & Anomaly Detection
      • 04 Correlation Map with Ceph
  • 14. DiskProphet Versions for Ceph
      Feature                          Community On-Prem  Community Cloud  Commercial Edition
      Disk failure prediction          Yes                Yes              Yes
      Accuracy                         ★★★★               ★★★★★            ★★★★★
      Confidence                       ★★★                ★★★★★            ★★★★★
      Performance prediction           No                 No               Yes
      Anomaly & bad sector detection   No                 No               Yes
      Replacement time                 No                 No               Yes
      Correlation analysis with Ceph   No                 No               Yes
      Availability                     Trial              Trial            Now
  • 15. Services
      • DiskProphet Cloud: annual subscription fee; charges by nodes or disks; starts from 10 nodes or 50 disks
      • On-Premise: professional service charge for deployment; annual subscription fee
      • Integration Services
  • 20. Thank You. Brian Jeng, brian.jeng@prophetstor.com
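
Slide 12 shows a life expectancy column in the `ceph device ls` listing, and slide 9 lists health monitor actions such as an OSD health warning and marking an OSD out. The Python sketch below illustrates how the two could relate. It is not the plugin's code. The value format (`>6w`, with an assumed `d` suffix alongside `w`), the four-column row layout, and the function names and thresholds (`warn_days`, `mark_out_days`) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Assumed unit suffixes; only "w" appears in the example on slide 12.
_DAYS_PER_UNIT = {"d": 1, "w": 7}


@dataclass
class LifeExpectancy:
    days: int          # parsed length in days
    lower_bound: bool  # True when written as ">N", read here as "at least N"


def parse_life_expectancy(text: str) -> LifeExpectancy:
    """Parse strings such as '>6w' or '3w' (format inferred from the slide)."""
    match = re.fullmatch(r"(>?)(\d+)([dw])", text.strip())
    if match is None:
        raise ValueError(f"unrecognized life expectancy: {text!r}")
    bound, number, unit = match.groups()
    return LifeExpectancy(int(number) * _DAYS_PER_UNIT[unit], bound == ">")


def parse_device_row(row: str) -> dict:
    """Split one data row of the listing; assumes four whitespace-separated columns."""
    device, host_dev, daemons, life = row.split()
    return {
        "device": device,
        "host_dev": host_dev,
        "daemons": daemons,
        "life": parse_life_expectancy(life),
    }


def choose_action(life: LifeExpectancy,
                  warn_days: int = 84,
                  mark_out_days: int = 28) -> str:
    """Pick a hypothetical health action; the thresholds are illustrative only."""
    if life.lower_bound:
        # An open-ended "at least N" estimate never triggers an action here.
        return "ok"
    if life.days < mark_out_days:
        return "mark_out"
    if life.days < warn_days:
        return "warn"
    return "ok"


if __name__ == "__main__":
    row = parse_device_row("TOSHIBA_DT01ACA050_44T2E4LAS devcnode1:sdb osd.0 >6w")
    print(row["device"], choose_action(row["life"]))
```

Open-ended estimates are treated conservatively in this sketch: a value such as `>6w` only states a minimum, so no warning or mark-out is derived from it. A real deployment would take its thresholds from the cluster's own configuration and not from the defaults assumed here.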