Disk Health Prediction for Ceph
Brian Jeng, brian.jeng@prophetstor.com
Copyright © 2018 by ProphetStor Data Services, Inc. All Rights Reserved.
Ceph Pain Points
• Rebalancing drops performance of a cluster for an extended period of time
• Limited visibility of OSD health before failure
• Limited predictive analytics
• Utilizes programmatic intelligence rather than machine learning
[Figure labels: Slow drives, Failed drives]
Major Stability Problems: Cluster
Problem: Backfill & Recovery impacting client IO
Impact: OSD map changes due to loss of disk, resulting in PG peering and backfilling.
Results: Clients encounter impeded and slow IO.
Problem: Unbalanced data distribution
Impact: Data on physical disks isn’t evenly distributed. Cluster may be 50% full, but some disks are at 90%.
Results: Backfill isn’t always able to complete.
Problem: Slow disk impacting client IO
Impact: A single slow (impaired, not dead) OSD can severely impact many clients until it’s ejected from the cluster.
Results: Clients have slow or blocked IO.
https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
Stabilizing Petabyte Ceph Cluster in OpenStack Cloud, Yuming Ma, Cisco
10/18/2016
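The unbalanced-distribution problem listed above (a cluster that is 50% full while some disks sit at 90%) can be illustrated with a small sketch. The function name, the sample figures, and the 25-point margin are all hypothetical choices made for illustration; they are not taken from the deck or from Ceph.

```python
from statistics import mean


def overfull_osds(utilization, margin=0.25):
    """Return OSD ids whose fill ratio exceeds the cluster mean by more than `margin`.

    `utilization` maps an OSD id to its fill ratio (0.0 to 1.0).
    The margin is an illustrative assumption, not a Ceph default.
    """
    avg = mean(utilization.values())
    return sorted(osd for osd, used in utilization.items() if used - avg > margin)


# Made-up sample: the cluster averages 50% full, but one disk is at 90%.
sample = {"osd.0": 0.45, "osd.1": 0.90, "osd.2": 0.40, "osd.3": 0.25}
print(overfull_osds(sample))
```

A disk flagged this way is the kind that may prevent backfill from completing even though the cluster as a whole has free space.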
Evolve Ceph Intelligence
Programmatic Intelligence (Past)
• Pre-defined rules
• Reactive solution
Machine Learning (Future)
• Self-learning & self-growing knowledge
• Proactive solution
Preemptive Detection of Disk Failure
• Problem
  • RBD image data distributed to all disks, but single disk failure can impact critical data IO
• Solution:
  • predict future disk failure (proactive)
• DiskProphet Solution
  • Disk near-failure likelihood prediction
  • Disk life-expectancy prediction
  • Actions to optimize Ceph
[Figure: IOPS over time for three cases: normal workload; 1 OSD failed, Ceph’s rebalancing; 1 OSD failure predicted, No-Impact Recovery by DiskProphet]
https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
Stabilizing Petabyte Ceph Cluster in OpenStack Cloud, Yuming Ma, Cisco
10/18/2016
Benefits and Resolutions
Impact Time
• Without Disk Prediction: Days (subject to cluster size)
• With Disk Prediction: 90% less performance degradation
Real World Prediction Results
Useful Reference: Stabilizing PB Ceph Cluster (a Cisco case)
https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
Duration: 2016/01/01 – 2016/03/31 (90 days)
Number of Drives: over 20,000
Average Accuracy 96.1%, Recall 97%
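For readers unfamiliar with the two metrics, here is a minimal sketch of how accuracy and recall are conventionally computed from a confusion matrix. It assumes the slide uses the standard definitions, which the deck does not state. The counts below are invented for illustration and are not the figures behind the reported results.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all drives classified correctly (failing or healthy)."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no samples")
    return (tp + tn) / total


def recall(tp, fn):
    """Fraction of drives that really failed which the model flagged in advance."""
    if tp + fn == 0:
        raise ValueError("no actual failures in the sample")
    return tp / (tp + fn)


# Hypothetical counts, NOT data from the deck:
# tp = failures predicted, fn = failures missed,
# fp = healthy drives flagged, tn = healthy drives left alone.
tp, fp, tn, fn = 8, 5, 85, 2
print(f"accuracy={accuracy(tp, fp, tn, fn):.2f} recall={recall(tp, fn):.2f}")
```

Recall matters most for preemptive replacement, because a missed failure (a false negative) still triggers the unplanned rebalancing the deck is trying to avoid.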
Disk Prediction Plugin for Ceph
• Add Disk Prediction Plugin Service - https://github.com/ceph/ceph/pull/22239
• Add OSD Device Health Prediction - https://github.com/ceph/ceph/pull/22785
• DeviceHealth Module - https://github.com/ceph/ceph/pull/22479
Architecture
[Diagram components: DiskPrediction Plugin (with local prediction module), Cloud Prediction Server, DeviceHealth Plugin, mgr, OSD. Data-flow labels: Device health, Ceph metrics, Prediction result.]
Health monitor actions:
• OSD health warning
• Mark OSD out
• …
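The health monitor actions named in the architecture (an OSD health warning, marking an OSD out) suggest a threshold policy driven by the predicted life expectancy. The sketch below is one way such a policy could look. The `Action` enum, the function name, and both thresholds are hypothetical and are not taken from the deck or from Ceph's configuration.

```python
from datetime import timedelta
from enum import Enum


class Action(Enum):
    NONE = "none"
    WARN = "osd health warning"
    MARK_OUT = "mark osd out"


# Hypothetical thresholds chosen for illustration only.
WARN_BELOW = timedelta(weeks=12)
MARK_OUT_BELOW = timedelta(weeks=4)


def choose_action(life_expectancy: timedelta) -> Action:
    """Map a predicted remaining lifetime to a health monitor action."""
    if life_expectancy < MARK_OUT_BELOW:
        return Action.MARK_OUT  # drain the OSD before the disk actually fails
    if life_expectancy < WARN_BELOW:
        return Action.WARN  # surface the risk to the operator
    return Action.NONE


print(choose_action(timedelta(weeks=2)).value)
print(choose_action(timedelta(weeks=8)).value)
print(choose_action(timedelta(weeks=30)).value)
```

Checking the stricter threshold first keeps the two ranges from overlapping, so a disk close to failure is marked out rather than only warned about.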
Collected data
• Ceph cluster info/health state
• Ceph mon/osd/mds metadata
• Ceph osd/pool performance dump
• Ceph mon/osd/mds/pool correlation
• OSD physical device health data (collected by devicehealth plugin)
Disk Prediction Modes
• Cloud – predicted by the cloud prediction server.
• All device health data
• Ceph performance counter data
• Local – predicted by the plugin prediction module.
• Some attributes of the device health data.
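The two modes differ in where prediction runs and in which inputs it uses. A minimal sketch of that distinction follows. The enum, the lookup table, and the function are illustrative names only and do not reflect the plugin's actual code or configuration keys.

```python
from enum import Enum


class PredictionMode(Enum):
    CLOUD = "cloud"  # predicted by the cloud prediction server
    LOCAL = "local"  # predicted by the plugin's local prediction module


# Inputs per mode, as listed on the slide.
MODE_INPUTS = {
    PredictionMode.CLOUD: (
        "all device health data",
        "Ceph performance counter data",
    ),
    PredictionMode.LOCAL: (
        "some attributes of the device health data",
    ),
}


def inputs_for(mode: PredictionMode) -> tuple:
    """Return the data categories a given mode relies on."""
    return MODE_INPUTS[mode]


for mode in PredictionMode:
    print(mode.value, "->", ", ".join(inputs_for(mode)))
```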
DiskPrediction feedback
• DiskPrediction plugin writes prediction result into the device info
• Ceph device set life expectancy command
# ceph device set-life-expectancy <devid> <from> {<to>}
• Ceph device show life expectancy
# ceph device ls
DEVICE                        HOST:DEV       DAEMONS  LIFE EXPECTANCY
TOSHIBA_DT01ACA050_44T2E4LAS  devcnode1:sdb  osd.0    >6w
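To consume the life expectancy shown above from a script, a row of the `ceph device ls` output can be split into fields. This is a minimal sketch that assumes the whitespace-separated, four-column layout of the sample row, with a single daemon per device. The `DeviceRow` type and `parse_device_row` function are hypothetical helpers, and real output may differ across Ceph releases.

```python
from typing import NamedTuple


class DeviceRow(NamedTuple):
    device: str
    host: str
    dev: str
    daemons: str
    life_expectancy: str


def parse_device_row(line: str) -> DeviceRow:
    """Parse one data row shaped like the sample: DEVICE HOST:DEV DAEMONS LIFE."""
    # Split into at most four fields; the last keeps any remaining text.
    device, host_dev, daemons, life = line.split(None, 3)
    host, dev = host_dev.split(":", 1)
    return DeviceRow(device, host, dev, daemons, life.strip())


row = parse_device_row("TOSHIBA_DT01ACA050_44T2E4LAS devcnode1:sdb osd.0 >6w")
print(row.device, row.host, row.dev, row.daemons, row.life_expectancy)
```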
Beyond Disk Failure Prediction
DiskProphet Features
01 Disk Health and Failure Prediction
02 Performance Baseline, Degradation & Capacity Prediction
03 Bad Sector, Slow Drive & Anomaly Detection
04 Correlation Map with Ceph
DiskProphet Versions for CEPH
Feature                          Community On-Prem  Community Cloud  Commercial Edition
Disk failure prediction          Yes                Yes              Yes
Accuracy                         ★★★★               ★★★★★            ★★★★★
Confidence                       ★★★                ★★★★★            ★★★★★
Performance prediction           No                 No               Yes
Anomaly & Bad Sector detection   No                 No               Yes
Replacement time                 No                 No               Yes
Correlation analysis with Ceph   No                 No               Yes
Availability                     Trial              Trial            Now
Services
DiskProphet Cloud
• Annual subscription fee
• Charges by nodes or disks
• Starts from 10 nodes or 50 disks
On-Premise
• Professional service charge for deployment
• Annual subscription fee
Integration Services
Thank You
Brian Jeng –
brian.jeng@prophetstor.com

More Related Content

What's hot

Ceph Tech Talk -- Ceph Benchmarking Tool
Ceph Tech Talk -- Ceph Benchmarking ToolCeph Tech Talk -- Ceph Benchmarking Tool
Ceph Tech Talk -- Ceph Benchmarking Tool
Ceph Community
 
OpenStack DRaaS - Freezer - 101
OpenStack DRaaS - Freezer - 101OpenStack DRaaS - Freezer - 101
OpenStack DRaaS - Freezer - 101
Trinath Somanchi
 
Bluestore
BluestoreBluestore
Bluestore
Patrick McGarry
 
Ceph as software define storage
Ceph as software define storageCeph as software define storage
Ceph as software define storage
Mahmoud Shiri Varamini
 
How to Survive an OpenStack Cloud Meltdown with Ceph
How to Survive an OpenStack Cloud Meltdown with CephHow to Survive an OpenStack Cloud Meltdown with Ceph
How to Survive an OpenStack Cloud Meltdown with Ceph
Sean Cohen
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year In
Sage Weil
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep Dive
Red_Hat_Storage
 
Protecting the Galaxy - Multi-Region Disaster Recovery with OpenStack and Ceph
Protecting the Galaxy - Multi-Region Disaster Recovery with OpenStack and CephProtecting the Galaxy - Multi-Region Disaster Recovery with OpenStack and Ceph
Protecting the Galaxy - Multi-Region Disaster Recovery with OpenStack and Ceph
Sean Cohen
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
Ceph Community
 
Ceph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross TurkCeph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross Turkbuildacloud
 
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
OpenStack Korea Community
 
Boosting I/O Performance with KVM io_uring
Boosting I/O Performance with KVM io_uringBoosting I/O Performance with KVM io_uring
Boosting I/O Performance with KVM io_uring
ShapeBlue
 
Ceph and Openstack in a Nutshell
Ceph and Openstack in a NutshellCeph and Openstack in a Nutshell
Ceph and Openstack in a Nutshell
Karan Singh
 
Drive into calico architecture
Drive into calico architectureDrive into calico architecture
Drive into calico architecture
Anirban Sen Chowdhary
 
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideCeph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Karan Singh
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSH
Sage Weil
 
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin ZhangLinux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Ceph Community
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
Jose De La Rosa
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
Ceph Community
 
Issues of OpenStack multi-region mode
Issues of OpenStack multi-region modeIssues of OpenStack multi-region mode
Issues of OpenStack multi-region mode
Joe Huang
 

What's hot (20)

Ceph Tech Talk -- Ceph Benchmarking Tool
Ceph Tech Talk -- Ceph Benchmarking ToolCeph Tech Talk -- Ceph Benchmarking Tool
Ceph Tech Talk -- Ceph Benchmarking Tool
 
OpenStack DRaaS - Freezer - 101
OpenStack DRaaS - Freezer - 101OpenStack DRaaS - Freezer - 101
OpenStack DRaaS - Freezer - 101
 
Bluestore
BluestoreBluestore
Bluestore
 
Ceph as software define storage
Ceph as software define storageCeph as software define storage
Ceph as software define storage
 
How to Survive an OpenStack Cloud Meltdown with Ceph
How to Survive an OpenStack Cloud Meltdown with CephHow to Survive an OpenStack Cloud Meltdown with Ceph
How to Survive an OpenStack Cloud Meltdown with Ceph
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year In
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep Dive
 
Protecting the Galaxy - Multi-Region Disaster Recovery with OpenStack and Ceph
Protecting the Galaxy - Multi-Region Disaster Recovery with OpenStack and CephProtecting the Galaxy - Multi-Region Disaster Recovery with OpenStack and Ceph
Protecting the Galaxy - Multi-Region Disaster Recovery with OpenStack and Ceph
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
 
Ceph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross TurkCeph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross Turk
 
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
 
Boosting I/O Performance with KVM io_uring
Boosting I/O Performance with KVM io_uringBoosting I/O Performance with KVM io_uring
Boosting I/O Performance with KVM io_uring
 
Ceph and Openstack in a Nutshell
Ceph and Openstack in a NutshellCeph and Openstack in a Nutshell
Ceph and Openstack in a Nutshell
 
Drive into calico architecture
Drive into calico architectureDrive into calico architecture
Drive into calico architecture
 
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideCeph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing Guide
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSH
 
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin ZhangLinux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
 
Issues of OpenStack multi-region mode
Issues of OpenStack multi-region modeIssues of OpenStack multi-region mode
Issues of OpenStack multi-region mode
 

Similar to Disk health prediction for Ceph

CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
Ceph Community
 
Health monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenterHealth monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenterAndrei Khurshudov
 
2007-05-23 Cecchet_PGCon2007.ppt
2007-05-23 Cecchet_PGCon2007.ppt2007-05-23 Cecchet_PGCon2007.ppt
2007-05-23 Cecchet_PGCon2007.ppt
nadirpervez2
 
Analysis of Database Issues using AHF and Machine Learning v2 - SOUG
Analysis of Database Issues using AHF and Machine Learning v2 -  SOUGAnalysis of Database Issues using AHF and Machine Learning v2 -  SOUG
Analysis of Database Issues using AHF and Machine Learning v2 - SOUG
Sandesh Rao
 
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
DataWorks Summit
 
6212883126866262792 performance testing_cloud
6212883126866262792 performance testing_cloud6212883126866262792 performance testing_cloud
6212883126866262792 performance testing_cloud
Locuto Riorama
 
DOST 2016 Cloud Without Failures
DOST 2016 Cloud Without FailuresDOST 2016 Cloud Without Failures
DOST 2016 Cloud Without Failures
Jorge Cardoso
 
VMworld Europe 2014: Virtualizing Databases Doing IT Right – The Sequel
VMworld Europe 2014: Virtualizing Databases Doing IT Right – The SequelVMworld Europe 2014: Virtualizing Databases Doing IT Right – The Sequel
VMworld Europe 2014: Virtualizing Databases Doing IT Right – The Sequel
VMworld
 
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red_Hat_Storage
 
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDSAccelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Ceph Community
 
Web Speed And Scalability
Web Speed And ScalabilityWeb Speed And Scalability
Web Speed And Scalability
Jason Ragsdale
 
Block & File Services – Die Lösung von Nutanix für ihre Anforderungen
Block & File Services – Die Lösung von Nutanix für ihre AnforderungenBlock & File Services – Die Lösung von Nutanix für ihre Anforderungen
Block & File Services – Die Lösung von Nutanix für ihre Anforderungen
NEXTtour
 
[db tech showcase Tokyo 2016] E34: Oracle SE - RAC, HA and Standby are Still ...
[db tech showcase Tokyo 2016] E34: Oracle SE - RAC, HA and Standby are Still ...[db tech showcase Tokyo 2016] E34: Oracle SE - RAC, HA and Standby are Still ...
[db tech showcase Tokyo 2016] E34: Oracle SE - RAC, HA and Standby are Still ...
Insight Technology, Inc.
 
Database Upgrades Automation using Enterprise Manager 12c
Database Upgrades Automation using Enterprise Manager 12cDatabase Upgrades Automation using Enterprise Manager 12c
Database Upgrades Automation using Enterprise Manager 12c
Hari Srinivasan
 
SoftServe's Hadoop Demo Lab
SoftServe's Hadoop Demo LabSoftServe's Hadoop Demo Lab
SoftServe's Hadoop Demo Lab
Valentin Kropov
 
Automate DG Best Practices
Automate DG  Best PracticesAutomate DG  Best Practices
Automate DG Best Practices
Mohsen B
 
4392091081755796971 emea10 zero_downtimeoperations
4392091081755796971 emea10 zero_downtimeoperations4392091081755796971 emea10 zero_downtimeoperations
4392091081755796971 emea10 zero_downtimeoperations
Locuto Riorama
 
Oracle Enterprise Manager 12c - OEM12c Presentation
Oracle Enterprise Manager 12c - OEM12c PresentationOracle Enterprise Manager 12c - OEM12c Presentation
Oracle Enterprise Manager 12c - OEM12c Presentation
Francisco Alvarez
 
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
Skytap Cloud
 
Tips for Installing Cognos Analytics: Configuring and Installing the Server
Tips for Installing Cognos Analytics: Configuring and Installing the ServerTips for Installing Cognos Analytics: Configuring and Installing the Server
Tips for Installing Cognos Analytics: Configuring and Installing the Server
Senturus
 

Similar to Disk health prediction for Ceph (20)

CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
 
Health monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenterHealth monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenter
 
2007-05-23 Cecchet_PGCon2007.ppt
2007-05-23 Cecchet_PGCon2007.ppt2007-05-23 Cecchet_PGCon2007.ppt
2007-05-23 Cecchet_PGCon2007.ppt
 
Analysis of Database Issues using AHF and Machine Learning v2 - SOUG
Analysis of Database Issues using AHF and Machine Learning v2 -  SOUGAnalysis of Database Issues using AHF and Machine Learning v2 -  SOUG
Analysis of Database Issues using AHF and Machine Learning v2 - SOUG
 
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
 
6212883126866262792 performance testing_cloud
6212883126866262792 performance testing_cloud6212883126866262792 performance testing_cloud
6212883126866262792 performance testing_cloud
 
DOST 2016 Cloud Without Failures
DOST 2016 Cloud Without FailuresDOST 2016 Cloud Without Failures
DOST 2016 Cloud Without Failures
 
VMworld Europe 2014: Virtualizing Databases Doing IT Right – The Sequel
VMworld Europe 2014: Virtualizing Databases Doing IT Right – The SequelVMworld Europe 2014: Virtualizing Databases Doing IT Right – The Sequel
VMworld Europe 2014: Virtualizing Databases Doing IT Right – The Sequel
 
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
 
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDSAccelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
 
Web Speed And Scalability
Web Speed And ScalabilityWeb Speed And Scalability
Web Speed And Scalability
 
Block & File Services – Die Lösung von Nutanix für ihre Anforderungen
Block & File Services – Die Lösung von Nutanix für ihre AnforderungenBlock & File Services – Die Lösung von Nutanix für ihre Anforderungen
Block & File Services – Die Lösung von Nutanix für ihre Anforderungen
 
[db tech showcase Tokyo 2016] E34: Oracle SE - RAC, HA and Standby are Still ...
[db tech showcase Tokyo 2016] E34: Oracle SE - RAC, HA and Standby are Still ...[db tech showcase Tokyo 2016] E34: Oracle SE - RAC, HA and Standby are Still ...
[db tech showcase Tokyo 2016] E34: Oracle SE - RAC, HA and Standby are Still ...
 
Database Upgrades Automation using Enterprise Manager 12c
Database Upgrades Automation using Enterprise Manager 12cDatabase Upgrades Automation using Enterprise Manager 12c
Database Upgrades Automation using Enterprise Manager 12c
 
SoftServe's Hadoop Demo Lab
SoftServe's Hadoop Demo LabSoftServe's Hadoop Demo Lab
SoftServe's Hadoop Demo Lab
 
Automate DG Best Practices
Automate DG  Best PracticesAutomate DG  Best Practices
Automate DG Best Practices
 
4392091081755796971 emea10 zero_downtimeoperations
4392091081755796971 emea10 zero_downtimeoperations4392091081755796971 emea10 zero_downtimeoperations
4392091081755796971 emea10 zero_downtimeoperations
 
Oracle Enterprise Manager 12c - OEM12c Presentation
Oracle Enterprise Manager 12c - OEM12c PresentationOracle Enterprise Manager 12c - OEM12c Presentation
Oracle Enterprise Manager 12c - OEM12c Presentation
 
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
 
Tips for Installing Cognos Analytics: Configuring and Installing the Server
Tips for Installing Cognos Analytics: Configuring and Installing the ServerTips for Installing Cognos Analytics: Configuring and Installing the Server
Tips for Installing Cognos Analytics: Configuring and Installing the Server
 

Recently uploaded

Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
ShamsuddeenMuhammadA
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
Roshan Dwivedi
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Yara Milbes
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
abdulrafaychaudhry
 

Recently uploaded (20)

Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
 

Disk health prediction for Ceph

  • 1. Disk Health Prediction for Ceph Brian Jeng, brian.jeng@prophetstor.com Copyright © 2018 by ProphetStor Data Services, Inc. All Rights Reserved. Brian Jeng – brian.jeng@prophetstor.com
  • 2. Ceph Pain Points
      • Rebalancing drops performance of a cluster for an extended period of time
      • Limited visibility of OSD health before failure
      • Limited predictive analytics
      • Utilizes programmatic intelligence rather than machine learning
      Figure labels: slow drives, failed drives
  • 3. Major Stability Problems: Cluster
      • Problem: Backfill & recovery impacting client IO
        Impact: OSD map changes due to loss of a disk, resulting in PG peering and backfilling. Result: clients encounter impeded and slow IO.
      • Problem: Unbalanced data distribution
        Impact: Data on physical disks isn’t evenly distributed. The cluster may be 50% full while some disks are at 90%. Result: backfill isn’t always able to complete.
      • Problem: Slow disk impacting client IO
        Impact: A single slow (impaired, not dead) OSD can severely impact many clients until it’s ejected from the cluster. Result: clients have slow or blocked IO.
      Source: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud, Yuming Ma, Cisco, 10/18/2016, https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
  • 4. Evolve Ceph Intelligence
      • Programmatic Intelligence (past): pre-defined rules; reactive solution
      • Machine Learning (future): self-learning & self-growing knowledge; proactive solution
  • 5. Preemptive Detection of Disk Failure
      • Problem: RBD image data is distributed to all disks, but a single disk failure can impact critical data IO
      • Solution: predict future disk failure (proactive)
      • DiskProphet Solution
          • Disk near-failure likelihood prediction
          • Disk life-expectancy prediction
          • Actions to optimize Ceph
      Figure: IOPS over time for three cases: normal workload; 1 OSD failed, Ceph’s rebalancing; 1 OSD failure predicted, No-Impact Recovery by DiskProphet
      Source: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud, Yuming Ma, Cisco, 10/18/2016, https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
  • 6. Benefits and Resolutions
      • Impact time without disk prediction: days (subject to cluster size)
      • With disk prediction: 90% less performance degradation
  • 7. Real World Prediction Results
      • Duration: 2016/01/01 – 2016/03/31 (90 days)
      • Number of drives: over 20,000
      • Average accuracy: 96.1%; recall: 97%
      • Useful reference: Stabilizing PB Ceph Cluster (a Cisco case), https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
  • 8. Disk Prediction Plugin for Ceph
      • Add Disk Prediction Plugin Service: https://github.com/ceph/ceph/pull/22239
      • Add OSD Device Health Prediction: https://github.com/ceph/ceph/pull/22785
      • DeviceHealth Module: https://github.com/ceph/ceph/pull/22479
  • 9. Architecture (diagram)
      • Components: DiskPrediction plugin (with local prediction module), Cloud Prediction Server, DeviceHealth plugin, mgr, OSD
      • Data-flow labels: Ceph metrics, device health, prediction result
      • Health monitor actions: OSD health warning, mark OSD out, …
  • 10. Collected data
      • Ceph cluster info/health state
      • Ceph mon/osd/mds metadata
      • Ceph osd/pool performance dump
      • Ceph mon/osd/mds/pool correlation
      • OSD physical device health data (collected by the devicehealth plugin)
  • 11. Disk Prediction Modes
      • Cloud: predicted by the cloud prediction server
          • All device health data
          • Ceph performance counter data
      • Local: predicted by the plugin prediction module
          • Some attributes of the device health data
  • 12. DiskPrediction feedback
      • The DiskPrediction plugin writes the prediction result into the device info
      • Ceph device set life expectancy command:
          # ceph device set-life-expectancy <devid> <from> {<to>}
      • Ceph device show life expectancy:
          # ceph device ls
          DEVICE                        HOST:DEV       DAEMONS  LIFE EXPECTANCY
          TOSHIBA_DT01ACA050_44T2E4LAS  devcnode1:sdb  osd.0    >6w
  • 13. Beyond Disk Failure Prediction: DiskProphet Features
      • 01 Disk Health and Failure Prediction
      • 02 Performance Baseline, Degradation & Capacity Prediction
      • 03 Bad Sector, Slow Drive & Anomaly Detection
      • 04 Correlation Map with Ceph
  • 14. DiskProphet Versions for CEPH
      Feature                          Community On-Prem   Community Cloud   Commercial Edition
      Disk failure prediction          Yes                 Yes               Yes
      Accuracy                         ★★★★                ★★★★★             ★★★★★
      Confidence                       ★★★                 ★★★★★             ★★★★★
      Performance prediction           No                  No                Yes
      Anomaly & Bad Sector detection   No                  No                Yes
      Replacement time                 No                  No                Yes
      Correlation analysis with Ceph   No                  No                Yes
      Availability                     Trial               Trial             Now
  • 15. Services
      • DiskProphet Cloud
          • Annual subscription fee
          • Charges by nodes or disks
          • Starts from 10 nodes or 50 disks
      • On-Premise
          • Professional service charge for deployment
          • Annual subscription fee
      • Integration Services
  • 20. Thank You. Brian Jeng – brian.jeng@prophetstor.com
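
The life-expectancy feedback on slide 12 could be consumed by a monitoring script along the following lines. This is a minimal sketch, not part of Ceph or DiskProphet: it assumes `ceph device ls` prints the four whitespace-separated columns shown on the slide with a single daemon per row, and it only understands life-expectancy values of the form shown there (an optional `>` followed by a number of weeks). The names `parse_device_ls`, `weeks_value`, `needs_attention` and the 4-week threshold are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional

# Sample text copied from slide 12; output on a real cluster may differ.
SAMPLE = """\
DEVICE HOST:DEV DAEMONS LIFE EXPECTANCY
TOSHIBA_DT01ACA050_44T2E4LAS devcnode1:sdb osd.0 >6w
"""


class DeviceRow(NamedTuple):
    device: str
    host_dev: str
    daemon: str
    life_expectancy: Optional[str]  # None when no prediction is recorded


def parse_device_ls(text: str) -> List[DeviceRow]:
    """Parse the tabular text into rows, skipping the header line."""
    rows = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        life = fields[3] if len(fields) > 3 else None
        rows.append(DeviceRow(fields[0], fields[1], fields[2], life))
    return rows


def weeks_value(life: Optional[str]) -> Optional[int]:
    """Return the week count from values such as '>6w' or '2w', else None."""
    if life is None:
        return None
    match = re.fullmatch(r">?(\d+)w", life)
    return int(match.group(1)) if match else None


def needs_attention(row: DeviceRow, threshold_weeks: int = 4) -> bool:
    """Flag a device whose predicted life is below the (hypothetical) threshold."""
    weeks = weeks_value(row.life_expectancy)
    return weeks is not None and weeks < threshold_weeks


for row in parse_device_ls(SAMPLE):
    status = "check soon" if needs_attention(row) else "ok"
    print(f"{row.device} ({row.daemon}): {status}")
```

A device with no recorded prediction is treated as "ok" here. A real deployment would more likely rely on the health monitor actions listed on slide 9 (OSD health warning, mark OSD out) than on an external script.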