This document discusses disk health prediction for Ceph storage clusters. It describes current pain points like performance degradation during rebalancing and lack of predictive analytics. The DiskProphet solution uses machine learning to predict future disk failures proactively, reducing performance impacts by 90%. It integrates with Ceph through plugins to provide disk health monitoring and predictions to optimize the cluster.
2. Ceph Pain Points
• Rebalancing drops performance of a cluster for an extended period of time
• Limited visibility of OSD health before failure
• Limited predictive analytics
• Utilizes programmatic intelligence rather than machine learning
[Figure labels: Slow drives, Failed drives]
3. Major Stability Problems: Cluster
• Problem: Backfill & recovery impacting client IO
  Impact: OSD map changes due to loss of a disk, resulting in PG peering and backfilling.
  Results: Clients encounter impeded and slow IO.
• Problem: Unbalanced data distribution
  Impact: Data on physical disks isn’t evenly distributed. The cluster may be 50% full, but some disks are at 90%.
  Results: Backfill isn’t always able to complete.
• Problem: Slow disk impacting client IO
  Impact: A single slow (impaired, not dead) OSD can severely impact many clients until it’s ejected from the cluster.
  Results: Clients have slow or blocked IO.
https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
Stabilizing Petabyte Ceph Cluster in OpenStack Cloud, Yuming Ma, Cisco
10/18/2016
5. Preemptive Detection of Disk Failure
• Problem
  • RBD image data is distributed to all disks, but a single disk failure can impact critical data IO
• Solution
  • Predict future disk failure (proactive)
• DiskProphet Solution
  • Disk near-failure likelihood prediction
  • Disk life-expectancy prediction
  • Actions to optimize Ceph
[Figure: IOPS over time for three cases: normal workload; 1 OSD failed, Ceph’s rebalancing; 1 OSD failure predicted, No-Impact Recovery by DiskProphet]
6. Benefits and Resolutions
• Impact time without disk prediction: days (subject to cluster size)
• With disk prediction: 90% less performance degradation
7. Real World Prediction Results
Useful Reference: Stabilizing PB Ceph Cluster (a Cisco case)
https://www.slideshare.net/Red_Hat_Storage/red-hat-storage-day-seattle-stabilizing-petabyte-ceph-cluster-in-openstack-cloud
Duration: 2016/01/01 – 2016/03/31 (90 days)
Number of Drives: over 20,000
Average Accuracy: 96.1%, Recall: 97%
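For readers unfamiliar with the two metrics reported above, here is a minimal sketch of how accuracy and recall are conventionally computed from prediction counts. The counts in the example are hypothetical placeholders chosen only for illustration; they are not the study's data, and the slide does not say how its averages were aggregated.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all drives whose predicted state (failing / healthy) was correct."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


def recall(tp, fn):
    """Fraction of drives that actually failed which were predicted to fail."""
    if tp + fn == 0:
        raise ValueError("no actual failures to score")
    return tp / (tp + fn)


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the deck:
    # tp = failures predicted, fn = failures missed,
    # fp = healthy drives flagged, tn = healthy drives left alone.
    tp, fp, tn, fn = 90, 20, 880, 10
    print(f"accuracy={accuracy(tp, fp, tn, fn):.3f} recall={recall(tp, fn):.3f}")
```

Recall is the metric that matters most for preemptive replacement, since every missed failure (a false negative) still triggers an unplanned rebalance.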
8. Disk Prediction Plugin for Ceph
• Add Disk Prediction Plugin Service - https://github.com/ceph/ceph/pull/22239
• Add OSD Device Health Prediction - https://github.com/ceph/ceph/pull/22785
• DeviceHealth Module - https://github.com/ceph/ceph/pull/22479
10. Collected data
• Ceph cluster info/health state
• Ceph mon/osd/mds metadata
• Ceph osd/pool performance dump
• Ceph mon/osd/mds/pool correlation
• OSD physical device health data (collected by devicehealth plugin)
11. Disk Prediction Modes
• Cloud – predicted by the cloud prediction server
  • All device health data
  • Ceph performance counter data
• Local – predicted by the plugin prediction module
  • Some attributes of the device health data
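The difference between the two modes can be sketched as a choice of input data, reading the sub-bullets above as the data each mode consumes (an interpretation, not something the slide states outright). Everything named here is hypothetical: `PredictionMode`, `build_prediction_input`, and especially `LOCAL_ATTRIBUTES`, since the slide says only "some attributes" and does not list them.

```python
from enum import Enum


class PredictionMode(Enum):
    CLOUD = "cloud"
    LOCAL = "local"


# Hypothetical attribute names; the deck does not say which attributes
# the local prediction module actually uses.
LOCAL_ATTRIBUTES = ("reallocated_sector_ct", "power_on_hours")


def build_prediction_input(mode, device_health, perf_counters):
    """Select the data handed to the predictor for the given mode."""
    if mode is PredictionMode.CLOUD:
        # Cloud mode: all device health data plus Ceph performance counters.
        return {
            "device_health": dict(device_health),
            "perf_counters": dict(perf_counters),
        }
    if mode is PredictionMode.LOCAL:
        # Local mode: only a subset of the device health attributes.
        subset = {k: v for k, v in device_health.items() if k in LOCAL_ATTRIBUTES}
        return {"device_health": subset}
    raise ValueError(f"unknown prediction mode: {mode!r}")
```

The richer input available in cloud mode is consistent with the higher accuracy and confidence ratings the deck gives the cloud edition.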
12. DiskPrediction feedback
• The DiskPrediction plugin writes the prediction result into the device info
• Set a device's life expectancy:
# ceph device set-life-expectancy <devid> <from> {<to>}
• Show device life expectancy:
# ceph device ls
DEVICE                        HOST:DEV       DAEMONS  LIFE EXPECTANCY
TOSHIBA_DT01ACA050_44T2E4LAS  devcnode1:sdb  osd.0    >6w
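A script that consumes this feedback might look like the following minimal sketch. It builds the argument list for the `set-life-expectancy` command shown above and parses `ceph device ls` text shaped like the sample row. The helper names are hypothetical, the parser assumes whitespace-separated columns with a single-token DAEMONS field as in the sample, and the date strings are placeholders because the slide does not specify the format of `<from>` and `<to>`.

```python
def set_life_expectancy_argv(devid, start, end=None):
    """Build argv for: ceph device set-life-expectancy <devid> <from> {<to>}."""
    argv = ["ceph", "device", "set-life-expectancy", devid, start]
    if end is not None:
        argv.append(end)  # <to> is optional
    return argv


def parse_device_ls(text):
    """Parse `ceph device ls` output laid out like the sample on this slide."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 3 or fields[0] == "DEVICE":
            continue
        host, _, dev = fields[1].partition(":")
        rows.append({
            "device": fields[0],
            "host": host,
            "dev": dev,
            "daemons": fields[2],
            # The column is empty when no prediction has been recorded.
            "life_expectancy": " ".join(fields[3:]) or None,
        })
    return rows


if __name__ == "__main__":
    sample = (
        "DEVICE HOST:DEV DAEMONS LIFE EXPECTANCY\n"
        "TOSHIBA_DT01ACA050_44T2E4LAS devcnode1:sdb osd.0 >6w\n"
    )
    for row in parse_device_ls(sample):
        print(row["device"], row["life_expectancy"])
    # Placeholder dates; the accepted date format is an assumption.
    print(set_life_expectancy_argv("TOSHIBA_DT01ACA050_44T2E4LAS", "2018-09-01", "2018-10-13"))
```

Returning an argument list rather than a shell string lets a caller pass it directly to `subprocess.run` without quoting concerns.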
13. Beyond Disk Failure Prediction
DiskProphet Features
• 01 Disk Health and Failure Prediction
• 02 Performance Baseline, Degradation & Capacity Prediction
• 03 Bad Sector, Slow Drive & Anomaly Detection
• 04 Correlation Map with Ceph
14. DiskProphet Versions for Ceph
Feature                         Community On-Prem  Community Cloud  Commercial Edition
Disk failure prediction         Yes                Yes              Yes
Accuracy                        ★★★★               ★★★★★            ★★★★★
Confidence                      ★★★                ★★★★★            ★★★★★
Performance prediction          No                 No               Yes
Anomaly & Bad Sector detection  No                 No               Yes
Replacement time                No                 No               Yes
Correlation analysis with Ceph  No                 No               Yes
Availability                    Trial              Trial            Now
15. Services
DiskProphet Cloud
• Annual subscription fee
• Charges by nodes or disks
• Starts from 10 nodes or 50 disks
On-Premise
• Professional service charge for deployment
• Annual subscription fee
Integration Services