Making Ceph fast in the face of failure

•

0 likes•1,884 views

mountpoint.io

Making Ceph fast in the face of failure with Neha Ojha and Josh Durgin

Technology

Making Ceph Fast in the Face of Failure
Josh Durgin and Neha Ojha
2018.08.28

Hammer
Baseline Hammer
0
0.2
0.4
0.6
0.8
1
1.2
Recovery vs Baseline IOPS
IOPS/Baseline

Hammer
●
Favored maximum recovery speed
●
Default client impact was huge
●
Tuning could help
osd max backfills = 10 → 1
osd recovery max active = 15 → 3
osd recovery op priority = 10 → 3
osd recovery max single start = 5 → 1

Infernalis
Baseline Hammer Infernalis
0
0.2
0.4
0.6
0.8
1
1.2
Recovery vs Baseline IOPS
IOPS/Baseline

Manual way to recover higher-level constructs, e.g. rbd images
pg force-recovery command
ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
If you change your mind or prioritize wrong groups, use:
ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
Luminous – High-level Priority

Luminous – Throttling Recovery
●
osd_recovery_sleep
●
changing this will shift the balance between
recovery and client I/O
●
Different configurations based on underlying
hardware

Recovery Sleep - HDDs
BlueStore on HDDs with Fio 4k random writes
➢ osd_recovery_sleep_hdd chosen as 0.1

Recovery Sleep - SSDs
BlueStore on SSDs with Fio 4k random writes
➢ osd_recovery_sleep_ssd chosen as 0

Recovery Sleep - Hybrid
BlueStore on HDDs+SSDs with Fio 4k random writes
➢ osd_recovery_sleep_hybrid chosen as 0.025

Luminous Defaults are much better
Baseline Hammer Infernalis Luminous
0
0.2
0.4
0.6
0.8
1
1.2
Recovery vs Baseline IOPS
IOPS/Baseline

Recovery in Ceph has been a synchronous process - it blocked writes to an object
until it was recovered.
Problem: This increases write latencies and affects availability.
Solution in Mimic: Asynchronous Recovery
Do not block writes on objects, which are only missing on non-actingset OSDs.
Perform recovery in the background on an OSD, out of the acting set, similar to
backfill, and use the PG log to determine what needs recovery.
Improving Latency during Recovery

Async recovery targets - OSDs that are not part of the acting set and are chosen
based on the following:
● approximate magnitude of the difference in length of logs is used as the cost
of recovery, async recovery targets have higher cost to recover
● threshold osd_async_recovery_min_pg_log_entries(default value=100) is
used to determine when asynchronous recovery is appropriate
● min_size replicas should be available
When do we perform asynchronous recovery?

RGW Workload generated using Cosbench.
Operations - read, list, write, delete
Cluster prefilled ~ 30%, 40000 objects
Kill one OSD to induce recovery
Recovery Experiments

No Failure
OSD Failure
Throughput Comparison - COSBench

- Optimizations
- Partial object recovery
- Speed up backfill scanning and transfer
- Adaptive recovery throttling - set the value of osd_recovery_sleep based on
client load.
- QoS
- Recovery Order
Future Work

- Upgrade! Much better defaults and finer tuning in Luminous and Mimic
- Recovery sleep is all you need if you want to change client i/o vs recovery
balance
- Tuning older options (e.g. max active, single start, max backfill) only needed if
you want to increase recovery/backfill throughput
Summary

THANK YOU
Josh Durgin : jdurgin@redhat.com
Neha Ojha : nojha@redhat.com

What's hot

Vmlinux: anatomy of bzimage and how x86 64 processor is bootedAdrian Huang

High Availability PostgreSQL with Zalando PatroniZalando Technology

Webinar: MongoDB Schema Design and Performance ImplicationsMongoDB

Secure container: Kata container and gVisorChing-Hsuan Yen

Windows IOCP vs Linux EPOLL Performance ComparisonSeungmo Koo

Gateway/APIC securityShiu-Fun Poon

Minio Cloud StorageMinio

2021.02 new in Ceph Pacific DashboardCeph Community

Deep Dive on Amazon DynamoDBAmazon Web Services

Redis for duplicate detection on real time streamRoberto Franchini

Redis in PracticeNoah Davis

Building Open Source Identity Management with FreeIPALDAPCon

Pynvme introductionCrane Chu

Xen DebuggingThe Linux Foundation

Black Hat Europe 2017. DPAPI and DPAPI-NG: Decryption ToolkitPaula Januszkiewicz

Twitter의 snowflake 소개 및 활용흥배 최

Deep Dive on Amazon EC2 instancesAmazon Web Services

ceph optimization on ssd ilsoo byun-shortNAVER D2

How to Import JSON Using Cypher and APOCNeo4j

QEMU Disk IO Which performs Better: Native or threads?Pradeep Kumar

What's hot (20)

Vmlinux: anatomy of bzimage and how x86 64 processor is booted

High Availability PostgreSQL with Zalando Patroni

Webinar: MongoDB Schema Design and Performance Implications

Secure container: Kata container and gVisor

Windows IOCP vs Linux EPOLL Performance Comparison

Gateway/APIC security

Minio Cloud Storage

2021.02 new in Ceph Pacific Dashboard

Deep Dive on Amazon DynamoDB

Redis for duplicate detection on real time stream

Redis in Practice

Building Open Source Identity Management with FreeIPA

Pynvme introduction

Xen Debugging

Black Hat Europe 2017. DPAPI and DPAPI-NG: Decryption Toolkit

Twitter의 snowflake 소개 및 활용

Deep Dive on Amazon EC2 instances

ceph optimization on ssd ilsoo byun-short

How to Import JSON Using Cypher and APOC

QEMU Disk IO Which performs Better: Native or threads?

Similar to Making Ceph fast in the face of failure

Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons LearnedCeph Community

Cephalocon apac chinaVikhyat Umrao

Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...Ceph Community

Postgres Vision 2018: Making Postgres Even FasterEDB

LCU14 201- Binary Analysis ToolsLinaro

Oracle: Binding versus cagingBertrandDrouvot

Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionKaran Singh

MesosCon 2018Pablo Delgado

Techniques to Improve Cache SpeedZohaib Hassan

ceph-barcelona-v-1.2Ranga Swami Reddy Muthumula

Ceph barcelona-v-1.2Ranga Swami Reddy Muthumula

Streaming replication in practiceAlexey Lesovsky

Emr spark tuning demystifiedOmid Vahdaty

Build an High-Performance and High-Durable Block Storage Service Based on CephRongze Zhu

Deep dive into the Rds PostgreSQL Universe Austin 2017Grant McAlister

Filesystem Performance from a Database PerspectiveMark Wong

Ceph Day Melbourne - Troubleshooting Ceph Ceph Community

Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Kohei KaiGai

DevoxxUK: Optimizating Application Performance on KubernetesDinakar Guniguntala

Dumb Simple PostgreSQL Performance (NYCPUG)Joshua Drake

Similar to Making Ceph fast in the face of failure (20)

Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned

Cephalocon apac china

Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...

Postgres Vision 2018: Making Postgres Even Faster

LCU14 201- Binary Analysis Tools

Oracle: Binding versus caging

Ceph Object Storage Performance Secrets and Ceph Data Lake Solution

MesosCon 2018

Techniques to Improve Cache Speed

ceph-barcelona-v-1.2

Ceph barcelona-v-1.2

Streaming replication in practice

Emr spark tuning demystified

Build an High-Performance and High-Durable Block Storage Service Based on Ceph

Deep dive into the Rds PostgreSQL Universe Austin 2017

Filesystem Performance from a Database Perspective

Ceph Day Melbourne - Troubleshooting Ceph

Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)

DevoxxUK: Optimizating Application Performance on Kubernetes

Dumb Simple PostgreSQL Performance (NYCPUG)

Recently uploaded

Salesforce Community Group Quito, Salesforce 101Paola De la Torre

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent

08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls

SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren

Install Stable Diffusion in windows machinePadma Pradeep

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j

[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745

How to Remove Document Management Hurdles with X-Docs?XfilesPro

Pigging Solutions in Pet Food ManufacturingPigging Solutions

My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar

Understanding the Laravel MVC ArchitecturePixlogix Infotech

Presentation on how to chat with PDF using ChatGPT code interpreternaman860154

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada

04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG

08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls

Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada

AI as an Interface for Commercial BuildingsMemoori

Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik

Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4

Recently uploaded (20)

Salesforce Community Group Quito, Salesforce 101

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...

08448380779 Call Girls In Greater Kailash - I Women Seeking Men

SQL Database Design For Developers at php[tek] 2024

Install Stable Diffusion in windows machine

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...

[2024]Digital Global Overview Report 2024 Meltwater.pdf

How to Remove Document Management Hurdles with X-Docs?

Pigging Solutions in Pet Food Manufacturing

My Hashitalk Indonesia April 2024 Presentation

Understanding the Laravel MVC Architecture

Presentation on how to chat with PDF using ChatGPT code interpreter

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx

08448380779 Call Girls In Civil Lines Women Seeking Men

Unblocking The Main Thread Solving ANRs and Frozen Frames

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024

AI as an Interface for Commercial Buildings

Injustice - Developers Among Us (SciFiDevCon 2024)

Azure Monitor & Application Insight to monitor Infrastructure & Application

Making Ceph fast in the face of failure

1. Making Ceph Fast in the Face of Failure Josh Durgin and Neha Ojha 2018.08.28

2. The Dark Ages

3. Hammer Baseline Hammer 0 0.2 0.4 0.6 0.8 1 1.2 Recovery vs Baseline IOPS IOPS/Baseline

4. Hammer ● Favored maximum recovery speed ● Default client impact was huge ● Tuning could help osd max backfills = 10 → 1 osd recovery max active = 15 → 3 osd recovery op priority = 10 → 3 osd recovery max single start = 5 → 1

5. Infernalis Baseline Hammer Infernalis 0 0.2 0.4 0.6 0.8 1 1.2 Recovery vs Baseline IOPS IOPS/Baseline

6. Manual way to recover higher-level constructs, e.g. rbd images pg force-recovery command ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] If you change your mind or prioritize wrong groups, use: ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] Luminous – High-level Priority

7. Luminous – Throttling Recovery ● osd_recovery_sleep ● changing this will shift the balance between recovery and client I/O ● Different configurations based on underlying hardware

8. Recovery Sleep - HDDs BlueStore on HDDs with Fio 4k random writes ➢ osd_recovery_sleep_hdd chosen as 0.1

9. Recovery Sleep - SSDs BlueStore on SSDs with Fio 4k random writes ➢ osd_recovery_sleep_ssd chosen as 0

10. Recovery Sleep - Hybrid BlueStore on HDDs+SSDs with Fio 4k random writes ➢ osd_recovery_sleep_hybrid chosen as 0.025

11. Luminous Defaults are much better Baseline Hammer Infernalis Luminous 0 0.2 0.4 0.6 0.8 1 1.2 Recovery vs Baseline IOPS IOPS/Baseline

12. Recovery in Ceph has been a synchronous process - it blocked writes to an object until it was recovered. Problem: This increases write latencies and affects availability. Solution in Mimic: Asynchronous Recovery Do not block writes on objects, which are only missing on non-actingset OSDs. Perform recovery in the background on an OSD, out of the acting set, similar to backfill, and use the PG log to determine what needs recovery. Improving Latency during Recovery

13. Async recovery targets - OSDs that are not part of the acting set and are chosen based on the following: ● approximate magnitude of the difference in length of logs is used as the cost of recovery, async recovery targets have higher cost to recover ● threshold osd_async_recovery_min_pg_log_entries(default value=100) is used to determine when asynchronous recovery is appropriate ● min_size replicas should be available When do we perform asynchronous recovery?

14. RGW Workload generated using Cosbench. Operations - read, list, write, delete Cluster prefilled ~ 30%, 40000 objects Kill one OSD to induce recovery Recovery Experiments

15. Impact on Throughput during Recovery

16. Impact on Latency during Recovery

17. No Failure OSD Failure Throughput Comparison - COSBench

18. - Optimizations - Partial object recovery - Speed up backfill scanning and transfer - Adaptive recovery throttling - set the value of osd_recovery_sleep based on client load. - QoS - Recovery Order Future Work

19. - Upgrade! Much better defaults and finer tuning in Luminous and Mimic - Recovery sleep is all you need if you want to change client i/o vs recovery balance - Tuning older options (e.g. max active, single start, max backfill) only needed if you want to increase recovery/backfill throughput Summary

20. THANK YOU Josh Durgin : jdurgin@redhat.com Neha Ojha : nojha@redhat.com

Making Ceph fast in the face of failure

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Making Ceph fast in the face of failure

Similar to Making Ceph fast in the face of failure (20)

Recently uploaded

Recently uploaded (20)

Making Ceph fast in the face of failure