SlideShare a Scribd company logo
Making Ceph Fast in the Face of Failure
Josh Durgin and Neha Ojha
2018.08.28
The Dark Ages
Hammer
Baseline Hammer
0
0.2
0.4
0.6
0.8
1
1.2
Recovery vs Baseline IOPS
IOPS/Baseline
Hammer
●
Favored maximum recovery speed
●
Default client impact was huge
●
Tuning could help
osd max backfills = 10 → 1
osd recovery max active = 15 → 3
osd recovery op priority = 10 → 3
osd recovery max single start = 5 → 1
Infernalis
Baseline Hammer Infernalis
0
0.2
0.4
0.6
0.8
1
1.2
Recovery vs Baseline IOPS
IOPS/Baseline
Manual way to recover higher-level constructs, e.g. rbd images
pg force-recovery command
ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
If you change your mind or prioritize wrong groups, use:
ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
Luminous – High-level Priority
Luminous – Throttling Recovery
●
osd_recovery_sleep
●
changing this will shift the balance between
recovery and client I/O
●
Different configurations based on underlying
hardware
Recovery Sleep - HDDs
BlueStore on HDDs with Fio 4k random writes
➢ osd_recovery_sleep_hdd chosen as 0.1
Recovery Sleep - SSDs
BlueStore on SSDs with Fio 4k random writes
➢ osd_recovery_sleep_ssd chosen as 0
Recovery Sleep - Hybrid
BlueStore on HDDs+SSDs with Fio 4k random writes
➢ osd_recovery_sleep_hybrid chosen as 0.025
Luminous Defaults are much better
Baseline Hammer Infernalis Luminous
0
0.2
0.4
0.6
0.8
1
1.2
Recovery vs Baseline IOPS
IOPS/Baseline
Recovery in Ceph has been a synchronous process - it blocked writes to an object
until it was recovered.
Problem: This increases write latencies and affects availability.
Solution in Mimic: Asynchronous Recovery
Do not block writes on objects, which are only missing on non-actingset OSDs.
Perform recovery in the background on an OSD, out of the acting set, similar to
backfill, and use the PG log to determine what needs recovery.
Improving Latency during Recovery
Async recovery targets - OSDs that are not part of the acting set and are chosen
based on the following:
● approximate magnitude of the difference in length of logs is used as the cost
of recovery, async recovery targets have higher cost to recover
● threshold osd_async_recovery_min_pg_log_entries(default value=100) is
used to determine when asynchronous recovery is appropriate
● min_size replicas should be available
When do we perform asynchronous recovery?
RGW Workload generated using Cosbench.
Operations - read, list, write, delete
Cluster prefilled ~ 30%, 40000 objects
Kill one OSD to induce recovery
Recovery Experiments
Impact on Throughput during Recovery
Impact on Latency during Recovery
No Failure
OSD Failure
Throughput Comparison - COSBench
- Optimizations
- Partial object recovery
- Speed up backfill scanning and transfer
- Adaptive recovery throttling - set the value of osd_recovery_sleep based on
client load.
- QoS
- Recovery Order
Future Work
- Upgrade! Much better defaults and finer tuning in Luminous and Mimic
- Recovery sleep is all you need if you want to change client i/o vs recovery
balance
- Tuning older options (e.g. max active, single start, max backfill) only needed if
you want to increase recovery/backfill throughput
Summary
THANK YOU
Josh Durgin : jdurgin@redhat.com
Neha Ojha : nojha@redhat.com

More Related Content

What's hot

Ceph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross TurkCeph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross Turk
buildacloud
 
Ceph Introduction 2017
Ceph Introduction 2017  Ceph Introduction 2017
Ceph Introduction 2017
Karan Singh
 
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideCeph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Karan Singh
 
Apache ZooKeeper
Apache ZooKeeperApache ZooKeeper
Apache ZooKeeper
Scott Leberknight
 
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
OpenStack Korea Community
 
Ceph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Month 2021: RADOS Update
Ceph Month 2021: RADOS Update
Ceph Community
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
Ceph Community
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
Altinity Ltd
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
Karan Singh
 
NoSQL 위에서 MMORPG 개발하기
NoSQL 위에서 MMORPG 개발하기NoSQL 위에서 MMORPG 개발하기
NoSQL 위에서 MMORPG 개발하기
Hoyoung Choi
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Imply
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSH
Sage Weil
 
Block Storage For VMs With Ceph
Block Storage For VMs With CephBlock Storage For VMs With Ceph
Block Storage For VMs With Ceph
The Linux Foundation
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series Database
All Things Open
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
Amazon Web Services
 
Introduction to KubeSphere and its open source ecosystem
Introduction to KubeSphere and its open source ecosystemIntroduction to KubeSphere and its open source ecosystem
Introduction to KubeSphere and its open source ecosystem
KubeSphere
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
DataStax Academy
 
Bluestore
BluestoreBluestore
Bluestore
Patrick McGarry
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
ScyllaDB
 
Automate Your Kafka Cluster with Kubernetes Custom Resources
Automate Your Kafka Cluster with Kubernetes Custom Resources Automate Your Kafka Cluster with Kubernetes Custom Resources
Automate Your Kafka Cluster with Kubernetes Custom Resources
confluent
 

What's hot (20)

Ceph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross TurkCeph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross Turk
 
Ceph Introduction 2017
Ceph Introduction 2017  Ceph Introduction 2017
Ceph Introduction 2017
 
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideCeph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing Guide
 
Apache ZooKeeper
Apache ZooKeeperApache ZooKeeper
Apache ZooKeeper
 
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
 
Ceph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Month 2021: RADOS Update
Ceph Month 2021: RADOS Update
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
 
NoSQL 위에서 MMORPG 개발하기
NoSQL 위에서 MMORPG 개발하기NoSQL 위에서 MMORPG 개발하기
NoSQL 위에서 MMORPG 개발하기
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on Druid
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSH
 
Block Storage For VMs With Ceph
Block Storage For VMs With CephBlock Storage For VMs With Ceph
Block Storage For VMs With Ceph
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series Database
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
Introduction to KubeSphere and its open source ecosystem
Introduction to KubeSphere and its open source ecosystemIntroduction to KubeSphere and its open source ecosystem
Introduction to KubeSphere and its open source ecosystem
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Bluestore
BluestoreBluestore
Bluestore
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
 
Automate Your Kafka Cluster with Kubernetes Custom Resources
Automate Your Kafka Cluster with Kubernetes Custom Resources Automate Your Kafka Cluster with Kubernetes Custom Resources
Automate Your Kafka Cluster with Kubernetes Custom Resources
 

Similar to Making Ceph fast in the face of failure

Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons LearnedCeph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Community
 
Cephalocon apac china
Cephalocon apac chinaCephalocon apac china
Cephalocon apac china
Vikhyat Umrao
 
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Ceph Community
 
Postgres Vision 2018: Making Postgres Even Faster
Postgres Vision 2018: Making Postgres Even FasterPostgres Vision 2018: Making Postgres Even Faster
Postgres Vision 2018: Making Postgres Even Faster
EDB
 
LCU14 201- Binary Analysis Tools
LCU14 201- Binary Analysis ToolsLCU14 201- Binary Analysis Tools
LCU14 201- Binary Analysis Tools
Linaro
 
Oracle: Binding versus caging
Oracle: Binding versus cagingOracle: Binding versus caging
Oracle: Binding versus caging
BertrandDrouvot
 
MesosCon 2018
MesosCon 2018MesosCon 2018
MesosCon 2018
Pablo Delgado
 
Techniques to Improve Cache Speed
Techniques to Improve Cache SpeedTechniques to Improve Cache Speed
Techniques to Improve Cache Speed
Zohaib Hassan
 
ceph-barcelona-v-1.2
ceph-barcelona-v-1.2ceph-barcelona-v-1.2
ceph-barcelona-v-1.2
Ranga Swami Reddy Muthumula
 
Ceph barcelona-v-1.2
Ceph barcelona-v-1.2Ceph barcelona-v-1.2
Ceph barcelona-v-1.2
Ranga Swami Reddy Muthumula
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practice
Alexey Lesovsky
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
Omid Vahdaty
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Rongze Zhu
 
Deep dive into the Rds PostgreSQL Universe Austin 2017
Deep dive into the Rds PostgreSQL Universe Austin 2017Deep dive into the Rds PostgreSQL Universe Austin 2017
Deep dive into the Rds PostgreSQL Universe Austin 2017
Grant McAlister
 
Filesystem Performance from a Database Perspective
Filesystem Performance from a Database PerspectiveFilesystem Performance from a Database Perspective
Filesystem Performance from a Database Perspective
Mark Wong
 
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Community
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Kohei KaiGai
 
DevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on Kubernetes
Dinakar Guniguntala
 
Dumb Simple PostgreSQL Performance (NYCPUG)
Dumb Simple PostgreSQL Performance (NYCPUG)Dumb Simple PostgreSQL Performance (NYCPUG)
Dumb Simple PostgreSQL Performance (NYCPUG)
Joshua Drake
 
Taming Go's Memory Usage — and Avoiding a Rust Rewrite
Taming Go's Memory Usage — and Avoiding a Rust RewriteTaming Go's Memory Usage — and Avoiding a Rust Rewrite
Taming Go's Memory Usage — and Avoiding a Rust Rewrite
ScyllaDB
 

Similar to Making Ceph fast in the face of failure (20)

Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons LearnedCeph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
 
Cephalocon apac china
Cephalocon apac chinaCephalocon apac china
Cephalocon apac china
 
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
 
Postgres Vision 2018: Making Postgres Even Faster
Postgres Vision 2018: Making Postgres Even FasterPostgres Vision 2018: Making Postgres Even Faster
Postgres Vision 2018: Making Postgres Even Faster
 
LCU14 201- Binary Analysis Tools
LCU14 201- Binary Analysis ToolsLCU14 201- Binary Analysis Tools
LCU14 201- Binary Analysis Tools
 
Oracle: Binding versus caging
Oracle: Binding versus cagingOracle: Binding versus caging
Oracle: Binding versus caging
 
MesosCon 2018
MesosCon 2018MesosCon 2018
MesosCon 2018
 
Techniques to Improve Cache Speed
Techniques to Improve Cache SpeedTechniques to Improve Cache Speed
Techniques to Improve Cache Speed
 
ceph-barcelona-v-1.2
ceph-barcelona-v-1.2ceph-barcelona-v-1.2
ceph-barcelona-v-1.2
 
Ceph barcelona-v-1.2
Ceph barcelona-v-1.2Ceph barcelona-v-1.2
Ceph barcelona-v-1.2
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practice
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
 
Deep dive into the Rds PostgreSQL Universe Austin 2017
Deep dive into the Rds PostgreSQL Universe Austin 2017Deep dive into the Rds PostgreSQL Universe Austin 2017
Deep dive into the Rds PostgreSQL Universe Austin 2017
 
Filesystem Performance from a Database Perspective
Filesystem Performance from a Database PerspectiveFilesystem Performance from a Database Perspective
Filesystem Performance from a Database Perspective
 
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
 
DevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on Kubernetes
 
Dumb Simple PostgreSQL Performance (NYCPUG)
Dumb Simple PostgreSQL Performance (NYCPUG)Dumb Simple PostgreSQL Performance (NYCPUG)
Dumb Simple PostgreSQL Performance (NYCPUG)
 
Taming Go's Memory Usage — and Avoiding a Rust Rewrite
Taming Go's Memory Usage — and Avoiding a Rust RewriteTaming Go's Memory Usage — and Avoiding a Rust Rewrite
Taming Go's Memory Usage — and Avoiding a Rust Rewrite
 

Recently uploaded

Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 

Recently uploaded (20)

Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 

Making Ceph fast in the face of failure

  • 1. Making Ceph Fast in the Face of Failure Josh Durgin and Neha Ojha 2018.08.28
  • 4. Hammer ● Favored maximum recovery speed ● Default client impact was huge ● Tuning could help osd max backfills = 10 → 1 osd recovery max active = 15 → 3 osd recovery op priority = 10 → 3 osd recovery max single start = 5 → 1
  • 6. Manual way to recover higher-level constructs, e.g. rbd images pg force-recovery command ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] If you change your mind or prioritize wrong groups, use: ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] Luminous – High-level Priority
  • 7. Luminous – Throttling Recovery ● osd_recovery_sleep ● changing this will shift the balance between recovery and client I/O ● Different configurations based on underlying hardware
  • 8. Recovery Sleep - HDDs BlueStore on HDDs with Fio 4k random writes ➢ osd_recovery_sleep_hdd chosen as 0.1
  • 9. Recovery Sleep - SSDs BlueStore on SSDs with Fio 4k random writes ➢ osd_recovery_sleep_ssd chosen as 0
  • 10. Recovery Sleep - Hybrid BlueStore on HDDs+SSDs with Fio 4k random writes ➢ osd_recovery_sleep_hybrid chosen as 0.025
  • 11. Luminous Defaults are much better Baseline Hammer Infernalis Luminous 0 0.2 0.4 0.6 0.8 1 1.2 Recovery vs Baseline IOPS IOPS/Baseline
  • 12. Recovery in Ceph has been a synchronous process - it blocked writes to an object until it was recovered. Problem: This increases write latencies and affects availability. Solution in Mimic: Asynchronous Recovery Do not block writes on objects, which are only missing on non-actingset OSDs. Perform recovery in the background on an OSD, out of the acting set, similar to backfill, and use the PG log to determine what needs recovery. Improving Latency during Recovery
  • 13. Async recovery targets - OSDs that are not part of the acting set and are chosen based on the following: ● approximate magnitude of the difference in length of logs is used as the cost of recovery, async recovery targets have higher cost to recover ● threshold osd_async_recovery_min_pg_log_entries(default value=100) is used to determine when asynchronous recovery is appropriate ● min_size replicas should be available When do we perform asynchronous recovery?
  • 14. RGW Workload generated using Cosbench. Operations - read, list, write, delete Cluster prefilled ~ 30%, 40000 objects Kill one OSD to induce recovery Recovery Experiments
  • 15. Impact on Throughput during Recovery
  • 16. Impact on Latency during Recovery
  • 17. No Failure OSD Failure Throughput Comparison - COSBench
  • 18. - Optimizations - Partial object recovery - Speed up backfill scanning and transfer - Adaptive recovery throttling - set the value of osd_recovery_sleep based on client load. - QoS - Recovery Order Future Work
  • 19. - Upgrade! Much better defaults and finer tuning in Luminous and Mimic - Recovery sleep is all you need if you want to change client i/o vs recovery balance - Tuning older options (e.g. max active, single start, max backfill) only needed if you want to increase recovery/backfill throughput Summary
  • 20. THANK YOU Josh Durgin : jdurgin@redhat.com Neha Ojha : nojha@redhat.com