Ceph in 2023 & Beyond
HEPiX Autumn 2023 Workshop
October 18, 2023
Dan van der Ster
Ceph Executive Council / CTO Clyso GmbH
About Me
● University of Victoria - 1998:
○ B.Eng in Computer Engineering @ UVic
○ PhD in Grid Computing @ UVic – Supervisor Dr. Randall J. Sobie
● CERN - 2008
○ Grid Group: ATLAS Distributed Analysis Dev and Coordinator 2008-2012
○ Storage Group: AFS, CVMFS, Ceph Service Manager 2013-2022
○ Governance Group: Chief IT Architect 2022-2023
○ Sabbatical Leave 2023-present
● Ceph Open Source Project - 2013:
○ Ceph Foundation Board Member 2015-present
○ Ceph Executive Council 2021-present
● Clyso GmbH - 2023
○ CTO – leading North American expansion
Outline
● Brief Introduction to Ceph
● Recent Developments
● Ceph Community News
● What I’m working on
Introduction to Ceph
● How many of you know Ceph? Operate Ceph? Like/dislike Ceph?
● Built upon a Reliable Autonomic Distributed Object Store: RADOS
● Objects are distributed pseudorandomly using CRUSH (a minimal client sketch follows this list)
● End result:
○ Enterprise-quality Block, File, and Object storage using commodity hardware
○ Scalable, reliable, organic technology backing much of the world’s cloud infrastructures
○ Open Source Software – the Linux of Storage
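To make the RADOS layer concrete, here is a minimal sketch (not from the slides) using the official python3-rados bindings; the pool name, object name, and ceph.conf path are placeholders for whatever your cluster uses.

```python
import rados

# Minimal sketch: store and fetch one object directly in RADOS.
# Pool "test-pool" and the conffile path are assumptions; adapt to your cluster.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('test-pool')           # pool must already exist
ioctx.write_full('hello-object', b'hello RADOS')  # CRUSH picks the OSDs; no central lookup table
print(ioctx.read('hello-object'))

ioctx.close()
cluster.shutdown()
```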
History of Ceph
● 2007 - Sage Weil’s PhD on CRUSH and CephFS
● 2011 - Inktank startup founded to commercialize Ceph
● 2013 - CERN started using Ceph
● 2014 - Inktank acquired by Red Hat
● 2014 - Dan presented “Ceph@CERN: One year on…” at HEPiX LAPP
● 2018 - Creation of the Ceph Foundation
● 2019 - Red Hat acquired by IBM
● 2023 - Ceph team reassigned from RH to IBM
Ceph Architecture
● RADOS: low-level object store
● RBD: virtual block devices, e.g. /dev/vdb attached to your VM (rbd sketch below)
● CephFS: a shared network file
system, mounted like NFS/AFS/…
● S3: HTTP-like object store,
GET/PUT, AWS compatible.
● Integrations: OpenStack (Volumes,
Shares, Object), Kubernetes (PVCs,
Rook), …
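As a sketch of what the RBD layer looks like from code, the following uses the librbd Python bindings; the pool name "rbd" and the image name are assumptions, and the pool must already exist and be initialized for RBD use.

```python
import rados
import rbd

# Sketch only: create a 1 GiB RBD image and do a small write/read through
# librbd's Python bindings. Pool and image names are placeholders.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

rbd.RBD().create(ioctx, 'demo-img', 1 * 1024**3)   # thin-provisioned, 1 GiB

with rbd.Image(ioctx, 'demo-img') as image:
    image.write(b'hello RBD', 0)                   # write 9 bytes at offset 0
    print(image.read(0, 9))                        # read them back

ioctx.close()
cluster.shutdown()
```

In practice the same image would normally be attached to a VM (e.g. via libvirt/QEMU) or mapped with the kernel client rather than accessed from Python.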
Ceph Components
● OSDs (disks/NVMes)
○ 4-8GB RAM per device
○ BlueStore+RocksDB on-disk format
● MON/MGR
○ Central cluster maps, not in IO path
○ Smallish servers, reliable via Paxos (status sketch below)
● MDS (CephFS)
○ Scale-out metadata, hot/cold standbys
○ O(100GB) RAM each, single-threaded
● RGW (S3)
○ Scale-out S3-compatible gateways
○ Multi-region support
All built on commodity hardware
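A small sketch of the MON's role: the rados bindings let a client send monitor commands out of band, e.g. to fetch cluster status, which illustrates that the MONs hold the cluster maps without sitting in the IO path. Field names in the returned JSON vary a little between releases.

```python
import json
import rados

# Sketch: ask the MONs for cluster status, equivalent to `ceph status --format json`.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

cmd = json.dumps({'prefix': 'status', 'format': 'json'})
ret, outbuf, outs = cluster.mon_command(cmd, b'')
if ret == 0:
    status = json.loads(outbuf)
    print(status['health']['status'])   # e.g. "HEALTH_OK"
else:
    print('mon_command failed:', ret, outs)
cluster.shutdown()
```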
Ceph Software Releases
https://ceph.io
Reef v18 Highlights
● (Please don’t be underwhelmed – Ceph is stable software)
● RADOS: memory usage fixes, distributed QoS with mClock (config sketch below), custom WAL, 4 kB allocation units for BlueFS, read IO balancer
● RBD: NVMeoF target gateway, persistent writeback cache, rbd-mirror ++
● CephFS: cephfs-top, fscrypt, stability ++
● RGW: rate limiting, SSE-S3, s3select, multisite replication ++
● Dashboard: 1-click OSD create, capacity planning, upgrades, S3 multisite, S3
policy admin
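Since mClock QoS is driven by configuration rather than being fully automatic, here is a hedged sketch of selecting a profile cluster-wide. The option name and profile values are recalled from the Quincy/Reef documentation and should be verified against your release before use.

```python
import json
import rados

# Hedged sketch: the equivalent of `ceph config set osd osd_mclock_profile high_client_ops`.
# Option name and profile values (balanced, high_client_ops, high_recovery_ops) are
# from memory of the upstream docs -- verify on your version before applying.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

cmd = json.dumps({
    'prefix': 'config set',
    'who': 'osd',
    'name': 'osd_mclock_profile',
    'value': 'high_client_ops',
})
ret, outbuf, outs = cluster.mon_command(cmd, b'')
print('rc =', ret, outs)
cluster.shutdown()
```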
Ceph Community
● Ceph Foundation
○ 40 corporate + associate members
○ Supports neutral upstream development, testing, documentation, events, marketing
● Events:
○ Ceph Days 2023 - NYC, SoCal, India, Seoul, Vancouver
○ Cephalocon 2023 - Amsterdam
○ All talks recorded and shared on YouTube
● Securing the Foundation:
○ New tiers to secure the project’s future
○ Plans to invest in more infra, bigger events
● Technical Meetups:
○ Ceph Leadership Team + Component Weekly
○ Ceph Developer Monthly
What I’m working on
My Favourite Bugs
● Bug of the Year 2020: OSDMap LZ4 Corruptions
○ Symptom: Cluster-wide OSD aborts with osdmap CRC errors
○ Recovered the cluster by injecting an older valid osdmap
○ RCA: osdmaps had 4 flipped bits, caused by an LZ4 bug that corrupted non-contiguous inputs in rare cases.
○ Solution: defrag ceph_buffers before compressing, and the OS upgraded its LZ4 library.
● Bug of the Year 2022: OSD PG Log "Dup" Bug
○ Symptom: For several months users reported OSDs consuming hundreds of GBs of RAM, even after restart. Mempool dumps showed huge allocations in the pg_log buffers (see the sketch below).
○ RCA: pg splitting and merging violated the ordering of the duplicate op log, preventing
trimming.
○ Solution: offline trim command for the OSD, and better online pg log management.
(Both of these bugs have since been FIXED.)
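For reference, a sketch of the mempool diagnostic mentioned above, run on the OSD's host against its admin socket; the "osd_pglog" key name is an assumption from memory, so inspect the full dump on your own version if it differs.

```python
import json
import subprocess

# Sketch: dump an OSD's mempools via `ceph daemon osd.<id> dump_mempools`
# and report the pg_log allocation (must run on the OSD host).
def pglog_gib(osd_id: int) -> float:
    out = subprocess.check_output(
        ['ceph', 'daemon', f'osd.{osd_id}', 'dump_mempools'])
    by_pool = json.loads(out)['mempool']['by_pool']
    return by_pool['osd_pglog']['bytes'] / 2**30   # key name assumed

print(f'osd.0 pg_log mempool: {pglog_gib(0):.1f} GiB')
```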
My Favourite Plot
Modern devices have a “media cache” which has a huge impact on BlueStore performance.
Read the ceph.com Hardware Recommendations on disabling device writeback caches (a sketch follows).
FIXED
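As a concrete illustration of that recommendation, here is a hedged sketch that disables the volatile write cache on a set of SATA drives with hdparm; the device names are placeholders, SAS drives usually need `sdparm --set WCE=0` instead, and the setting is not persistent across power cycles (production setups add a udev rule or service).

```python
import subprocess

# Sketch only: turn off the drive write cache before deploying OSDs,
# in the spirit of the upstream hardware recommendations.
DEVICES = ['/dev/sda', '/dev/sdb']   # placeholders

for dev in DEVICES:
    subprocess.run(['hdparm', '-W', '0', dev], check=True)  # disable write caching
    subprocess.run(['hdparm', '-W', dev], check=True)        # print the setting back
```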
My 2nd Favourite Plot
Potential 4x speed-up of the IO path after workload analysis here at UVic!
WIP
Comparing Use-Cases
● CERN uses Ceph to back its cloud infrastructure: 100PB of block, S3, FS.
● In my new role I’m exposed to much more Ceph in very different envs:
○ Ranging from tens of TB to multiple exabytes; from a cluster in a closet to 100s of clusters globally.
○ “Microsoft/VMware is too expensive.” Moving to Proxmox+Ceph.
○ “Data is our product – We need full ownership of the platform.”
○ “Ceph backs the things that make us money – if it’s down we’ll lose $$$ per minute”
○ “Xyz is too expensive, we’re locked in → FOSS Ceph is the best alternative we found”
● Lots and lots of successful uses out there – around 5 exabytes across
thousands of clusters.
● But common themes – pain points – are emerging:
○ Ceph performance is not obvious – selecting hardware, NVMe, Crimson, multi-MDS, …
○ Ceph is still too difficult to understand and operate. #AI-Ops to the rescue?
#AI-Ops ??
ChatGPT’s recommendations vary
between useless and very dangerous.
Warning: do not use in production!
Ceph Cluster Analyzer
● I want to build tools that
help people run Ceph.
● Step 1: a website which will grade your Ceph cluster (a data-collection sketch follows this list).
● Try it now:
○ https://analyzer.clyso.com
● Coming soon™
○ Clyso Enterprise Storage
○ Ceph Copilot
○ Chorus Multisite S3
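To give a sense of the input such a tool needs, here is a sketch that collects the cluster-wide `ceph report` JSON with the rados bindings; whether analyzer.clyso.com ingests exactly this format is an assumption, so treat it only as a way to gather the raw data.

```python
import json
import rados

# Sketch: dump the cluster's self-description (`ceph report`) to a file
# that a grading tool could ingest.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ret, outbuf, outs = cluster.mon_command(json.dumps({'prefix': 'report'}), b'')
if ret == 0:
    with open('cluster-report.json', 'wb') as f:
        f.write(outbuf)
    print(f'wrote {len(outbuf)} bytes to cluster-report.json')
else:
    print('report failed:', ret, outs)
cluster.shutdown()
```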
Thank you
dan.vanderster@clyso.com
