Understanding BlueStore, Ceph's new storage backend - Tim Serong, SUSE

Audience Level: Intermediate

Synopsis
Ceph – the most popular storage solution for OpenStack – stores all data as a collection of objects. This object store was originally implemented on top of a POSIX filesystem, an approach that turned out to have a number of problems, notably with performance and complexity.

BlueStore, a new storage backend for Ceph, was created to solve these issues; the Ceph Jewel release included an early prototype. The code and on-disk format were declared stable (but experimental) for Ceph Kraken, and now in the upcoming Ceph Luminous release, BlueStore will be the recommended default storage backend.

With a 2-3x performance boost, you’ll want to look at migrating your Ceph clusters to BlueStore. This talk goes into detail about what BlueStore does, the problems it solves, and what you need to do to use it.

Speaker Bio:
Tim works for SUSE, hacking on Ceph and related technologies. He has spoken often about distributed storage and high availability at conferences such as linux.conf.au. In his spare time he wrangles pigs, chickens, sheep and ducks, and was declared by one colleague “teammate most likely to survive the zombie apocalypse”.

Understanding BlueStore, Ceph's new storage backend - Tim Serong, SUSE

  1. Understanding BlueStore: Ceph’s New Storage Backend
     Tim Serong, Senior Clustering Engineer, SUSE (tserong@suse.com)
  2. Introduction to Ceph (the short, short version)
  3. Ceph Provides
     ● A resilient, scale-out storage cluster
     ● On commodity hardware
     ● No bottlenecks
     ● No single points of failure
     ● Three interfaces
       ● Object (radosgw)
       ● Block (rbd)
       ● Distributed Filesystem (cephfs)
  4. Internally, Regardless of Interface
     ● All data stored as “objects”
     ● Aggregated into Placement Groups
     ● Objects have:
       ● Data (byte stream)
       ● Attributes (key/value pairs)
     ● Objects live on Object Storage Devices
       ● Managed by Ceph Object Storage Daemons
       ● An OSD is a disk, in a server
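For a concrete feel for “data plus attributes”, here is a minimal sketch using the rados Python binding. It assumes a reachable cluster, a readable /etc/ceph/ceph.conf, and an existing pool named "mypool"; the pool, object and attribute names are placeholders.

    # Minimal sketch using the "rados" Python binding (python-rados).
    # Pool/object/attribute names below are placeholders.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('mypool')
        try:
            # An object is a named byte stream...
            ioctx.write_full('object1', b'hello from librados')
            # ...plus key/value attributes stored alongside it.
            ioctx.set_xattr('object1', 'owner', b'tim')
            print(ioctx.read('object1'))
            print(ioctx.get_xattr('object1', 'owner'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()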
  5. More Internally
     (architecture diagram: three monitors (M) and several OSDs, each an OSD daemon on top of a filesystem on a disk)
  6. Even More Internally
     /var/lib/ceph/osd/ceph-0/
       current/
         0.1_head/
           object1
           object12
         0.7_head/
           object3
           object5
         0.a_head/
           object4
           object6
       meta/   # OSD maps
       omap/   # leveldb files
     ● FileStore
     ● XFS (previously ext4, btrfs)
     ● Placement Group → directory
     ● Object → file
     ● LevelDB for key/value data
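The PG-to-directory and object-to-file mapping can be pictured with a deliberately simplified sketch. This is illustrative only: the real FileStore adds hashed subdirectories (see the enumeration slide below) and escapes object names.

    # Illustrative only: a highly simplified view of FileStore's layout.
    import os

    OSD_ROOT = '/var/lib/ceph/osd/ceph-0'

    def filestore_object_path(pg_id: str, object_name: str) -> str:
        """Each placement group is a directory; each object is a file in it."""
        return os.path.join(OSD_ROOT, 'current', f'{pg_id}_head', object_name)

    print(filestore_object_path('0.1', 'object1'))
    # -> /var/lib/ceph/osd/ceph-0/current/0.1_head/object1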
  7. Problems with FileStore
  8. Double Writes
     ● No atomic write/update on POSIX filesystems
     ● Solution: Ceph Journal
       ● Write → OSD journal (O_DIRECT)
       ● Write → OSD filesystem (buffered)
       ● Sync filesystem, trim journal
     ● Throughput halved
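The double write can be illustrated with a toy sketch (conceptual only, not Ceph code). Every byte passes through both step 1 and step 2, which is why throughput is roughly halved.

    # Conceptual sketch of FileStore's journaling scheme: every write lands
    # first in the journal, then again in the backing filesystem.
    import os

    class ToyFileStore:
        def __init__(self, journal_path, data_dir):
            self.journal = open(journal_path, 'ab')
            self.data_dir = data_dir

        def write(self, object_name, data: bytes):
            # 1. Append the transaction to the journal and force it to disk
            #    (fsync here stands in for the real O_DIRECT journal writes).
            self.journal.write(object_name.encode() + b'\0' + data)
            self.journal.flush()
            os.fsync(self.journal.fileno())
            # 2. Apply the same data to the filesystem (buffered).
            with open(os.path.join(self.data_dir, object_name), 'wb') as f:
                f.write(data)
            # 3. Periodically: sync the filesystem, then trim the journal
            #    (omitted here; batched in the real FileStore).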
  9. Enumeration Sucks
     ● Objects distributed by 32-bit hash
     ● Enumerated for scrub, rebalance etc.
     ● POSIX readdir() isn’t ordered
     ● Directory tree with hash-value prefix
       ● Split when > ~100 files
       ● Merged when < ~50 files
     ● Complicated, odd performance issues
     ...
     /DIR_1
     /DIR_1/DIR_0/obj-xxxxxx01
     /DIR_1/DIR_4/obj-xxxxxx41
     /DIR_1/DIR_8/obj-xxxxxx81
     /DIR_1/DIR_C/obj-xxxxxxC1
     ...
     /DIR_A/DIR_1/obj-xxxxx1A
     /DIR_A/DIR_5/obj-xxxxx5A
     /DIR_A/DIR_9/obj-xxxxx9A
     /DIR_A/DIR_D/obj-xxxxxDA
     ...
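A rough sketch of the hash-prefix idea, and of why ordered enumeration is painful. The constants and helper names here are illustrative, not FileStore's exact rules.

    # Rough sketch of hash-prefix directories (constants are illustrative).
    # Objects are placed by hex nibbles of their 32-bit hash; a directory is
    # split into 16 children once it holds "too many" files.
    import os

    SPLIT_THRESHOLD = 100   # roughly "> ~100 files" per the slide

    def dir_for_hash(root: str, hash32: int, depth: int) -> str:
        """Nest one DIR_<hexdigit> level per split, using the low hash nibbles."""
        path = root
        for level in range(depth):
            nibble = (hash32 >> (4 * level)) & 0xF
            path = os.path.join(path, f'DIR_{nibble:X}')
        return path

    # readdir() gives no useful ordering, so enumerating objects in hash order
    # means walking the whole tree and sorting -- one reason enumeration hurts.
    def enumerate_in_hash_order(root: str):
        names = []
        for dirpath, _dirs, files in os.walk(root):
            names.extend(files)
        return sorted(names)   # in reality, sorted by the hash encoded in the name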
  10. Other Problems
      ● CPU utilization
      ● Unpredictable filesystem flushing
      ● Still finding bugs/issues with FileStore...
  11. Solutions with BlueStore
  12. BlueStore = Block + NewStore
      ● Raw block devices
      ● RocksDB key/value store for metadata
      ● Data written direct to block device
      ● 2-3x performance boost over FileStore
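The core idea can be sketched in a few lines. This is conceptual only: the real BlueStore is C++ with RocksDB, a proper block allocator, checksums and O_DIRECT I/O; the dict and counter below are stand-ins.

    # Conceptual sketch of the BlueStore idea (not the real implementation):
    # data goes straight to the raw block device, while a key/value store
    # (RocksDB in BlueStore; a dict here) tracks where each object lives.
    import os

    class ToyBlueStore:
        def __init__(self, block_device_path):
            # O_DIRECT and alignment handling are omitted for brevity.
            self.dev = os.open(block_device_path, os.O_RDWR)
            self.kv = {}            # stand-in for RocksDB
            self.next_free = 0      # stand-in for the block allocator

        def write(self, object_name, data: bytes):
            offset = self.next_free
            os.pwrite(self.dev, data, offset)   # single write of the data
            self.next_free += len(data)
            # Metadata (extent map, sizes, checksums...) goes to the KV store.
            self.kv[object_name] = (offset, len(data))

        def read(self, object_name) -> bytes:
            offset, length = self.kv[object_name]
            return os.pread(self.dev, length, offset)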
  13. Internally
      (architecture diagram: BlueStore sends data directly to the raw block device; metadata goes through RocksDB, via BlueRocksEnv and BlueFS, on the same device)
      ● All data and metadata on same block device by default
      ● Optionally can be split, e.g.:
        ● Data on HDD
        ● DB on SSD
        ● WAL on NVRAM
  14. Bonus!
      ● Pluggable block allocator
      ● Pluggable compression
      ● Checksums on reads
      ● RBD and CephFS usable with EC pools
  15. Availability
      ● Early prototype in Ceph Jewel (April 2016)
      ● Stable on-disk format in Ceph Kraken (January 2017)
      ● Recommended default in Ceph Luminous (RSN)
      ● Can co-exist with FileStore
  16. Migration Strategies
  17. Several Approaches
      ● Fail in place
        ● Fail FileStore OSD
        ● Create new BlueStore OSD on same device
      ● Disk-wise replacement
        ● Create new BlueStore OSD on spare disk
        ● Stop FileStore OSD on same host
      ● Host-wise replacement
        ● Provision entire new host with BlueStore OSDs
        ● Swap into old host’s CRUSH position
      ● Evacuate and rebuild in place
        ● Evacuate FileStore OSD, fail when empty
        ● Create new BlueStore OSD on same device
  18. Several Approaches
      ● Fail in place → period of reduced redundancy
        ● Fail FileStore OSD
        ● Create new BlueStore OSD on same device
      ● Disk-wise replacement → reduced online redundancy, requires extra drive slot
        ● Create new BlueStore OSD on spare disk
        ● Stop FileStore OSD on same host
      ● Host-wise replacement → no reduced redundancy, requires spare host
        ● Provision entire new host with BlueStore OSDs
        ● Swap into old host’s CRUSH position
      ● Evacuate and rebuild in place → no reduced redundancy, requires free disk space
        ● Evacuate FileStore OSD, fail when empty
        ● Create new BlueStore OSD on same device
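As one worked example, the "evacuate and rebuild in place" approach roughly follows the outline below, shelling out to the ceph CLI from Python. This is a sketch, not a turnkey script: exact commands and flags vary by Ceph release and deployment tooling, and the OSD id and device path are placeholders.

    # Rough outline of "evacuate and rebuild in place" (placeholders: osd.3, /dev/sdb).
    import subprocess, time

    OSD_ID = 3
    DEVICE = '/dev/sdb'   # the disk backing osd.3 -- adjust for your host

    def sh(cmd):
        print('+', cmd)
        subprocess.run(cmd, shell=True, check=True)

    # 1. Evacuate: mark the OSD "out" and let Ceph migrate its PGs elsewhere.
    sh(f'ceph osd out {OSD_ID}')

    # 2. Wait until the cluster reports the OSD's data is safely elsewhere.
    while subprocess.run(f'ceph osd safe-to-destroy osd.{OSD_ID}', shell=True).returncode != 0:
        time.sleep(60)

    # 3. Remove the old FileStore OSD and recreate it as BlueStore on the
    #    same device (ceph-volume shown; older releases used ceph-disk).
    sh(f'systemctl stop ceph-osd@{OSD_ID}')
    sh(f'ceph osd purge {OSD_ID} --yes-i-really-mean-it')
    sh(f'ceph-volume lvm zap {DEVICE} --destroy')
    sh(f'ceph-volume lvm create --bluestore --data {DEVICE}')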
  19. Questions?
