What you need to know about Ceph
Gluster Community Day, 20 May 2014
Haruka Iwao
Introduction to Ceph, an open-source, massively scalable distributed file system. This document explains the architecture of Ceph and its integration with OpenStack.

  1. What you need to know about Ceph. Gluster Community Day, 20 May 2014. Haruka Iwao
  2. Index: What is Ceph? / Ceph architecture / Ceph and OpenStack / Wrap-up
  3. What is Ceph?
  4. Ceph: The name "Ceph" is a common nickname given to pet octopuses, short for cephalopod.
  5. Cephalopod?
  6. Ceph is... an open-source, massively scalable, software-defined object storage and file system.
  7. History of Ceph: 2003, project born at UCSC; 2006, open sourced and papers published; 2012, Inktank founded and "Argonaut" released.
  8. In April 2014
  9. Yesterday: Red Hat acquires me. I joined Red Hat as an architect of storage systems. This is just a coincidence.
  10. Ceph releases: a major release every 3 months. Argonaut, Bobtail, Cuttlefish, Dumpling, Emperor, Firefly, Giant (coming in July).
  11. Ceph architecture
  12. Ceph at a glance
  13. Layers in Ceph: RADOS is to Ceph FS what /dev/sda is to ext4 (RADOS = /dev/sda, Ceph FS = ext4).
  14. RADOS: Reliable (replicated to avoid data loss), Autonomic (OSDs communicate with each other to detect failures; replication is done transparently), Distributed Object Store.
  15. RADOS (2), the fundamentals of Ceph: everything is stored in RADOS, including Ceph FS metadata. Two components: mon and osd. The CRUSH algorithm.
  16. OSD: object storage daemon. One OSD per disk. Uses xfs/btrfs as the backend (btrfs is experimental!). Write-ahead journal for integrity and performance. 3 to 10000s of OSDs in a cluster.
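      The journal and backend file system are set per OSD in ceph.conf; a minimal sketch with illustrative values rather than recommendations:
        [osd]
        osd journal size = 10240                  # write-ahead journal size in MB
        osd mkfs type = xfs                       # file system created on new OSD disks
        osd mount options xfs = noatime,inode64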
  17. OSD (2): diagram of the OSD stack; each OSD sits on its own disk with a local file system (btrfs, xfs, or ext4) in between.
  18. MON: monitoring daemon. Maintains the cluster map and state. Run a small, odd number of them.
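      Three monitors is a common starting point; an odd number lets the monitors keep a majority quorum through a failure. A ceph.conf sketch with hypothetical names and addresses:
        [global]
        mon initial members = a, b, c
        mon host = 192.168.0.1, 192.168.0.2, 192.168.0.3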
  19. Locating objects: RADOS uses an algorithm called CRUSH to locate objects. The location is decided through pure calculation: no central metadata server, no SPoF, massive scalability.
  20. CRUSH: 1. assign a placement group, pg = Hash(object name) % num_pg; 2. CRUSH(pg, cluster map, rule).
  21. Cluster map: a hierarchical OSD map, used for replicating across failure domains and avoiding network congestion.
  22. Object locations are computed. Name: abc, Pool: test. Hash("abc") % 256 = 0x23, pool "test" = 3, so the placement group is 3.23.
  23. PG to OSD: CRUSH(PG 3.23, Cluster Map, Rule) → osd.1, osd.5, osd.9.
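      A running cluster will show the same calculation on demand: ceph osd map takes a pool and an object name and prints the placement group and OSD set without touching any data. The epoch and the output layout below are approximate:
        $ ceph osd map test abc
        osdmap e42 pool 'test' (3) object 'abc' -> pg 3.23 -> up [1,5,9] acting [1,5,9]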
  24. Synchronous replication: replication is synchronous to maintain strong consistency.
  25. When an OSD fails: the OSD is marked "down"; 5 minutes later it is marked "out" and the cluster map is updated. CRUSH(PG 3.23, Cluster Map #1, Rule) → osd.1, osd.5, osd.9; CRUSH(PG 3.23, Cluster Map #2, Rule) → osd.1, osd.3, osd.9.
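      The 5-minute grace period is governed by a monitor option; a ceph.conf sketch (the value is in seconds, and the default varies between releases, so treat this as illustrative):
        [mon]
        mon osd down out interval = 300     # seconds before a down OSD is marked out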
  26. Wrap-up: CRUSH. Object name + cluster map → object locations. Deterministic, no metadata at all, calculation done on clients, and the cluster map reflects the network hierarchy.
  27. RADOSGW, shown in the Ceph component stack: RADOS, a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes; LIBRADOS, a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP; RBD, a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver; CEPH FS, a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE; RADOSGW, a bucket-based REST gateway, compatible with S3 and Swift.
  28. RADOSGW: an S3- and Swift-compatible gateway to RADOS.
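      For example, once a Swift subuser and key have been created on the gateway, the standard swift CLI can talk to RADOSGW; the endpoint, user, and key below are hypothetical:
        swift -V 1.0 -A http://radosgw.example.com/auth -U testuser:swift -K 'SWIFT_SECRET' post mybucket
        swift -V 1.0 -A http://radosgw.example.com/auth -U testuser:swift -K 'SWIFT_SECRET' upload mybucket ./file.txt
        swift -V 1.0 -A http://radosgw.example.com/auth -U testuser:swift -K 'SWIFT_SECRET' list mybucket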
  29. RBD, shown in the same Ceph component stack as slide 27.
  30. RBD: RADOS Block Devices.
  31. RBD is directly mountable: rbd map foo --pool rbd; mkfs -t ext4 /dev/rbd/rbd/foo. OpenStack integration with Cinder & Glance, explained later.
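      The full sequence from image to mounted file system, as a sketch assuming the default pool name rbd and an image called foo:
        rbd create foo --size 4096 --pool rbd      # 4 GB image (size is given in MB)
        rbd map foo --pool rbd                     # appears as /dev/rbd/rbd/foo
        mkfs -t ext4 /dev/rbd/rbd/foo
        mount /dev/rbd/rbd/foo /mnt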
  32. Ceph FS, shown in the same Ceph component stack as slide 27.
  33. Ceph FS: a POSIX-compliant file system built on top of RADOS. Can be mounted with the native Linux kernel driver (cephfs) or FUSE. Metadata servers (mds) manage the metadata of the file system tree.
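      Both mount paths, as a sketch; the monitor address is hypothetical and the cephx secret is elided:
        # native kernel client
        mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secret=AQAT...
        # FUSE client
        ceph-fuse -m 192.168.0.1:6789 /mnt/cephfs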
  34. Ceph FS is reliable: the MDS writes its journal to RADOS so that metadata is not lost when an MDS fails. Multiple MDSes can run for HA and load balancing.
  35. Ceph FS and OSD (diagram): the MDS holds POSIX metadata (directory, time, owner, etc.) in memory and writes its metadata journal to the OSDs; data I/O goes directly to the OSDs.
  36. DYNAMIC SUBTREE PARTITIONING
  37. Ceph FS is experimental
  38. Other features: rolling upgrades, erasure coding, cache tiering, key-value OSD backend, separate backend network.
  39. Rolling upgrades: no interruption to the service when upgrading. Stop/start daemons one by one: mon → osd → mds → radosgw.
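      What the stop/start step looks like depends on the distribution's init system; a sysvinit-style sketch after upgrading the packages on a node (the daemon IDs and the radosgw service name are assumptions, adjust to your setup):
        sudo service ceph restart mon.a
        sudo service ceph restart osd.0
        sudo service ceph restart mds.a
        sudo service radosgw restart        # service name may be ceph-radosgw on some distros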
  40. Erasure coding: use erasure coding instead of parity for data durability; suitable for rarely modified or accessed objects. Compared with replication: space overhead to survive 2 failures is approx. 40% vs. 200%, CPU usage is high vs. low, latency is high vs. low.
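      Creating an erasure-coded pool in Firefly is a two-step sketch: define a profile with k data chunks and m coding chunks (surviving m failures), then create a pool with it. The profile name, chunk counts, and PG counts here are illustrative:
        ceph osd erasure-code-profile set myprofile k=4 m=2
        ceph osd pool create ecpool 128 128 erasure myprofile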
  41. Cache tiering: a cache tier (e.g. SSD) in front of a base tier (e.g. HDD, possibly erasure coded), transparent to librados clients. Reads and writes hit the cache tier; objects are fetched from the base tier on a miss and flushed back to the base tier.
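      Wiring this up uses the osd tier commands; the pool names are hypothetical and writeback mode is assumed:
        ceph osd tier add coldpool hotpool
        ceph osd tier cache-mode hotpool writeback
        ceph osd tier set-overlay coldpool hotpool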
  42. Key-value OSD backend: use LevelDB as the OSD backend (instead of xfs). Better performance, especially for small objects. Plans to support RocksDB, NVMKV, etc.
  43. Separate backend network (diagram): clients reach the OSDs over a frontend network for service, while OSDs replicate over a separate backend network (1. write, 2. replicate).
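      Both networks are declared in ceph.conf; the subnets below are hypothetical:
        [global]
        public network  = 10.0.1.0/24     # frontend: clients and monitors
        cluster network = 10.0.2.0/24     # backend: replication and recovery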
  44. OpenStack Integration
  45. OpenStack with Ceph
  46. RADOSGW and Keystone (diagram): clients access the RESTful object store (RADOSGW) with a Keystone token; RADOSGW queries the Keystone server to validate it, and access is granted or revoked centrally in Keystone.
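      On the gateway side the integration is a handful of rgw options in ceph.conf; the section name follows the usual radosgw convention, and the URL and token are placeholders:
        [client.radosgw.gateway]
        rgw keystone url = http://keystone.example.com:35357
        rgw keystone admin token = ADMIN_TOKEN
        rgw keystone accepted roles = Member, admin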
  47. Glance integration (diagram: the Glance server stores and downloads images via RBD). In /etc/glance/glance-api.conf: default_store=rbd, rbd_store_user=glance, rbd_store_pool=images. You need just 3 lines!
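      The glance user named in that config also needs a cephx key with access to the images pool; a hedged sketch of creating one (the capability string is kept minimal here, the Ceph docs recommend slightly broader caps), with the resulting key placed in the keyring on the Glance host:
        ceph auth get-or-create client.glance mon 'allow r' osd 'allow rwx pool=images'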
  48. Cinder/Nova integration (diagram): the Cinder server manages volumes in RBD; on the compute node, nova-compute and qemu access them through librbd, so a VM can boot from a volume that is a copy-on-write clone of an image.
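      On the Cinder side this is a few lines in cinder.conf; the pool and user names are the conventional ones, and the secret UUID refers to a libvirt secret created separately to hold the cinder cephx key:
        volume_driver = cinder.volume.drivers.rbd.RBDDriver
        rbd_pool = volumes
        rbd_user = cinder
        rbd_secret_uuid = <uuid of the libvirt secret>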
  49. Benefits of using Ceph with OpenStack: unified storage for both images and volumes; copy-on-write cloning and snapshot support; native qemu/KVM support for better performance.
  50. Wrap-up
  51. Ceph is: massively scalable storage; a unified architecture for object / block / POSIX FS; OpenStack integration that is ready to use & awesome.
  52. Ceph and GlusterFS compared.
      Distribution: Ceph is object based; GlusterFS is file based.
      File location: Ceph uses a deterministic algorithm (CRUSH); GlusterFS uses a distributed hash table, stored in xattrs.
      Replication: Ceph replicates on the server side; GlusterFS on the client side.
      Primary usage: Ceph for object / block storage; GlusterFS as a POSIX-like file system.
      Challenge: Ceph's POSIX file system needs improvement; GlusterFS's object / block storage needs improvement.
  53. Further readings.
      Ceph documentation: https://ceph.com/docs/master/ (well documented)
      Sébastien Han's blog: http://www.sebastien-han.fr/blog/ (an awesome blog)
      CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data: http://ceph.com/papers/weil-crush-sc06.pdf (the CRUSH algorithm paper)
      Ceph: A Scalable, High-Performance Distributed File System: http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf (the Ceph paper)
      Index of notes on Ceph (in Japanese): http://www.nminoru.jp/~nminoru/unix/ceph/ (a well-written introduction in Japanese)
  54. One more thing
  55. Calamari will be open sourced: "Calamari, the monitoring and diagnostics tool that Inktank has developed as part of the Inktank Ceph Enterprise product, will soon be open sourced." http://ceph.com/community/red-hat-to-acquire-inktank/
  56. Calamari screens
  57. Thank you!

    Be the first to comment

    Login to see the comments

  • vonfactory

    Aug. 27, 2014
  • ssuser1b5b33

    Dec. 5, 2014
  • SeanPark1

    Dec. 5, 2014
  • miyagoshi

    Apr. 11, 2015
  • tacy_lee

    Apr. 22, 2015
  • yasufumic

    Apr. 29, 2015
  • minkyoungkim3517

    Jun. 30, 2015
  • mildrain

    Nov. 27, 2015
  • hangphuoc

    Jan. 16, 2016
  • 4nNtt

    Feb. 27, 2016
  • newthinker

    Jun. 10, 2016
  • udomsakc

    Jun. 29, 2016
  • ChengweiYang1

    Aug. 23, 2016
  • yasuhidesakai

    Oct. 15, 2016
  • akshaysaini0007

    Dec. 3, 2016
  • goonshin

    Feb. 23, 2018
  • prashanthrao4

    Apr. 14, 2018
  • leepro

    Sep. 25, 2018
  • KarthikV32

    Dec. 31, 2019
  • AmalRaj76

    Feb. 18, 2020

Introduction to Ceph, an open-source, massively scalable distributed file system. This document explains the architecture of Ceph and integration with OpenStack.

Views

Total views

11,092

On Slideshare

0

From embeds

0

Number of embeds

124

Actions

Downloads

786

Shares

0

Comments

0

Likes

36

×