Ceph Intro and Architectural Overview by Ross Turk

  1. Ceph Intro & Architectural Overview
     Ross Turk, VP Community, Inktank

  2. ME ME ME ME ME ME. I made a slide today. It's all about me.
     Ross Turk, VP Community, Inktank
     ross@inktank.com | @rossturk | inktank.com | ceph.com

  3. CLOUD SERVICES: COMPUTE, NETWORK, STORAGE
     the future of storage™

  4. (diagram) HUMAN + COMPUTER + TAPE; HUMAN + ROCK; HUMAN + INK + PAPER

  5. (diagram) HUMAN + COMPUTER + TAPE

  6. (diagram) YOU, TECHNOLOGY, YOUR DATA

  7. How Much Store Things All Human History?!
     (chart labels: writing, paper, computers, distributed storage, cloud computing, gaaaaaaaaahhhh!!!!!!, carving)

  8. (diagram) Three HUMANs, one COMPUTER, seven DISKs

  9. (diagram) Many HUMANs, one COMPUTER, many DISKs

 10. (diagram) Many HUMANs, many DISKs, one GIANT SPENDY COMPUTER

 11. (diagram) Three HUMANs and many DISK + COMPUTER pairs

 12. (diagram) Three HUMANs and many DISK + COMPUTER pairs

 13. (diagram) DISK + COMPUTER pairs grouped into a "STORAGE APPLIANCE"

 14. Storage Appliance
     (photo: Michael Moll, Wikipedia / CC BY-SA 2.0)

 15. (diagram) The appliance stack: SUPPORT AND MAINTENANCE, PROPRIETARY SOFTWARE, PROPRIETARY HARDWARE (DISK + COMPUTER ×4)
     • 34% of 2012 revenue (5.2 billion dollars)
     • 1.1 billion in R&D spent in 2012
     • 1.6 million square feet of manufacturing space

 16. THE CLOUD (backdrop of binary digits)

 17. (diagram) The proprietary stack: SUPPORT AND MAINTENANCE, PROPRIETARY SOFTWARE, PROPRIETARY HARDWARE (DISK + COMPUTER ×4)
     versus the open stack: ENTERPRISE SUBSCRIPTION (optional), OPEN SOURCE SOFTWARE, STANDARD HARDWARE (DISK + COMPUTER ×4)

 18. (image only)

 19. Philosophy: OPEN SOURCE, COMMUNITY-FOCUSED
     Design: SCALABLE, NO SINGLE POINT OF FAILURE, SOFTWARE BASED, SELF-MANAGING

 20. 8 years & 20,000 commits later…

 21. CEPH OBJECT GATEWAY: a powerful S3- and Swift-compatible gateway that brings the power of the Ceph Object Store to modern applications (OBJECTS)
     CEPH BLOCK DEVICE: a distributed virtual block device that delivers high-performance, cost-effective storage for virtual machines and legacy applications (VIRTUAL DISKS)
     CEPH FILESYSTEM: a distributed, scale-out filesystem with POSIX semantics that provides storage for legacy and modern applications (FILES & DIRECTORIES)
     CEPH STORAGE CLUSTER: a reliable, easy to manage, next-generation distributed object store that provides storage of unstructured data for applications

 22. The RADOS stack:
     • RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
     • LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
     • RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
     • CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
     • RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
     (consumed by APPs, HOSTs/VMs, and CLIENTs)

 23. (The stack diagram again; the next slides zoom in on RADOS itself.)

 24. (diagram) Each OSD sits on a filesystem (btrfs, xfs, or ext4) on a DISK; alongside them, the monitors (M M M).

 25. (diagram) Three monitors (M M M) and a HUMAN.

 26. Monitors:
     • Maintain cluster membership and state
     • Provide consensus for distributed decision-making
     • Small, odd number
     • These do not serve stored objects to clients
     OSDs:
     • 10s to 10000s in a cluster
     • One per disk (or one per SSD, RAID group…)
     • Serve stored objects to clients
     • Intelligently peer to perform replication and recovery tasks
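A client never reads or writes object data through the monitors; it only contacts them to join the cluster and fetch the current cluster map. A minimal sketch of that handshake using the python-rados bindings (the conffile path assumes a standard client setup):

```python
import rados

# Connecting contacts a monitor listed in ceph.conf and fetches the
# current cluster map; no object data flows through the monitors.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

print("cluster fsid:", cluster.get_fsid())
print("usage:", cluster.get_cluster_stats())   # kb, kb_used, kb_avail, num_objects
print("pools:", cluster.list_pools())

cluster.shutdown()
```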
 27. (The stack diagram again; next up, LIBRADOS.)

 28. (diagram) An APP links LIBRADOS and talks to the cluster (M M M) over a native socket.

 29. LIBRADOS:
     • Provides direct access to RADOS for applications
     • C, C++, Python, PHP, Java, Erlang
     • Direct access to storage nodes
     • No HTTP overhead
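As a rough illustration of what "direct access" looks like from an application, here is a hedged sketch using the python-rados bindings; the pool name 'data' is an assumption, so substitute any pool that exists in your cluster:

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('data')                # 'data' is an assumed pool name
ioctx.write_full('greeting', b'Hello, RADOS!')    # store an object, no HTTP in the path
print(ioctx.read('greeting'))                     # read it straight back
ioctx.set_xattr('greeting', 'lang', b'en')        # objects can carry attributes too
ioctx.close()

cluster.shutdown()
```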
 30. (The stack diagram again; next up, RADOSGW.)

 31. (diagram) An APP speaks REST to RADOSGW, which uses LIBRADOS over a native socket to reach the cluster (M M M).

 32. RADOS Gateway:
     • REST-based object storage proxy
     • Uses RADOS to store objects
     • API supports buckets, accounts
     • Usage accounting for billing
     • Compatible with S3 and Swift applications
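Because the gateway speaks the S3 dialect, ordinary S3 client libraries can talk to it unchanged. A hedged sketch with the Python boto library; the host name and credentials are placeholders (real keys come from your gateway's user administration):

```python
import boto
import boto.s3.connection

# Placeholder endpoint and credentials for an assumed RADOS Gateway instance
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('demo-bucket')              # bucket lands in RADOS
key = bucket.new_key('hello.txt')
key.set_contents_from_string('Hello from the RADOS Gateway')

for b in conn.get_all_buckets():
    print(b.name, b.creation_date)
```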
 33. (The stack diagram again; next up, RBD.)

 34. (diagram) A VM in a VIRTUALIZATION CONTAINER; the container links LIBRBD on top of LIBRADOS to reach the cluster (M M M).

 35. (diagram) Two VIRTUALIZATION CONTAINERs, each with LIBRBD and LIBRADOS; the VM can move between them because its image lives in the cluster.

 36. (diagram) A HOST using KRBD (kernel module) to reach the cluster (M M M).

 37. RADOS Block Device:
     • Storage of disk images in RADOS
     • Decouples VMs from host
     • Images are striped across the cluster (pool)
     • Snapshots
     • Copy-on-write clones
     • Support in:
       • Mainline Linux kernel (2.6.39+)
       • QEMU/KVM; native Xen coming soon
       • OpenStack, CloudStack, Nebula, Proxmox
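For a sense of how an image is created and used programmatically, here is a hedged sketch with the python-rbd bindings that ship with Ceph; the 'rbd' pool and the image name are assumptions:

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                  # assumed pool for block images

rbd.RBD().create(ioctx, 'vm-image', 10 * 1024**3)  # a 10 GiB image, striped over the pool

image = rbd.Image(ioctx, 'vm-image')
image.write(b'\0' * 512, 0)                        # write the first 512 bytes
print('image size:', image.size())
image.close()

ioctx.close()
cluster.shutdown()
```

A host or hypervisor would then attach the same image through the kernel client or the QEMU/KVM driver mentioned above.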
 38. (The stack diagram again; next up, CEPH FS.)

 39. (diagram) A CLIENT with separate data (0110) and metadata paths into the cluster (M M M).

 40. Metadata Server:
     • Manages metadata for a POSIX-compliant shared filesystem
     • Directory hierarchy
     • File metadata (owner, timestamps, mode, etc.)
     • Stores metadata in RADOS
     • Does not serve file data to clients
     • Only required for shared filesystem

 41. What Makes Ceph Unique? Part one: CRUSH

 42. (diagram) An APP facing twelve DC (disk + computer) nodes: where should the data go? (??)

 43. How Long Did It Take You To Find Your Keys This Morning?
     (photo: azmeen, Flickr / CC BY 2.0)

 44. (diagram) The APP and the twelve DC nodes again.

 45. Dear Diary: Today I Put My Keys on the Kitchen Counter
     (photo: Barnaby, Flickr / CC BY 2.0)

 46. (diagram) The APP and the DC nodes, now partitioned by name ranges (A-G, H-N, O-T, U-Z); the APP looks for "F*".

 47. I Always Put My Keys on the Hook By the Door
     (photo: vitamindave, Flickr / CC BY 2.0)

 48. HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?

 49. The Answer: CRUSH!!!!!
     (photo: pasukaru76, Flickr / CC SA 2.0)

 50. (diagram) An object's name and data; hash(object name) % num pg selects its placement group; CRUSH(pg, cluster state, rule set) maps that placement group to OSDs.

 51. (diagram) The object's data again.

 52. CRUSH:
     • Pseudo-random placement algorithm
     • Fast calculation, no lookup
     • Repeatable, deterministic
     • Statistically uniform distribution
     • Stable mapping
     • Limited data migration on change
     • Rule-based configuration
     • Infrastructure topology aware
     • Adjustable replication
     • Weighting

 53. (diagram) A CLIENT: where does an object live? (??)

 54. Placement example: an OBJECT with NAME "foo" in POOL "bar". hash("foo") % 256 = 0x23 and pool "bar" = 3, so the PLACEMENT GROUP is 3.23. CRUSH then maps placement group 3.23 to TARGET OSDs 24, 3, and 12.
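The real CRUSH algorithm walks a weighted hierarchy described by the CRUSH map; the toy sketch below is not CRUSH, it only illustrates the two-step idea from the slide: a cheap hash picks the placement group, and a deterministic pseudo-random function (seeded by the PG, applied to the current OSD list) picks the target OSDs, so every client computes the same answer with no lookup table.

```python
import hashlib
import random

def object_to_pg(pool_id, obj_name, pg_num):
    """Step 1: hash the object name into one of the pool's placement groups."""
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return (pool_id, h % pg_num)          # e.g. pool "bar" = 3, 0x23 -> PG 3.23

def toy_crush(pg, osds, replicas=3):
    """Step 2 (illustration only, NOT real CRUSH): deterministically
    pick `replicas` distinct OSDs for this placement group."""
    rng = random.Random(hash(pg))         # same PG and same OSD set -> same result
    return rng.sample(sorted(osds), replicas)

cluster_osds = list(range(48))            # pretend cluster state: 48 OSDs
pg = object_to_pg(3, "foo", 256)
print(pg, "->", toy_crush(pg, cluster_osds))
```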
 55. (diagram only)

 56. (diagram only)

 57. (diagram) A CLIENT: where does an object live? (??)

 58. What Makes Ceph Unique? Part two: thin provisioning

 59. (diagram) A VM in a VIRTUALIZATION CONTAINER with LIBRBD and LIBRADOS, backed by the cluster (M M M).

 60. HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?

 61. (diagram) An image occupying 144 units; an instant copy of it holds nothing of its own (0 0 0 0); total space used = 144.

 62. (diagram) A CLIENT writes to the copy; only the changed blocks (4 units) consume new space; total space used = 148.

 63. (diagram) The CLIENT reads from the copy; unchanged blocks come from the original image; total space used stays at 148.
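With the python-rbd bindings this copy-on-write flow looks roughly like the sketch below: snapshot a golden image, protect the snapshot, and clone it; the clone is usable immediately and only stores the blocks later written to it. The pool and image names are made up for the example.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                    # assumed pool

# 'golden-image' is assumed to already exist as a format-2 image with layering
golden = rbd.Image(ioctx, 'golden-image')
golden.create_snap('base')                           # snapshot the master image
golden.protect_snap('base')                          # clones require a protected snap
golden.close()

# "Instant copy": the clone references the parent and stores no data of its own yet
rbd.RBD().clone(ioctx, 'golden-image', 'base', ioctx, 'vm-0001')

vm_disk = rbd.Image(ioctx, 'vm-0001')
vm_disk.write(b'only this write consumes new space', 0)
vm_disk.close()

ioctx.close()
cluster.shutdown()
```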
 64. What Makes Ceph Unique? Part three: clustered metadata

 65. POSIX Filesystem Metadata
     (photo: Barnaby, Flickr / CC BY 2.0)

 66. (diagram) A CLIENT and M M M, exchanging data (0110).

 67. (diagram) Three metadata servers (M M M).

 68. One tree, three metadata servers (??)

 69. (diagram only)

 70. (diagram only)

 71. (diagram only)

 72. (diagram only)

 73. DYNAMIC SUBTREE PARTITIONING

 74. Getting Started With Ceph
     Read about the latest version of Ceph.
     • The latest stuff is always at http://ceph.com/get
     Deploy a test cluster using ceph-deploy.
     • Read the quick-start guide at http://ceph.com/qsg
     Deploy a test cluster on the AWS free-tier using Juju.
     • Read the guide at http://ceph.com/juju
     Read the rest of the docs!
     • Find docs for the latest release at http://ceph.com/docs
     Have a working cluster up quickly.

 75. Getting Involved With Ceph
     Most project discussion happens on the mailing list.
     • Join or view archives at http://ceph.com/list
     IRC is a great place to get help (or help others!)
     • Find details and historical logs at http://ceph.com/irc
     The tracker manages our bugs and feature requests.
     • Register and start looking around at http://ceph.com/tracker
     Doc updates and suggestions are always welcome.
     • Learn how to contribute docs at http://ceph.com/docwriting
     Help build the best storage system around!

 76. Ceph Cuttlefish (v0.61.x)
     1. New ceph-deploy provisioning tool
     2. New Chef cookbooks
     3. Fully-tested packages for RHEL (in EPEL)
     4. RGW authentication management API
     5. RADOS pool quotas
     6. New ceph df
     7. RBD incremental snapshots
     Best Ceph ever.

 77. Questions?
     Ross Turk, VP Community, Inktank
     ross@inktank.com | @rossturk | inktank.com | ceph.com
