Build A Cloud Day - Chicago

Published in Technology
Notes
  • People have been trying to capture knowledge for a very long time. I guess the first form of captured knowledge is the cave painting.
  • TODO: change this slide. Man + magnet + tape = magnetic tape. 1000 books on one tape.
  • People learned how to store data on magnetic tape. Many, many, many books could be stored on a single tape.
  • TODO: animate so that they show progressively

Transcript

  • 1. SCALING STORAGE WITH CEPH. Ross Turk, Inktank
  • 2. WHO? Ross Turk, VP Community, Inktank. ross@inktank.com, @rossturk, inktank.com | ceph.com
  • 3. me
  • 4. [Architecture diagram. Clients: APP, APP, HOST/VM, CLIENT.] RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift. RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver. CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE. LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP. RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes.
  • 5. IN THE BEGINNING (image: Magic Madzik, Flickr / CC BY 2.0)
  • 6. EARLY INFORMATION STORAGE (image: Chico.Ferreira, Flickr / CC BY 2.0)
  • 7. WRITING > CAVE PAINTINGS (image: kevingessner, Flickr / CC BY-SA 2.0)
  • 8. ==x1000 x1
  • 9. PEOPLE BEGIN WRITING A LOT (image: Moyan_Brenn, Flickr / CC BY-ND 2.0)
  • 10. WRITING IS TIME-CONSUMING (image: trekkyandy, Flickr / CC BY 2.0)
  • 11. THE INDUSTRIALIZATION OF WRITING (image: FateDenied, Flickr / CC BY 2.0)
  • 12. magnet + tape = magnetic tape == x1000 x1
  • 13. STORAGE BECOMES MECHANICAL (image: Erik Pitti, Wikipedia / CC BY-ND 2.0)
  • 14. [Diagram, three rows: HUMAN, ROCK / HUMAN, INK, PAPER / HUMAN, COMPUTER, TAPE]
  • 15. COMPUTERS NEED PEOPLE TO WORK (image: USDAgov, Flickr / CC BY 2.0)
  • 16. [Diagram: HUMAN, COMPUTER, TAPE]
  • 17. 11101011 10110110 10110101 10101001 00100100 01001001== 10100100 10100101 01011010 01101010 10101010 10101010 01010110 01010011
  • 18. THROUGHPUT BECOMES IMPORTANT (image: Zane Luke, Flickr / CC BY-ND 2.0)
  • 19. LAZ0R B3AMS CHANGE EVERYTHING!! (image: Jeff Kubina, Flickr / CC-BY-SA 2.0)
  • 20. HARD DRIVES ARE TOTALLY BETTER [Chart labels: amazing, spinny, hard drives, sucky, stupid, tape, slow]
  • 21. EVERYTHING GETS MESSY (image: Rob!, Flickr / CC BY 2.0)
  • 22. [Diagram: named slots aa, ab, ac, ba, bb, bc, da, db, dc, each holding a few bits of data]
  • 23. file. owner: rturk; created: aug12; last viewed: aug17; size: 42025; perms: 644. [Data: 11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010]
  • 24. [Diagram: the named slots aa through dc again, with slightly different bits]
  • 25. WE OUTGROW THE HARD DRIVE (image: Mr. T in DC, Flickr / CC BY 2.0)
  • 26. [Diagram: three HUMANs, one COMPUTER, seven DISKs]
  • 27. [Diagram: many HUMANs and DISKs crowded around one (COMPUTER)] (actually more like this…)
  • 28. [Diagram: three HUMANs and twelve COMPUTER+DISK pairs]
  • 29. [Diagram: named slots aa, ab, ac, ba, bb, bc, da, db, dc, each holding a few bits of data]
  • 30. object. pace: quick; driver: frog; license: expired; expression: agog. [Data: 11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010]
  • 31. [Diagram: one APP and twelve COMPUTER+DISK pairs]
  • 32. [Diagram: thirteen COMPUTER+DISK pairs]
  • 33. [Diagram: three VMs and twelve COMPUTER+DISK pairs]
  • 34. STORAGE THROUGHOUT HISTORY [Timeline labels: Ceph; Cloud computing; Distributed storage; Shared storage; Computers; Writing; Painting]. Time-scale: roughly logarithmic. Content: whatever the opposite of “scientific” is.
  • 35. [Diagram: three HUMANs and twelve COMPUTER+DISK pairs]
  • 36. [Diagram: twelve COMPUTER+DISK pairs]
  • 37. [Diagram: twelve C+D pairs (COMPUTER and DISK abbreviated)]
  • 38. [Diagram: three HUMANs and twelve C+D pairs]
  • 39. STORAGE APPLIANCES (image: Michael Moll, Wikipedia / CC BY-SA 2.0)
  • 40. 6.4 MILLION SQ FT OF FACTORIES (image: Dude94111, Flickr / CC BY 2.0)
  • 41. TECHNOLOGY IS A COMMODITY (image: RaeAllen, Flickr / CC-BY 2.0)
  • 42. COMMODITY PRICES FLUCTUATE [Chart; x-axis: May-07, May-08, May-09, May-10, May-11, May-12]
  • 43. Hardware Appliances are Mysterious Black Boxes (image: Abode of Chaos, Flickr / CC BY 2.0)
  • 44. [Diagram: HUMAN [DEVELOPER] !! and twelve C+D pairs]
  • 45. [Diagram: C+D pairs, with extra C and D units being added (+)]
  • 46. [Diagram: C+D pairs, with extra C and D units being added (+)]
  • 47. THE WORLD NEEDS AN OPEN STORAGE TECHNOLOGY THAT SCALES
  • 48. SAGE WEIL: co-founder of DreamHost, inventor of Ceph, CEO of Inktank
  • 49. [Table] philosophy: OPEN SOURCE. design: (empty so far).
  • 50. OPEN SOURCE SPREADS IDEAS (image: orchidgalore, Flickr / CC BY 2.0)
  • 51. [Table] philosophy: OPEN SOURCE, COMMUNITY-FOCUSED. design: (empty so far).
  • 52. WE ARE SMARTER TOGETHER (image: rturk, Linkedin Inmap)
  • 53. CEPH BELONGS TO ALL OF US (image: wackybadger, Flickr / CC BY 2.0)
  • 54. [Table] philosophy: OPEN SOURCE, COMMUNITY-FOCUSED. design: SCALABLE.
  • 55. CEPH IS BUILT TO SCALE [Labels, in order: Ceph; too much for a room; too much for a computer; too much for a drive; too much for a book; too much for a cave]
  • 56. [Table] philosophy: OPEN SOURCE, COMMUNITY-FOCUSED. design: SCALABLE, NO SINGLE POINT OF FAILURE.
  • 57. ARIOLIMAX CALIFORNICUS (image: aroid, Flickr / CC BY 2.0)
  • 58. THE OCTOPUS (A METAPHOR) [Labels: single point of failure; highly-available; replicated]. I love speaking in metaphors.
  • 59. THE BEEHIVE (ANOTHER METAPHOR) (image: blumenbiene, Flickr / CC BY 2.0)
  • 60. [Table] philosophy: OPEN SOURCE, COMMUNITY-FOCUSED. design: SCALABLE, NO SINGLE POINT OF FAILURE, SOFTWARE BASED.
  • 61. [Diagram: C+D pairs, with extra C and D units being added (+)]
  • 62. [Diagram: C+D pairs, with extra C and D units being added (+)]
  • 63. [Table] philosophy: OPEN SOURCE, COMMUNITY-FOCUSED. design: SCALABLE, NO SINGLE POINT OF FAILURE, SOFTWARE BASED, SELF-MANAGING.
  • 64. DISKS = JUST TINY RECORD PLAYERS (image: jon_a_ross, Flickr / CC BY 2.0)
  • 65. [Diagram: a grid of disks (D)] x 1 MILLION = 55 times / day
  • 66. IT ALL STARTED WITH A DREAM
  • 67. +
  • 68. [Architecture diagram, repeated from slide 4.]
  • 69. [Architecture diagram, repeated from slide 4.]
  • 70. [Diagram: five OSDs, each on a file system (FS: btrfs, xfs, or ext4) on a DISK, plus three monitors (M)]
  • 71. [Diagram: a HUMAN and three monitors (M)]
  • 72. Monitors (M): maintain cluster map; provide consensus for distributed decision-making; must have an odd number; these do not serve stored objects to clients. OSDs: one per disk (recommended); at least three in a cluster; serve stored objects to clients; intelligently peer to perform replication tasks; support object classes.
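The "odd number" rule on slide 72 can be illustrated with simple majority arithmetic. This is a minimal sketch, assuming the monitors need a strict majority to agree; the function names are made up for illustration and are not part of Ceph.

```python
def quorum_size(monitors):
    """Smallest strict majority of `monitors` (illustrative helper)."""
    return monitors // 2 + 1


def tolerated_failures(monitors):
    """How many monitors can be lost while a majority is still reachable."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} monitors: quorum of {quorum_size(n)}, "
              f"survives {tolerated_failures(n)} failure(s)")
```

Under that assumption, going from three monitors to four raises the quorum size without raising the number of failures the cluster survives, which is why an even count buys nothing.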
  • 73. [Architecture diagram, repeated from slide 4.]
  • 74. [Diagram: an APP linked with LIBRADOS, talking to the cluster (three monitors, M) over the native protocol]
  • 75. LIBRADOS (L): provides direct access to RADOS for applications; C, C++, Python, PHP, Java; no HTTP overhead.
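As a rough idea of what "direct access to RADOS" from slide 75 looks like in Python, here is a minimal sketch using the `rados` binding that ships with Ceph. The pool name, object name, and config path are assumptions for illustration, and the code needs a reachable cluster to do anything useful.

```python
def store_and_fetch(pool, name, data, conffile="/etc/ceph/ceph.conf"):
    """Write `data` to object `name` in `pool`, then read it back.

    The default config path is an assumption; adjust it for your cluster.
    """
    # Imported lazily so this module can be loaded on machines without Ceph.
    import rados

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)  # I/O context bound to one pool
        try:
            ioctx.write_full(name, data)         # replace the whole object
            return ioctx.read(name, len(data))   # read the same bytes back
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

A call such as `store_and_fetch("data", "greeting", b"hello")` would talk to the cluster over the native protocol, with no HTTP gateway in between; the pool name `data` here is hypothetical.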
  • 76. [Architecture diagram, repeated from slide 4.]
  • 77. [Diagram: two APPs speaking REST to two RADOSGW instances, each on top of LIBRADOS, which talks to the cluster (three monitors, M) over the native protocol]
  • 78. RADOS Gateway: REST-based interface to RADOS; supports buckets, accounting; compatible with S3 and Swift applications.
  • 79. [Architecture diagram, repeated from slide 4.]
  • 80. [Diagram: a VM inside a VIRTUALIZATION CONTAINER that uses LIBRBD on top of LIBRADOS; three monitors (M)]
  • 81. [Diagram: a VM and two CONTAINERs, each with LIBRBD on top of LIBRADOS; three monitors (M)]
  • 82. [Diagram: a HOST using KRBD (KERNEL MODULE) and LIBRADOS; three monitors (M)]
  • 83. RADOS Block Device: storage of virtual disks in RADOS; allows decoupling of VMs and containers (live migration!); images are striped across the cluster; boot support in QEMU, KVM, and OpenStack Nova; mount support in the Linux kernel.
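To make "images are striped across the cluster" from slide 83 a little more concrete, the sketch below cuts a virtual disk into fixed-size objects and maps a byte offset to an object. It assumes plain fixed-size chunking with a 4 MiB object size chosen for illustration; real RBD striping has more parameters than this.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size for this sketch: 4 MiB


def locate(offset, object_size=OBJECT_SIZE):
    """Map a byte offset in the image to (object index, offset inside it)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_touched(offset, length, object_size=OBJECT_SIZE):
    """Indices of every object that an I/O of `length` bytes at `offset` hits."""
    if length <= 0:
        raise ValueError("length must be positive")
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return range(first, last + 1)
```

Because each object can then be placed independently, one large image ends up spread over many disks instead of living on a single one.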
  • 84. [Architecture diagram, repeated from slide 4.]
  • 85. [Diagram: a CLIENT with separate metadata (01) and data (10) paths; three monitors (M)]
  • 86. Metadata Server: manages metadata for a POSIX-compliant shared filesystem (directory hierarchy; file metadata such as owner, timestamps, mode, etc.); stores metadata in RADOS; does not serve file data to clients; only required for shared filesystem.
  • 87. WHAT MAKES CEPH UNIQUE?
  • 88. HOW DO YOU FIND YOUR KEYS? (image: azmeen, Flickr / CC BY 2.0)
  • 89. [Diagram: an APP (??) facing twelve C+D pairs]
  • 90. [Diagram: an APP looking for F* among twelve C+D pairs grouped into the ranges A-G, H-N, O-T, U-Z]
  • 91. I ALWAYS PUT MY KEYS ON THE HOOK (image: vitamindave, Flickr / CC BY 2.0)
  • 92. [Diagram: an APP and twelve C+D pairs]
  • 93. DEAR DIARY: KEYS = IN THE KITCHEN (image: Barnaby, Flickr / CC BY 2.0)
  • 94. HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?
  • 95. THE ANSWER: CRUSH!! (image: pasukaru76, Flickr / CC SA 2.0)
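  • 96. [Diagram: a row of objects (10 10 01 01 10 10 01 11 01 10) is mapped to placement groups by hash(object name) % num pg; the placement groups are then mapped to storage nodes by CRUSH(pg, cluster state, rule set)]
  • 97. [Diagram: the two rows of objects from slide 96, without the formulas]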
  • 98. CRUSH: pseudo-random placement algorithm; ensures even distribution; repeatable, deterministic; rule-based configuration (replica count, infrastructure topology, weighting).
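The two-step lookup on slides 96 to 98 can be sketched as follows. The first step is the slide's own formula, hash(object name) % num pg. The second step is NOT the CRUSH algorithm: it uses weighted rendezvous hashing as a stand-in that shares the properties listed on slide 98 (pseudo-random, repeatable, weight-aware), and it ignores infrastructure topology entirely. All function names and the OSD naming are assumptions for illustration.

```python
import hashlib
import math


def stable_hash(text):
    """64-bit hash that is identical on every client and every run."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def placement_group(object_name, num_pg):
    """Step 1 (from slide 96): hash(object name) % num pg."""
    return stable_hash(object_name) % num_pg


def place(pg, osd_weights, replicas):
    """Step 2 stand-in: choose `replicas` distinct OSDs for a placement group.

    `osd_weights` plays the role of "cluster state" and `replicas` the role
    of the "rule set". Weights must be positive.
    """
    def score(osd):
        # Uniform value strictly between 0 and 1, derived from (pg, osd).
        u = ((stable_hash(f"{pg}:{osd}") >> 12) + 0.5) / 2**52
        return -osd_weights[osd] / math.log(u)  # heavier OSDs score higher

    return sorted(osd_weights, key=score, reverse=True)[:replicas]


if __name__ == "__main__":
    osds = {f"osd.{i}": 1.0 for i in range(6)}  # hypothetical cluster
    pg = placement_group("my-object", 64)
    print(pg, place(pg, osds, 3))
```

The point of the design is that any client holding the same cluster description computes the same answer, so nobody has to keep (or ask) a central lookup table like the A-G / H-N ranges on slide 90.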
  • 99. CLIENT ??
  • 100. CLIENT ??
  • 101. [Diagram: a VM inside a VIRTUALIZATION CONTAINER that uses LIBRBD on top of LIBRADOS; three monitors (M)]
  • 102. HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
  • 103. instant copy [Diagram: 144, 0, 0, 0, 0 = 144]
  • 104. [Diagram: a CLIENT issuing four writes. 144, 4 = 148]
  • 105. [Diagram: a CLIENT issuing three reads. 144, 4 = 148]
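The arithmetic on slides 103 to 105 (144 objects, four instant copies costing 0 each, then four writes bringing the total to 148) matches a copy-on-write scheme. The sketch below is a toy model under that assumption, with a made-up `Image` class; it is not how RBD is implemented, and it simply never writes to the base image after cloning.

```python
class Image:
    """Toy copy-on-write image: stores only the objects written to it."""

    def __init__(self, objects=None, parent=None):
        self.objects = dict(objects or {})
        self.parent = parent

    def clone(self):
        # An "instant copy": no data is duplicated, only a parent link is kept.
        return Image(parent=self)

    def write(self, index, data):
        self.objects[index] = data

    def read(self, index):
        # Serve from our own objects first, otherwise fall through to the parent.
        if index in self.objects:
            return self.objects[index]
        if self.parent is not None:
            return self.parent.read(index)
        return None

    def stored_objects(self):
        return len(self.objects)


base = Image({i: b"base" for i in range(144)})
clones = [base.clone() for _ in range(4)]
for i in range(4):
    clones[0].write(i, b"new")

total = base.stored_objects() + sum(c.stored_objects() for c in clones)
print(total)
```

Reads of untouched objects are answered by the base image, so each extra VM only costs as much storage as it actually changes.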
  • 106. HOW DO YOU MANAGE DIRECTORY HIERARCHY WITHOUT A SINGLE POINT OF FAILURE?
  • 107. FILESYSTEMS REQUIRE METADATA (image: Barnaby, Flickr / CC BY 2.0)
  • 108. [Diagram: a CLIENT with metadata (01) and data (10) paths; three monitors (M)]
  • 109. M M M
  • 110. one tree, three metadata servers ??
  • 111. DYNAMIC SUBTREE PARTITIONING
  • 112. AND NOW BACKPEDALING
  • 113. ALMOST EVERYTHING WORKS
  • 114. [Architecture diagram from slide 4, with status labels.] RADOSGW: AWESOME. RBD: AWESOME. CEPH FS: NEARLY AWESOME. LIBRADOS: AWESOME. RADOS: AWESOME.
  • 115. *LAN SCALE!!* OR REALLY REALLY SCARY FAST WAN
  • 116. CEPH AND CLOUDSTACK (image: tableatny, Flickr / CC BY 2.0)
  • 117. RBD SUPPORT IN CLOUDSTACK: allows storage of virtual disks inside RADOS (works with KVM only right now; no snapshots yet); upcoming in CloudStack 4. More information can be found on the mailing list: ceph-devel / incubator-cloudstack-dev: http://article.gmane.org/gmane.comp.file-systems.ceph.devel/7505
  • 118. QUESTIONS? Ross Turk, VP Community, Inktank. ross@inktank.com, @rossturk, inktank.com | ceph.com