
vBACD July 2012 - Scaling Storage with Ceph


"Scaling Storage with Ceph", Ross Turk, VP of Community, Inktank
Ceph is an open source distributed object store, network block device, and file system designed for reliability, performance, and scalability. It runs on commodity hardware, has no single point of failure, and is supported by the Linux kernel. This talk will describe the Ceph architecture, share its design principles, and discuss how it can be part of a cost-effective, reliable cloud stack.


Transcript of "vBACD July 2012 - Scaling Storage with Ceph"

  1. SCALING STORAGE WITH CEPH. Ross Turk, Inktank
  2. [Architecture diagram] RADOSGW (used by APP): a bucket-based REST gateway, compatible with S3 and Swift § RBD (used by HOST/VM): a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver § CEPH FS (used by CLIENT): a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE § LIBRADOS (used by APP): a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP § RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
  3. IN THE BEGINNING (image: Magic Madzik, Flickr / CC BY 2.0)
  4. EARLY INFORMATION STORAGE (image: Chico.Ferreira, Flickr / CC BY 2.0)
  5. WRITING > CAVE PAINTINGS (image: kevingessner, Flickr / CC BY-SA 2.0)
  6. [Diagram: x1000 == x1]
  7. PEOPLE BEGIN WRITING A LOT (image: Moyan_Brenn, Flickr / CC BY-ND 2.0)
  8. WRITING IS TIME-CONSUMING (image: trekkyandy, Flickr / CC BY 2.0)
  9. THE INDUSTRIALIZATION OF WRITING (image: FateDenied, Flickr / CC BY 2.0)
  10. magnet + tape = magnetic tape [Diagram: x1000 == x1]
  11. STORAGE BECOMES MECHANICAL (image: Erik Pitti, Wikipedia / CC BY-ND 2.0)
  12. [Diagram: HUMAN, ROCK, INK, HUMAN, PAPER, HUMAN, COMPUTER, TAPE]
  13. COMPUTERS NEED PEOPLE TO WORK (image: USDAgov, Flickr / CC BY 2.0)
  14. [Diagram: HUMAN, COMPUTER, TAPE]
  15. 11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 == 01011010 01101010 10101010 10101010 01010110 01010011
  16. THROUGHPUT BECOMES IMPORTANT (image: Zane Luke, Flickr / CC BY-ND 2.0)
  17. LAZ0R B3AMS CHANGE EVERYTHING!! (image: Jeff Kubina, Flickr / CC-BY-SA 2.0)
  18. HARD DRIVES ARE TOTALLY BETTER § amazing spinny hard drives § sucky stupid tape slow
  19. EVERYTHING GETS MESSY (image: Rob!, Flickr / CC BY 2.0)
  20. [Diagram: grid of disk blocks labeled aa, ab, ac, ba, bb, bc, da, db, dc, some of them holding data bits]
  21. file § owner: rturk § created: aug12 § last viewed: aug17 § size: 42025 § perms: 644 § data: 11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010
  22. [Diagram: block grid aa through dc, as on slide 20, with slightly different contents]
  23. WE OUTGROW THE HARD DRIVE (image: Mr. T in DC, Flickr / CC BY 2.0)
  24. [Diagram: one HUMAN, one COMPUTER, seven DISKs]
  25. PEOPLE NEED SIMULTANEOUS ACCESS (image: wFourier, Flickr / CC BY 2.0)
  26. [Diagram: three HUMANs, one COMPUTER, seven DISKs]
  27. [Diagram: many HUMANs and DISKs around a single (COMPUTER)] (actually more like this…)
  28. [Diagram: three HUMANs in front of a grid of COMPUTER/DISK pairs]
  29. [Diagram: the block grid from slide 20, marked with an X]
  30. object § pace: quick § driver: frog § license: expired § expression: agog § data: 11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010
  31. [Diagram: an APP in front of a grid of COMPUTER/DISK pairs]
  32. [Diagram: a COMPUTER/DISK pair alongside a grid of COMPUTER/DISK pairs]
  33. [Diagram: three VMs in front of a grid of COMPUTER/DISK pairs]
  34. STORAGE THROUGHOUT HISTORY § Ceph § Cloud computing § Distributed storage § Shared storage § Computers § Writing § Painting. Time-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is.
  35. [Diagram: three HUMANs in front of a grid of COMPUTER/DISK pairs]
  36. [Diagram: twelve COMPUTER/DISK pairs, packed together]
  37. [Diagram: twelve C D pairs (COMPUTER/DISK, abbreviated)]
  38. [Diagram: three HUMANs in front of a grid of C D pairs]
  39. STORAGE APPLIANCES (image: Michael Moll, Wikipedia / CC BY-SA 2.0)
  40. 6.4 MILLION SQFT OF FACTORIES (image: Dude94111, Flickr / CC BY 2.0)
  41. STORAGE VENDORS HAVE BIG BILLS (image: CarbonNYC, Flickr / CC BY 2.0)
  42. STORAGE APPLIANCES ARE EXPENSIVE (image: 401K 2012, Flickr / CC BY-SA 2.0)
  43. TECHNOLOGY IS A COMMODITY (image: RaeAllen, Flickr / CC-BY 2.0)
  44. COMMODITY PRICES FLUCTUATE [Chart: time axis May-07, May-08, May-09, May-10, May-11, May-12]
  45. GROWING WITH HARDWARE APPLIANCES. First PB: § Proprietary storage hardware § Well-known storage vendor § $14 b’zillion. Second PB: § Proprietary storage hardware § Same storage vendor § Another $14 b’zillion
  46. APPLIANCES ARE OLD TECHNOLOGY (image: Paul Keller, Flickr / CC BY 2.0)
  47. Source: http://www.cpubenchmark.net/high_end_cpus.html
  48. FLAGSHIP HARDWARE APPLIANCE
  49. Hardware Appliances are Mysterious Black Boxes (image: Abode of Chaos, Flickr / CC BY 2.0)
  50. [Diagram: grid of C D pairs with a C++ label]
  51. [Diagram: grid of C D pairs with a C++ label, marked with an X]
  52. [Diagram: a HUMAN [DEVELOPER] with !! in front of a grid of C D pairs]
  53. THE WORLD NEEDS A STORAGE TECHNOLOGY THAT SCALES INFINITELY
  54. THE WORLD NEEDS A STORAGE TECHNOLOGY THAT DOESN’T REQUIRE AN INDUSTRIAL MANUFACTURING PROCESS
  55. SAGE WEIL § Co-founder of DreamHost § Inventor of Ceph § CEO of Inktank
  56. philosophy: OPEN SOURCE § design: (nothing listed yet)
  57. OPEN SOURCE SPREADS IDEAS (image: orchidgalore, Flickr / CC BY 2.0)
  58. philosophy: OPEN SOURCE, COMMUNITY-FOCUSED § design: (nothing listed yet)
  59. WE ARE SMARTER TOGETHER (image: rturk, LinkedIn InMap)
  60. CEPH BELONGS TO ALL OF US (image: wackybadger, Flickr / CC BY 2.0)
  61. philosophy: OPEN SOURCE, COMMUNITY-FOCUSED § design: SCALABLE
  62. CEPH IS BUILT TO SCALE § Ceph § Too much for a room § Too much for a computer § Too much for a drive § Too much for a book § Too much for a cave
  63. philosophy: OPEN SOURCE, COMMUNITY-FOCUSED § design: SCALABLE, NO SINGLE POINT OF FAILURE
  64. ARILOMAX CALIFORNICUS (image: aroid, Flickr / CC BY 2.0)
  65. THE OCTOPUS (A METAPHOR) § single point of failure § highly-available § replicated. I love speaking in metaphors.
  66. THE BEEHIVE (ANOTHER METAPHOR) (image: blumenbiene, Flickr / CC BY 2.0)
  67. philosophy: OPEN SOURCE, COMMUNITY-FOCUSED § design: SCALABLE, NO SINGLE POINT OF FAILURE, SOFTWARE BASED
  68. [Diagram: grid of C D pairs with a C++ label]
  69. [Diagram: grid of C D pairs with a C++ label, marked with a ✔]
  70. philosophy: OPEN SOURCE, COMMUNITY-FOCUSED § design: SCALABLE, NO SINGLE POINT OF FAILURE, SOFTWARE BASED, SELF-MANAGING
  71. DISKS = JUST TINY RECORD PLAYERS (image: jon_a_ross, Flickr / CC BY 2.0)
  72. [Diagram: rows of disks (D)] x 1 MILLION, 55 times / day
  73. IT ALL STARTED WITH A DREAM
  74. +
  75. NEW MONTHLY CODE COMMITS [Chart: y-axis 0 to 700; x-axis 2004-06, 2005-07, 2006-07, 2007-07, 2008-07, 2009-07, 2010-07, 2011-07]
  76. CEPH STARTS POPPING UP! (sorry about all the logo tampering)
  77. [Architecture diagram, repeated from slide 2: RADOSGW, RBD, CEPH FS, and LIBRADOS on top of RADOS]
  78. [Architecture diagram, repeated from slide 2: RADOSGW, RBD, CEPH FS, and LIBRADOS on top of RADOS]
  79. [Diagram: five OSDs, each on a filesystem (FS: btrfs, xfs, ext4) on a DISK; M M M]
  80. [Diagram: HUMAN; M M M]
  81. Monitors (M): § Maintain cluster map § Provide consensus for distributed decision-making § Must have an odd number § These do not serve stored objects to clients. OSDs: § One per disk (recommended) § At least three in a cluster § Serve stored objects to clients § Intelligently peer to perform replication tasks § Support object classes
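
A note on the "must have an odd number" bullet: assuming the monitors agree by strict majority vote, the arithmetic below is a minimal sketch of why an even count buys nothing. The function names are illustrative and are not part of Ceph.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority among `monitors` voters."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be lost while a majority can still form."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Going from 3 to 4 monitors raises the quorum size
    # without raising the number of failures survived.
    for n in range(1, 8):
        print(f"monitors={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under this assumption, three and four monitors both survive exactly one failure, so the fourth monitor adds cost but no extra fault tolerance.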
  82. [Architecture diagram, repeated from slide 2: RADOSGW, RBD, CEPH FS, and LIBRADOS on top of RADOS]
  83. [Diagram: APP, LIBRADOS, native; M M M]
  84. LIBRADOS (L): § Provides direct access to RADOS for applications § C, C++, Python, PHP, Java § No HTTP overhead
  85. [Architecture diagram, repeated from slide 2: RADOSGW, RBD, CEPH FS, and LIBRADOS on top of RADOS]
  86. [Diagram: two APPs speaking REST to two RADOSGW instances, each with LIBRADOS, native; M M M]
  87. RADOS Gateway: § REST-based interface to RADOS § Supports buckets, accounting § Compatible with S3 and Swift applications
  88. [Architecture diagram, repeated from slide 2: RADOSGW, RBD, CEPH FS, and LIBRADOS on top of RADOS]
  89. [Diagram: VM in a VIRTUALIZATION CONTAINER, LIBRBD, LIBRADOS; M M M]
  90. [Diagram: a VM and two CONTAINERs, each with LIBRBD and LIBRADOS; M M M]
  91. [Diagram: HOST, KRBD (KERNEL MODULE), LIBRADOS; M M M]
  92. RADOS Block Device: § Storage of virtual disks in RADOS § Allows decoupling of VMs and containers (live migration!) § Images are striped across the cluster § Boot support in QEMU, KVM, and OpenStack Nova § Mount support in the Linux kernel
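
To make "images are striped across the cluster" concrete, here is a minimal sketch of mapping a byte offset in a virtual disk to a fixed-size object. The 4 MiB object size and the function names are assumptions for illustration; the slides do not state them.

```python
from typing import Tuple

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size (4 MiB), not taken from the slides


def locate(offset: int, object_size: int = OBJECT_SIZE) -> Tuple[int, int]:
    """Return (object index, offset inside that object) for a byte offset in the image."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_touched(offset: int, length: int, object_size: int = OBJECT_SIZE) -> range:
    """Indices of every object that a read or write of `length` bytes at `offset` spans."""
    if length <= 0:
        return range(0)
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return range(first, last + 1)


if __name__ == "__main__":
    print(locate(OBJECT_SIZE + 5))
    print(list(objects_touched(OBJECT_SIZE - 4, 10)))
```

Because each object is then placed independently, a large sequential read ends up spread over many disks rather than hitting one.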
  93. [Architecture diagram, repeated from slide 2: RADOSGW, RBD, CEPH FS, and LIBRADOS on top of RADOS]
  94. [Diagram: CLIENT, metadata (01), data (10); M M M]
  95. Metadata Server: § Manages metadata for a POSIX-compliant shared filesystem (directory hierarchy; file metadata such as owner, timestamps, mode, etc.) § Stores metadata in RADOS § Does not serve file data to clients § Only required for shared filesystem
  96. WHAT MAKES CEPH UNIQUE?
  97. HOW DO YOU FIND YOUR KEYS? (image: azmeen, Flickr / CC BY 2.0)
  98. [Diagram: an APP with ?? in front of a grid of C D pairs]
  99. [Diagram: an APP looking up F* among groups of C D pairs labeled A-G, H-N, O-T, U-Z]
  100. I ALWAYS PUT MY KEYS ON THE HOOK (image: vitamindave, Flickr / CC BY 2.0)
  101. [Diagram: an APP in front of a grid of C D pairs]
  102. DEAR DIARY: KEYS = IN THE KITCHEN (image: Barnaby, Flickr / CC BY 2.0)
  103. HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?
  104. THE ANSWER: CRUSH!! (image: pasukaru76, Flickr / CC SA 2.0)
  105. [Diagram: objects (10 10 01 01 10 10 01 11 01 10) are grouped by hash(object name) % num pg, then placed by CRUSH(pg, cluster state, rule set)]
  106. [Diagram: two rows of objects, 10 10 01 01 10 10 01 11 01 10]
  107. CRUSH: § Pseudo-random placement algorithm § Ensures even distribution § Repeatable, deterministic § Rule-based configuration (replica count, infrastructure topology, weighting)
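
The two steps on slide 105 can be sketched as follows. This is not the real CRUSH algorithm: the second step substitutes weighted rendezvous hashing as a simple stand-in that is also pseudo-random, deterministic, and weight-aware, and it ignores infrastructure topology entirely. All names (`object_to_pg`, `pg_to_osds`, the `osd.N` labels) are illustrative.

```python
import hashlib
import math
from typing import Dict, List


def stable_hash(*parts: str) -> int:
    """64-bit hash that is identical on every client (unlike Python's built-in hash())."""
    digest = hashlib.sha256("/".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(name: str, num_pg: int) -> int:
    """Step 1 from the slide: hash(object name) % num pg."""
    return stable_hash(name) % num_pg


def pg_to_osds(pg: int, osd_weights: Dict[str, float], replicas: int) -> List[str]:
    """Step 2, using weighted rendezvous hashing as a stand-in for CRUSH."""
    def score(osd: str) -> float:
        # Map the hash to a float strictly between 0 and 1, then weight it.
        u = ((stable_hash(str(pg), osd) >> 12) + 0.5) / 2 ** 52
        return -osd_weights[osd] / math.log(u)
    return sorted(osd_weights, key=score, reverse=True)[:replicas]


if __name__ == "__main__":
    cluster = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0, "osd.3": 1.0, "osd.4": 1.0}
    pg = object_to_pg("my-object", num_pg=64)
    print(pg, pg_to_osds(pg, cluster, replicas=3))

    # Losing one OSD only changes placement groups that were using it.
    before = {p: pg_to_osds(p, cluster, 3) for p in range(64)}
    smaller = {k: v for k, v in cluster.items() if k != "osd.2"}
    after = {p: pg_to_osds(p, smaller, 3) for p in range(64)}
    moved = sum(1 for p in before if before[p] != after[p])
    print(f"{moved} of 64 placement groups changed")
```

The point of the design is that any client holding the same cluster description computes the same answer, so nobody has to ask a central directory where an object lives.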
  108. [Diagram: CLIENT ??]
  109. [Diagram: CLIENT ??]
  110. [Diagram: VM in a VIRTUALIZATION CONTAINER, LIBRBD, LIBRADOS; M M M]
  111. HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
  112. [Diagram: instant copy] 144 + 0 + 0 + 0 + 0 = 144
  113. [Diagram: CLIENT write] 144 + 4 = 148
  114. [Diagram: CLIENT read] 144 + 4 = 148
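
The arithmetic on slides 112 to 114 reads like copy-on-write cloning. Assuming that interpretation, the sketch below reproduces the 144 and 148 totals with a toy `Image` class, which is hypothetical and not the RBD API. It also assumes the base image is left unmodified once clones exist.

```python
from typing import Dict, Optional


class Image:
    """Toy copy-on-write image: stores only the objects written to it directly."""

    def __init__(self, parent: Optional["Image"] = None) -> None:
        self.parent = parent
        self.objects: Dict[int, bytes] = {}

    def clone(self) -> "Image":
        """Instant copy: the clone starts with zero objects of its own."""
        return Image(parent=self)

    def write(self, index: int, data: bytes) -> None:
        """Writes land in this image only; the parent is untouched."""
        self.objects[index] = data

    def read(self, index: int) -> Optional[bytes]:
        """Reads fall through to the parent for objects never written here."""
        if index in self.objects:
            return self.objects[index]
        if self.parent is not None:
            return self.parent.read(index)
        return None

    def stored_objects(self) -> int:
        return len(self.objects)


base = Image()
for i in range(144):
    base.write(i, b"base")

clones = [base.clone() for _ in range(4)]
total_before = base.stored_objects() + sum(c.stored_objects() for c in clones)

for i in range(4):
    clones[0].write(i, b"new")
total_after = base.stored_objects() + sum(c.stored_objects() for c in clones)

print(total_before, total_after)
```

Each additional clone costs nothing until it diverges, which is what makes spinning up many VMs from one golden image cheap.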
  115. HOW DO YOU MANAGE DIRECTORY HIERARCHY WITHOUT A SINGLE POINT OF FAILURE?
  116. FILESYSTEMS REQUIRE METADATA (image: Barnaby, Flickr / CC BY 2.0)
  117. [Diagram: CLIENT, 01, 10; M M M]
  118. [Diagram: M M M]
  119. one tree, three metadata servers ??
  120. DYNAMIC SUBTREE PARTITIONING
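
As a rough illustration of the idea behind slides 119 and 120, the sketch below splits one directory tree among three metadata servers by handing the busiest subtrees to the least-loaded server, and is simply re-run when the load changes. This greedy heuristic is an assumption chosen for brevity, not the balancer Ceph actually uses, and the paths and load figures are made up.

```python
import heapq
from typing import Dict


def partition_subtrees(load_by_subtree: Dict[str, int], num_servers: int) -> Dict[str, int]:
    """Assign each subtree to a metadata server, heaviest subtree first."""
    # Heap of (total load so far, server id); the least-loaded server pops first.
    heap = [(0, server) for server in range(num_servers)]
    heapq.heapify(heap)
    assignment: Dict[str, int] = {}
    ordered = sorted(load_by_subtree.items(), key=lambda kv: (-kv[1], kv[0]))
    for subtree, load in ordered:
        total, server = heapq.heappop(heap)
        assignment[subtree] = server
        heapq.heappush(heap, (total + load, server))
    return assignment


if __name__ == "__main__":
    load = {"/home": 50, "/var": 30, "/usr": 20, "/tmp": 10}
    print(partition_subtrees(load, num_servers=3))

    # "Dynamic": when /tmp becomes hot, recomputing moves subtrees around.
    load["/tmp"] = 100
    print(partition_subtrees(load, num_servers=3))
```

A real system would also weigh the cost of migrating a subtree against the benefit, which this sketch leaves out.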
  121. AND NOW BACKPEDALING
  122. ALMOST EVERYTHING WORKS
  123. [Architecture diagram with ratings] RADOSGW (AWESOME): a bucket-based REST gateway, compatible with S3 and Swift § RBD (AWESOME): a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver § CEPH FS (NEARLY AWESOME): a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE § LIBRADOS (AWESOME): a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP § RADOS (AWESOME): a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
  124. *LAN SCALE!!* OR REALLY REALLY SCARY FAST WAN
  125. CEPH AND CLOUDSTACK (image: tableatny, Flickr / CC BY 2.0)
  126. RBD SUPPORT IN CLOUDSTACK § Just announced two weeks ago! § Allows storage of virtual disks inside RADOS § Works with KVM only right now § No volume snapshots yet § Requires the latest version of, um, everything § More information can be found on the mailing list: ceph-devel / incubator-cloudstack-dev: http://article.gmane.org/gmane.comp.file-systems.ceph.devel/7505
  127. QUESTIONS? Ross Turk, VP Community, Inktank § ross@inktank.com § @rossturk § inktank.com | ceph.com