DEVIEW 2013

Slides from my talk at DEVIEW 2013

  1. Ceph: Massively Scalable Distributed Storage. Patrick McGarry, Director of Community, Inktank
  2. Who am I? ● Patrick McGarry ● Director of Community, Inktank ● /. → ALU → P4 → Inktank ● @scuttlemonkey ● patrick@inktank.com
  3. Outline ● What is Ceph? ● Ceph History ● How does Ceph work? ● CRUSH Fundamentals ● Applications ● Testing ● Performance ● Moving forward ● Questions
  4. What is Ceph?
  5. Ceph at a glance ● Software: Ceph can run on any infrastructure, metal or virtualized, to provide a cheap and powerful storage cluster. ● All-in-1: object, block, and file. ● CRUSH: awesomesauce; an infrastructure-aware placement algorithm allows you to do really cool stuff. ● Scale: huge and beyond; designed for exabytes, current implementations in the multi-petabyte range. HPC, Big Data, Cloud, raw storage. ● On commodity hardware: low overhead doesn't mean just hardware, it means people too!
  6. Unified storage system ● objects: native, RESTful ● block: thin provisioning, snapshots, cloning ● file: strong consistency, snapshots
  7. Distributed storage system ● data center scale: 10s to 10,000s of machines, terabytes to exabytes ● fault tolerant: no single point of failure, commodity hardware, self-managing, self-healing
  8. Where did Ceph come from?
  9. UCSC research grant ● "Petascale object storage" ● DOE: LANL, LLNL, Sandia ● Scalability ● Reliability ● Performance: raw IO bandwidth, metadata ops/sec ● HPC file system workloads: thousands of clients writing to the same file or directory
  10. Distributed metadata management ● Innovative design: subtree-based partitioning for locality and efficiency; dynamically adapt to current workload; embedded inodes ● Prototype simulator in Java (2004) ● First line of Ceph code: summer internship at LLNL; high-security national lab environment; could write anything, as long as it was OSS
  11. The rest of Ceph ● RADOS – distributed object storage cluster (2005) ● EBOFS – local object storage (2004/2006) ● CRUSH – hashing for the real world (2005) ● Paxos monitors – cluster consensus (2006) → emphasis on consistent, reliable storage → scale by pushing intelligence to the edges → a different but compelling architecture
  12. Industry black hole ● Many large storage vendors: proprietary solutions that don't scale well ● Few open source alternatives (2006): very limited scale, or limited community and architecture (Lustre); no enterprise feature sets (snapshots, quotas) ● PhD grads all built interesting systems... and then went to work for NetApp, DDN, EMC, Veritas ● They want you, not your project
  13. A different path ● Change the world with open source: do what Linux did to Solaris, Irix, Ultrix, etc. What could go wrong? ● License: GPL, BSD... LGPL (share changes, okay to link to proprietary code) ● Avoid unsavory practices: dual licensing, copyright assignment
  14. How does Ceph work?
  15. (Architecture diagram) ● LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (used by APP) ● RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift (used by APP) ● RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver (used by HOST/VM) ● CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE (used by CLIENT) ● RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
  16. Why start with objects? ● more useful than (disk) blocks: variable size; names in a single flat namespace; simple API with rich semantics ● more scalable than files: no hard-to-distribute hierarchy; update semantics do not span objects; workload is trivially parallel
  17. Ceph object model ● pools: independent namespaces or object collections; 1s to 100s; replication level, placement policy ● objects: bazillions; blob of data (bytes to gigabytes); attributes (e.g., "version=12"; bytes to kilobytes); key/value bundle (bytes to gigabytes)
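
As a minimal sketch of this object model, the snippet below uses the python-rados binding to write an object and tag it with an attribute. The config path and the pool name "data" are assumptions about the local setup, not anything fixed by Ceph.

    # Write an object and tag it with an attribute via librados (python-rados).
    # Assumes a reachable cluster, /etc/ceph/ceph.conf, and an existing pool "data".
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('data')                 # a pool: independent namespace
        try:
            ioctx.write_full('greeting', b'hello ceph')    # blob of data
            ioctx.set_xattr('greeting', 'version', b'12')  # attribute, e.g. "version=12"
            print(ioctx.read('greeting'))                  # b'hello ceph'
            print(ioctx.get_xattr('greeting', 'version'))  # b'12'
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
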
  18. (Diagram: OSDs, each running on a local filesystem (btrfs, xfs, or ext4) on its own disk, alongside a small set of monitors, M.)
  19. Object Storage Daemons (OSDs) ● 10s to 10,000s in a cluster ● one per disk, SSD, or RAID group, or ... (hardware agnostic) ● serve stored objects to clients ● intelligently peer to perform replication and recovery tasks. Monitors (M) ● maintain cluster membership and state ● provide consensus for distributed decision-making ● small, odd number ● do not serve stored objects to clients
  20. Data distribution ● all objects are replicated N times ● objects are automatically placed, balanced, migrated in a dynamic cluster ● must consider physical infrastructure: ceph-osds on hosts, in racks, in rows, in data centers ● three approaches: pick a spot and remember where you put it; pick a spot and write down where you put it; calculate where to put it and where to find it
  21. CRUSH fundamentals
  22. CRUSH ● pseudo-random placement algorithm: fast calculation, no lookup; repeatable, deterministic ● statistically uniform distribution ● stable mapping: limited data migration on change ● rule-based configuration: infrastructure topology aware; adjustable replication; allows weighting
  23. [Mapping] [Buckets] [Rules] [Map]
  24. (Diagram: objects are hashed into placement groups, which CRUSH maps to OSDs.) hash(object name) % num_pg → CRUSH(pg, cluster state, policy)
  25. (Diagram: the same object → placement group → OSD mapping, with placement groups spread across the cluster.)
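
To make the two-step mapping above concrete, here is a toy, compute-only placement sketch in Python. It is not the real CRUSH algorithm (no buckets, rules, or weights); it only illustrates the idea that any client holding the same cluster map can calculate, rather than look up, where an object lives. The PG count and OSD list are invented for the example.

    # Toy illustration of compute-only placement (not the actual CRUSH algorithm):
    # hash the object name into a placement group, then deterministically rank
    # OSDs for that PG. Every client with the same map gets the same answer.
    import hashlib

    NUM_PGS = 128                                  # illustrative pool setting
    OSDS = ['osd.%d' % i for i in range(12)]       # illustrative cluster map
    REPLICAS = 3

    def pg_for_object(name):
        """hash(object name) % num_pg"""
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_PGS

    def osds_for_pg(pg):
        """Rendezvous-style ranking: stable, so a map change only moves
        the data that actually has to move."""
        score = lambda osd: hashlib.md5(('%d:%s' % (pg, osd)).encode()).hexdigest()
        return sorted(OSDS, key=score)[:REPLICAS]

    pg = pg_for_object('greeting')
    print(pg, osds_for_pg(pg))                     # deterministic on every client
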
  26. (Diagram: CLIENT, with a question mark: where does an object live?)
  27. (Diagram: CLIENT and question mark, continued.)
  28. Applications
  29. (Architecture diagram revisited: LIBRADOS, RADOSGW, RBD, and CEPH FS, all layered on RADOS, as on slide 15.)
  30. RADOS Gateway ● REST-based object storage proxy ● uses RADOS to store objects ● API supports buckets, accounting ● usage accounting for billing purposes ● compatible with S3, Swift APIs
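
Because the gateway speaks the S3 API, a stock S3 client such as boto can talk to it directly. In the sketch below the endpoint, credentials, and bucket name are placeholders for whatever your radosgw deployment actually uses.

    # Use a stock S3 client (boto) against radosgw. Host, keys, and bucket
    # name are placeholders for your own gateway and credentials.
    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('my-bucket')           # create a bucket via the gateway
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from radosgw') # upload an object
    print([k.name for k in bucket.list()])             # ['hello.txt']
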
  31. RADOS Block Device ● storage of disk images in RADOS ● decouple VM from host ● images striped across entire cluster (pool) ● snapshots ● copy-on-write clones ● support in: QEMU/KVM, Xen; mainline Linux kernel (2.6.34+); OpenStack, CloudStack; Ganeti, OpenNebula
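
A small sketch of the block device API using the python-rbd bindings: create an image, write to it, and take a snapshot. The pool name "rbd", the image name, and the size are assumptions for the example.

    # Create an RBD image, write to it, and snapshot it (python-rbd).
    # Assumes a reachable cluster and an existing pool named "rbd".
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')
    try:
        rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)  # 10 GiB, thin provisioned
        image = rbd.Image(ioctx, 'vm-disk-1')
        try:
            image.write(b'boot data', 0)                    # write at offset 0
            image.create_snap('before-install')             # point-in-time snapshot
            # protecting the snapshot would then allow copy-on-write clones
        finally:
            image.close()
    finally:
        ioctx.close()
        cluster.shutdown()
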
  32. Metadata Server (MDS) ● manages metadata for POSIX shared file system ● directory hierarchy ● file metadata (size, owner, timestamps) ● stores metadata in RADOS ● does not serve file data to clients ● only required for the shared file system
  33. Multiple protocols, implementations ● Linux kernel client: mount -t ceph 1.2.3.4:/ /mnt; re-export via NFS or Samba (CIFS) ● ceph-fuse ● libcephfs.so: link it into your app; used by Samba (CIFS), Ganesha (NFS), and Hadoop (map/reduce)
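
Because CephFS is POSIX-compliant, once it is mounted (via the kernel client or ceph-fuse) ordinary file code works unchanged; nothing Ceph-specific leaks into the application. The mount point below is a placeholder.

    # Plain POSIX file I/O on a mounted CephFS; only the mount point is assumed.
    import os

    MNT = '/mnt/cephfs'
    os.makedirs(os.path.join(MNT, 'projects/deview'), exist_ok=True)
    path = os.path.join(MNT, 'projects/deview/notes.txt')

    with open(path, 'w') as f:
        f.write('CephFS looks like any other filesystem to applications.\n')

    print(os.stat(path).st_size)                   # regular stat(), regular size
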
  34. Testing
  35. Package Building ● GitBuilder: very basic tool, but it works well; push to git → .deb and .rpm for all platforms; builds packages and tarballs, auto-signed ● Pbuilder: used for major releases; clean-room type of build; signed using a different key; as many distros as possible (all Debian, Ubuntu, Fedora 17 & 18, CentOS, SLES, openSUSE, tarball)
  36. Teuthology ● Our own custom test framework ● Allocates the machines you define to the cluster you define ● Runs tasks against that environment: set up Ceph, run workloads ● Example: pull from the Samba gitbuilder → mount Ceph → run an NFS client → mount it → run a workload ● Makes it easy to stack things up, map them to hosts, and run ● Suite of tests on our GitHub ● Active community development underway!
  37. Performance
  38. 4K Relative Performance (chart). All workloads: QEMU/KVM, BTRFS, or XFS. Read-heavy workloads only: kernel RBD, EXT4, or XFS.
  39. 128K Relative Performance (chart). All workloads: QEMU/KVM, or XFS. Read-heavy workloads only: kernel RBD, BTRFS, or EXT4.
  40. 4M Relative Performance (chart). All workloads: QEMU/KVM, or XFS. Read-heavy workloads only: kernel RBD, BTRFS, or EXT4.
  41. What does this mean? ● Anecdotal results are great! ● Concrete evidence coming soon ● Performance is complicated: workload, hardware (especially network!), lots and lots of tunables ● Mark Nelson (nhm on #ceph) is the performance geek ● Lots of reading on the ceph.com blog ● We welcome testing and comparison help ● Anyone want to help with an EMC / NetApp showdown?
  42. Moving forward
  43. Future plans ● Geo-Replication (an ongoing process): while the first pass for disaster recovery is done, we want to get to built-in, world-wide replication. ● Erasure Coding (reception efficiency): currently underway in the community! ● Tiering (headed to dynamic): can already do this in a static pool-based setup; looking to get to a use-based migration. ● Governance (making it open-er): been talking about it forever. The time is coming!
  44. Get Involved! ● CDS (quarterly online summit): puts the core devs together with the Ceph community. ● Ceph Day (not just for NYC): more planned, including Santa Clara and London. Keep an eye out: http://inktank.com/cephdays/ ● IRC (Geek-on-Duty): during the week there are times when Ceph experts are available to help. Stop by oftc.net/ceph ● Lists (e-mail makes the world go): our mailing lists are very active; check out ceph.com for details on how to join in!
  45. Questions? ● E-mail: patrick@inktank.com ● Website: ceph.com ● Social: @scuttlemonkey, @ceph, facebook.com/cephstorage
