Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt
Sage Weil, Founder & CTO, Inktank
1. Building Tomorrow's Ceph
   Sage Weil
2. Research beginnings
3. UCSC research grant
     “Petascale object storage”
     US Dept of Energy: LANL, LLNL, Sandia
     Scalability
     Reliability
     Performance
     Raw IO bandwidth, metadata ops/sec
     HPC file system workloads
     Thousands of clients writing to the same file or directory
4. Distributed metadata management
     Innovative design
     Subtree-based partitioning for locality, efficiency
     Dynamically adapt to current workload
     Embedded inodes
     Prototype simulator in Java (2004)
     First line of Ceph code
     Summer internship at LLNL
     High security national lab environment
     Could write anything, as long as it was OSS
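    The dynamic subtree partitioning idea can be sketched in a few lines. This is not the real MDS balancer (which adds popularity decay, hierarchy-aware heuristics, and embedded inodes); the rank names and load numbers below are invented for illustration. Each metadata server tracks per-subtree load, and hot subtrees migrate from busy ranks to idle ones.

        # Toy sketch of dynamic subtree partitioning (not the actual MDS balancer).
        # Each MDS rank tracks load per directory subtree; periodically the
        # hottest subtree is migrated off the busiest rank onto the idlest one.
        mds_load = {"mds.0": {"/home": 900, "/home/alice": 700},
                    "mds.1": {"/var": 50}}

        def rebalance(load):
            total = {mds: sum(dirs.values()) for mds, dirs in load.items()}
            busiest = max(total, key=total.get)
            idlest = min(total, key=total.get)
            if busiest == idlest:
                return
            # Export the hottest subtree from the busiest rank to the idlest one
            subtree = max(load[busiest], key=load[busiest].get)
            load[idlest][subtree] = load[busiest].pop(subtree)

        rebalance(mds_load)
        print(mds_load)
        # {'mds.0': {'/home/alice': 700}, 'mds.1': {'/var': 50, '/home': 900}}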
5. The rest of Ceph
     RADOS – distributed object storage cluster (2005)
     EBOFS – local object storage (2004/2006)
     CRUSH – hashing for the real world (2005)
     Paxos monitors – cluster consensus (2006)
    → emphasis on consistent, reliable storage
    → scale by pushing intelligence to the edges
    → a different but compelling architecture
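    CRUSH replaces a central placement lookup with a deterministic calculation that any client can perform. The sketch below is not CRUSH itself; it uses rendezvous (highest-random-weight) hashing, a simpler relative, purely to show hash-based placement with no metadata server in the data path. The OSD names are invented.

        import hashlib

        def place(object_name, osds, replicas=3):
            # Score every OSD deterministically from (object, osd); every client
            # computes the same ranking, so no central lookup table is needed.
            scored = sorted(
                ((int(hashlib.sha1(f"{object_name}:{osd}".encode()).hexdigest(), 16), osd)
                 for osd in osds),
                reverse=True)
            return [osd for _, osd in scored[:replicas]]

        # Adding or removing an OSD only remaps a fraction of objects.
        print(place("rbd_data.1234", ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]))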
6. (graphics slide: broken template placeholder, no recoverable text)
7. Industry black hole
     Many large storage vendors
     Proprietary solutions that don't scale well
     Few open source alternatives (2006)
     Very limited scale, or
     Limited community and architecture (Lustre)
     No enterprise feature sets (snapshots, quotas)
     PhD grads all built interesting systems...
     ...and then went to work for NetApp, DDN, EMC, Veritas
     They want you, not your project
8. A different path
     Change the world with open source
     Do what Linux did to Solaris, Irix, Ultrix, etc.
     What could go wrong?
     License
     GPL, BSD...
     LGPL: share changes, okay to link to proprietary code
     Avoid community-unfriendly practices
     No dual licensing
     No copyright assignment
9. Incubation
10. DreamHost!
     Move back to Los Angeles, continue hacking
     Hired a few developers
     Pure development
     No deliverables
11. Ambitious feature set
     Native Linux kernel client (2007-)
     Per-directory snapshots (2008)
     Recursive accounting (2008)
     Object classes (2009)
     librados (2009)
     radosgw (2009)
     Strong authentication (2009)
     RBD: RADOS block device (2010)
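    librados exposes the RADOS object store directly to applications. A minimal sketch using the python-rados bindings; the config file path and the 'data' pool name are assumptions for illustration.

        import rados

        # Connect using a local ceph.conf (assumed path) and an existing 'data' pool
        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        try:
            ioctx = cluster.open_ioctx('data')
            try:
                ioctx.write_full('hello_object', b'hello ceph')   # store an object
                print(ioctx.read('hello_object'))                 # read it back
                ioctx.set_xattr('hello_object', 'lang', b'en')    # attach an attribute
            finally:
                ioctx.close()
        finally:
            cluster.shutdown()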
12. The kernel client
     ceph-fuse was limited, not very fast
     Build native Linux kernel implementation
     Began attending Linux file system developer events (LSF)
     Early words of encouragement from ex-Lustre devs
     Engage the Linux fs developer community as a peer
     Eventually merged CephFS client for v2.6.34 (early 2010)
     RBD client merged in 2011
13. Part of a larger ecosystem
     Ceph need not solve all problems as a monolithic stack
     Replaced the ebofs object file system with btrfs
     Same design goals
     Robust, well optimized
     Kernel-level cache management
     Copy-on-write, checksumming, other goodness
     Contributed some early functionality
     Cloning files
     Async snapshots
14. Budding community
     #ceph on irc.oftc.net, ceph-devel@vger.kernel.org
     Many interested users
     A few developers
     Many fans
     Too unstable for any real deployments
     Still mostly focused on the right architecture and technical solutions
15. Road to product
     DreamHost decides to build an S3-compatible object storage service with Ceph
     Stability
     Focus on core RADOS, RBD, radosgw
     Paying back some technical debt
     Build testing automation
     Code review!
     Expand engineering team
16. The reality
     Growing incoming commercial interest
     Early attempts from organizations large and small
     Difficult to engage with a web hosting company
     No means to support commercial deployments
     Project needed a company to back it
     Fund the engineering effort
     Build and test a product
     Support users
     Bryan built a framework to spin out of DreamHost
17. Launch
18. Do it right
     How do we build a strong open source company?
     How do we build a strong open source community?
     Models?
     Red Hat, Cloudera, MySQL, Canonical, …
     Initial funding from DreamHost, Mark Shuttleworth
19. Goals
     A stable Ceph release for production deployment
     DreamObjects
     Lay foundation for widespread adoption
     Platform support (Ubuntu, Red Hat, SUSE)
     Documentation
     Build and test infrastructure
     Build a sales and support organization
     Expand engineering organization
20. Branding
     Early decision to engage a professional agency
     MetaDesign
     Terms like
     “Brand core”
     “Design system”
     Keep project and company independent
     Inktank != Ceph
     The Future of Storage
21. Slick graphics
     broken PowerPoint template
22. Today: adoption
23. Traction
     Too many production deployments to count
     We don't know about most of them!
     Too many customers (for me) to count
     Expansive partner list
     Lots of inbound
     Lots of press and buzz
24. Quality
     Increased adoption means increased demands on robust testing
     Across multiple platforms
     Upgrades
     Rolling upgrades
     Inter-version compatibility
25. Developer community
     Significant external contributors
     Many full-time contributors outside of Inktank
     First-class feature contributions from the community
     Non-Inktank participants in daily stand-ups
     External access to build/test lab infrastructure
     Common toolset
     GitHub
     Email (kernel.org)
     IRC (oftc.net)
     Linux distros
26. CDS: Ceph Developer Summit
     Community process for building the project roadmap
     100% online
     Google Hangouts
     Wikis
     Etherpad
     Quarterly
     Our 4th CDS next week
     Great participation
     Ongoing indoctrination of Inktank engineers into the open development model
27. Erasure coding
     Replication for redundancy is flexible and fast
     For larger clusters, it can be expensive
     Erasure-coded data is hard to modify, but ideal for cold or read-only objects
     Will be used directly by radosgw
     Coexists with the new tiering capability

                       Storage overhead   Repair traffic   MTTDL (days)
     3x replication    3x                 1x               2.3e10
     RS (10, 4)        1.4x               10x              3.3e13
     LRC (10, 6, 5)    1.6x               5x               1.2e15
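    The storage-overhead column is simple arithmetic: a code with k data chunks and m coding chunks stores k + m chunks while holding only k chunks of user data. A quick check of the figures above:

        def storage_overhead(k, m):
            # k data chunks plus m coding chunks stored, k chunks of user data
            return (k + m) / k

        print(storage_overhead(1, 2))    # 3.0 -> 3x replication (one copy plus two extras)
        print(storage_overhead(10, 4))   # 1.4 -> RS(10,4)
        print(storage_overhead(10, 6))   # 1.6 -> LRC(10,6,5)

    The repair-traffic column follows from the same parameters: plain Reed-Solomon must read k surviving chunks to rebuild a lost one (10x for RS(10,4)), LRC's local parity groups only need about 5 chunks here, and replication reads a single surviving copy.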
28. Tiering
     Client-side caches are great, but only buy so much
     Separate hot and cold data onto different storage devices
     Promote hot objects into a faster (e.g., flash-backed) cache pool
     Push cold objects back into a slower (e.g., erasure-coded) base pool
     Use bloom filters to track temperature
     Common in enterprise solutions; not found in open source scale-out systems
    → new (with erasure coding) in the Firefly release
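    The temperature tracking mentioned above relies on bloom filters: a compact, probabilistic record of recently accessed objects (false positives are possible, false negatives are not). A minimal standalone sketch of the idea, not Ceph's actual hit-set implementation:

        import hashlib

        class BloomFilter:
            def __init__(self, num_bits=8192, num_hashes=4):
                self.num_bits = num_bits
                self.num_hashes = num_hashes
                self.bits = bytearray(num_bits // 8)

            def _positions(self, key):
                for i in range(self.num_hashes):
                    digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
                    yield int(digest, 16) % self.num_bits

            def add(self, key):
                for pos in self._positions(key):
                    self.bits[pos // 8] |= 1 << (pos % 8)

            def __contains__(self, key):
                return all(self.bits[pos // 8] & (1 << (pos % 8))
                           for pos in self._positions(key))

        # Record recent accesses; objects not seen here are candidates for
        # demotion from the cache pool to the cold (e.g., erasure-coded) base pool.
        recent = BloomFilter()
        recent.add("rbd_data.1234.0000000000000001")
        print("rbd_data.1234.0000000000000001" in recent)  # True
        print("some_cold_object" in recent)                # False (almost certainly)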
29. The Future
30. Technical roadmap
     How do we reach new use-cases and users?
     How do we better satisfy existing users?
     How do we ensure Ceph can succeed in enough markets for supporting organizations to thrive?
     Enough breadth to expand and grow the community
     Enough focus to do well
31. Multi-datacenter, geo-replication
     Ceph was originally designed for single-DC clusters
     Synchronous replication
     Strong consistency
     Growing demand
     Enterprise: disaster recovery
     ISPs: replicating data across sites for locality
     Two strategies:
     Use-case specific: radosgw, RBD
     Low-level capability in RADOS
32. RGW: Multi-site and async replication
     Multi-site, multi-cluster
     Regions: east coast, west coast, etc.
     Zones: radosgw sub-cluster(s) within a region
     Can federate across the same or multiple Ceph clusters
     Sync user and bucket metadata across regions
     Global bucket/user namespace, like S3
     Synchronize objects across zones
     Within the same region
     Across regions
     Admin control over which zones are master/slave
33. RBD: block devices
     Today: backup capability
     Based on block device snapshots
     Efficiently mirror changes between consecutive snapshots across clusters
     Now supported/orchestrated by OpenStack
     Good for coarse synchronization (e.g., hours or days)
     Tomorrow: data journaling for async mirroring
     Pending blueprint at next week's CDS
     Mirror active block device to remote cluster
     Possibly with some configurable delay
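    The snapshot-based mirroring above comes down to walking only the extents that changed between two consecutive snapshots and replaying them on the remote copy. A rough sketch of that idea using the python rbd bindings' diff_iterate; the pool, image, and snapshot names are assumptions, and the exact binding signature may vary between releases.

        import rados
        import rbd

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # assumed config path
        cluster.connect()
        ioctx = cluster.open_ioctx('rbd')                      # assumed pool name

        # Open the image at the newer snapshot and ask for extents changed since
        # the older snapshot; the callback receives (offset, length, exists).
        image = rbd.Image(ioctx, 'vm-disk-1', snapshot='snap-tuesday')
        changed = []
        image.diff_iterate(0, image.size(), 'snap-monday',
                           lambda offset, length, exists:
                               changed.append((offset, length, exists)))
        image.close()

        # Only these extents need to be read and copied to the remote cluster.
        for offset, length, exists in changed:
            print(offset, length, exists)

        ioctx.close()
        cluster.shutdown()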
34. Async replication in RADOS
     One implementation to capture multiple use-cases
     RBD, CephFS, RGW, … RADOS
     A harder problem
     Scalable: 1000s of OSDs → 1000s of OSDs
     Point-in-time consistency
     Challenging research problem
    → Ongoing design discussion among developers
35. CephFS
    → This is where it all started – let's get there
     Today
     Stabilization of multi-MDS, directory fragmentation, QA
     NFS, CIFS, Hadoop/HDFS bindings complete but not productized
     Need
     Greater QA investment
     Fsck
     Snapshots
     Amazing community effort (Intel, NUDT and Kylin)
     2014 is the year
36. Governance
    How do we strengthen the project community?
     2014 is the year
     Recognized project leads
     RBD, RGW, RADOS, CephFS, ...
     Formalize emerging processes around CDS, community roadmap
     External foundation?
37. The larger ecosystem
38. The enterprise
    How do we pay for all of this?
     Support legacy and transitional client/server interfaces
     iSCSI, NFS, pNFS, CIFS, S3/Swift
     VMware, Hyper-V
     Identify the beachhead use-cases
     Earn others later
     Single platform – shared storage resource
     Bottom-up: earn the respect of engineers and admins
     Top-down: strong brand and compelling product
39. Why Ceph is the Future of Storage
     It is hard to compete with free and open source software
     Unbeatable value proposition
     Ultimately a more efficient development model
     It is hard to manufacture community
     Strong foundational architecture
     Next-generation protocols, Linux kernel support
     Unencumbered by legacy protocols like NFS
     Move from client/server to client/cluster
     Ongoing paradigm shift
     Software-defined infrastructure, data center
     Widespread demand for open platforms
40. Thank you, and Welcome!
