London Ceph Day Keynote: Building Tomorrow's Ceph



Sage Weil, Creator of Ceph, Founder & CTO, Inktank


Presentation Transcript

  • Building Tomorrow's Ceph Sage Weil
  • Research beginnings
  • UCSC research grant ● “Petascale object storage” ● DOE: LANL, LLNL, Sandia ● Scalability ● Reliability ● Performance ● Raw IO bandwidth, metadata ops/sec ● HPC file system workloads ● Thousands of clients writing to same file, directory
  • Distributed metadata management ● Innovative design ● Subtree-based partitioning for locality, efficiency ● Dynamically adapt to current workload ● Embedded inodes ● Prototype simulator in Java (2004) ● First line of Ceph code ● Summer internship at LLNL ● High security national lab environment ● Could write anything, as long as it was OSS
  • The rest of Ceph ● RADOS – distributed object storage cluster (2005) ● EBOFS – local object storage (2004/2006) ● CRUSH – hashing for the real world (2005) ● Paxos monitors – cluster consensus (2006) → emphasis on consistent, reliable storage → scale by pushing intelligence to the edges → a different but compelling architecture
  • Industry black hole ● Many large storage vendors ● Proprietary solutions that don't scale well ● Few open source alternatives (2006) ● Very limited scale, or ● Limited community and architecture (Lustre) ● No enterprise feature sets (snapshots, quotas) ● PhD grads all built interesting systems... ● ...and then went to work for NetApp, DDN, EMC, Veritas. ● They want you, not your project
  • A different path ● Change the world with open source ● Do what Linux did to Solaris, Irix, Ultrix, etc. ● What could go wrong? ● License ● GPL, BSD... ● LGPL: share changes, okay to link to proprietary code ● Avoid unsavory practices ● Dual licensing ● Copyright assignment
  • Incubation
  • DreamHost! ● Move back to LA, continue hacking ● Hired a few developers ● Pure development ● No deliverables
  • Ambitious feature set ● Native Linux kernel client (2007-) ● Per-directory snapshots (2008) ● Recursive accounting (2008) ● Object classes (2009) ● librados (2009) ● radosgw (2009) ● strong authentication (2009) ● RBD: rados block device (2010)
  • The kernel client ● ceph-fuse was limited, not very fast ● Build native Linux kernel implementation ● Began attending Linux file system developer events (LSF) ● Early words of encouragement from ex-Lustre devs ● Engage Linux fs developer community as peer ● Initial merge attempts rejected by Linus ● Not sufficient evidence of user demand ● A few fans and would-be users chimed in... ● Eventually merged for v2.6.34 (early 2010)
  • Part of a larger ecosystem ● Ceph need not solve all problems as monolithic stack ● Replaced ebofs object file system with btrfs ● Same design goals ● Avoid reinventing the wheel ● Robust, well-supported, well optimized ● Kernel-level cache management ● Copy-on-write, checksumming, other goodness ● Contributed some early functionality ● Cloning files ● Async snapshots
  • Budding community ● #ceph on, ● Many interested users ● A few developers ● Many fans ● Too unstable for any real deployments ● Still mostly focused on right architecture and technical solutions
  • Road to product ● DreamHost decides to build an S3-compatible object storage service with Ceph ● Stability ● Focus on core RADOS, RBD, radosgw ● Paying back some technical debt ● Build testing automation ● Code review! ● Expand engineering team
  • The reality ● Growing incoming commercial interest ● Early attempts from organizations large and small ● Difficult to engage with a web hosting company ● No means to support commercial deployments ● Project needed a company to back it ● Fund the engineering effort ● Build and test a product ● Support users ● Bryan built a framework to spin out of DreamHost
  • Launch
  • Do it right ● How do we build a strong open source company? ● How do we build a strong open source community? ● Models? ● Red Hat, Cloudera, MySQL, Canonical, … ● Initial funding from DreamHost, Mark Shuttleworth
  • Goals ● A stable Ceph release for production deployment ● DreamObjects ● Lay foundation for widespread adoption ● Platform support (Ubuntu, Red Hat, SUSE) ● Documentation ● Build and test infrastructure ● Build a sales and support organization ● Expand engineering organization
  • Branding ● Early decision to engage professional agency ● MetaDesign ● Terms like ● “Brand core” ● “Design system” ● Project vs Company ● Shared / Separate / Shared core ● Inktank != Ceph ● Aspirational messaging: The Future of Storage
  • Slick graphics ● Broken PowerPoint template
  • Today: adoption
  • Traction ● Too many production deployments to count ● We don't know about most of them! ● Too many customers (for me) to count ● Growing partner list ● Lots of inbound ● Lots of press and buzz
  • Quality ● Increased adoption means increased demands on robust testing ● Across multiple platforms ● Include platforms we don't like ● Upgrades ● Rolling upgrades ● Inter-version compatibility ● Expanding user community + less noise about bugs = a good sign
  • Developer community ● Significant external contributors ● First-class feature contributions from contributors ● Non-Inktank participants in daily Inktank stand-ups ● External access to build/test lab infrastructure ● Common toolset ● GitHub ● Email ● IRC ● Linux distros
  • CDS: Ceph Developer Summit ● Community process for building project roadmap ● 100% online ● Google hangouts ● Wikis ● Etherpad ● First was this Spring, second is next week ● Great feedback, growing participation ● Indoctrinating our own developers to an open development model
  • The Future
  • Governance: How do we strengthen the project community? ● 2014 is the year ● Might formally acknowledge my role as BDL ● Recognized project leads ● RBD, RGW, RADOS, CephFS ● Formalize processes around CDS, community roadmap ● External foundation?
  • Technical roadmap ● How do we reach new use-cases and users? ● How do we better satisfy existing users? ● How do we ensure Ceph can succeed in enough markets for Inktank to thrive? ● Enough breadth to expand and grow the community ● Enough focus to do well
  • Tiering ● Client side caches are great, but only buy so much. ● Can we separate hot and cold data onto different storage devices? ● Cache pools: promote hot objects from an existing pool into a fast (e.g., FusionIO) pool ● Cold pools: demote cold data to a slow, archival pool (e.g., erasure coding) ● How do you identify what is hot and cold? ● Common in enterprise solutions; not found in open source scale-out systems → key topic at CDS next week
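The tiering slide leaves "how do you identify what is hot and cold?" as an open question. One naive possibility, sketched here purely for illustration, is to count how often each object was accessed within a recent time window and call it hot once the count reaches a threshold. The class name, the window, and the threshold are all hypothetical; nothing here is Ceph's actual mechanism.

```python
from collections import deque


class HotColdTracker:
    """Toy sliding-window access counter (hypothetical, not a Ceph API)."""

    def __init__(self, window_seconds: float, hot_threshold: int):
        self.window = window_seconds
        self.threshold = hot_threshold
        self.hits = {}  # object name -> deque of access timestamps

    def record_access(self, name: str, now: float) -> None:
        """Remember that `name` was read or written at time `now`."""
        self.hits.setdefault(name, deque()).append(now)

    def recent_hits(self, name: str, now: float) -> int:
        """Count accesses newer than `now - window`, dropping older ones."""
        timestamps = self.hits.get(name)
        if timestamps is None:
            return 0
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        return len(timestamps)

    def is_hot(self, name: str, now: float) -> bool:
        """Hot objects are candidates for promotion into the cache pool."""
        return self.recent_hits(name, now) >= self.threshold


tracker = HotColdTracker(window_seconds=60, hot_threshold=3)
for t in (0, 10, 20):
    tracker.record_access("obj-a", now=t)
print(tracker.is_hot("obj-a", now=30))   # accessed three times in the window
print(tracker.is_hot("obj-a", now=100))  # all accesses have aged out
```

A real system would also have to bound the memory spent on tracking and decide when demotion is worth the IO, which this sketch ignores.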
  • Erasure coding ● Replication for redundancy is flexible and fast ● For larger clusters, it can be expensive
                         Storage overhead   Repair traffic   MTTDL (days)
        3x replication   3x                 1x               2.3 E10
        RS (10, 4)       1.4x               10x              3.3 E13
        LRC (10, 6, 5)   1.6x               5x               1.2 E15
    ● Erasure coded data is hard to modify, but ideal for cold or read-only objects ● Cold storage tiering ● Will be used directly by radosgw
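The storage overhead figures on the erasure coding slide follow from simple arithmetic: raw bytes stored divided by bytes of user data. The sketch below assumes that RS (10, 4) means 10 data chunks plus 4 parity chunks, and that LRC (10, 6, 5) means 10 data chunks plus 6 parity chunks in total; the slide does not spell this out, and the function name is illustrative only.

```python
from fractions import Fraction


def storage_overhead(data_chunks: int, redundancy_chunks: int) -> Fraction:
    """Raw chunks stored per chunk of user data."""
    if data_chunks <= 0 or redundancy_chunks < 0:
        raise ValueError("need at least one data chunk and no negative redundancy")
    return Fraction(data_chunks + redundancy_chunks, data_chunks)


# (data chunks, redundancy chunks); replication is 1 original + 2 extra copies.
schemes = {
    "3x replication": (1, 2),
    "RS (10, 4)": (10, 4),      # assumed: 10 data + 4 parity
    "LRC (10, 6, 5)": (10, 6),  # assumed: 10 data + 6 parity in total
}

for name, (k, m) in schemes.items():
    print(f"{name}: {float(storage_overhead(k, m)):.1f}x")
```

Repair traffic and MTTDL depend on failure and repair models that the slide does not describe, so they are not reproduced here.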
  • Multi-datacenter, geo-replication ● Ceph was originally designed for single DC clusters ● Synchronous replication ● Strong consistency ● Growing demand ● Enterprise: disaster recovery ● ISPs: replicating data across sites for locality ● Two strategies: ● use-case specific: radosgw, RBD ● low-level capability in RADOS
  • RGW: Multi-site and async replication ● Multi-site, multi-cluster ● Regions: east coast, west coast, etc. ● Zones: radosgw sub-cluster(s) within a region ● Can federate across same or multiple Ceph clusters ● Sync user and bucket metadata across regions ● Global bucket/user namespace, like S3 ● Synchronize objects across zones ● Within the same region ● Across regions ● Admin control over which zones are master/slave
  • RBD: simple DR via snapshots ● Simple backup capability ● Based on block device snapshots ● Efficiently mirror changes between consecutive snapshots across clusters ● Now supported/orchestrated by OpenStack ● Good for coarse synchronization (e.g., hours) ● Not real-time
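The idea of mirroring only the changes between consecutive snapshots can be illustrated with a toy in-memory model. Here a snapshot is assumed to be a dict mapping block index to block contents; the function names are hypothetical and this is not the RBD API, only a sketch of the diff-and-apply pattern.

```python
def snapshot_diff(old: dict, new: dict):
    """Return (changed, removed) blocks needed to turn `old` into `new`."""
    changed = {idx: data for idx, data in new.items() if old.get(idx) != data}
    removed = [idx for idx in old if idx not in new]
    return changed, removed


def apply_diff(image: dict, changed: dict, removed: list) -> dict:
    """Apply a diff to the remote copy, returning the updated image."""
    result = dict(image)
    result.update(changed)
    for idx in removed:
        result.pop(idx, None)
    return result


snap1 = {0: b"aaaa", 1: b"bbbb", 2: b"cccc"}
snap2 = {0: b"aaaa", 1: b"BBBB", 3: b"dddd"}

# Only blocks 1 and 3 (changed) and block 2 (removed) cross the wire.
changed, removed = snapshot_diff(snap1, snap2)
remote = apply_diff(snap1, changed, removed)
print(remote == snap2)
```

Because only the delta is shipped, the cost of each sync scales with how much changed since the previous snapshot, which fits the coarse, non-real-time synchronization the slide describes.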
  • Async replication in RADOS ● One implementation to capture multiple use-cases ● RBD, CephFS, RGW, … RADOS ● A harder problem ● Scalable: 1000s OSDs → 1000s of OSDs ● Point-in-time consistency ● Three challenges ● Infer a partial ordering of events in the cluster ● Maintain a stable timeline to stream from – either checkpoints or event stream ● Coordinated roll-forward at destination – do not apply any update until we know we have everything that happened before it
  • CephFS → This is where it all started – let's get there ● Today ● QA coverage and bug squashing continues ● NFS and CIFS now largely complete and robust ● Need ● Multi-MDS ● Directory fragmentation ● Snapshots ● QA investment ● Amazing community effort
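The roll-forward rule ("do not apply any update until we know we have everything that happened before it") can be sketched under a strong simplifying assumption: every update carries a sequence number in a single total order starting at 0. The slide only speaks of a partial ordering, so this is a deliberately reduced illustration, and the class and method names are hypothetical.

```python
class RollForwardApplier:
    """Buffers out-of-order updates and applies only a gap-free prefix."""

    def __init__(self):
        self.next_seq = 0   # lowest sequence number not yet applied
        self.pending = {}   # seq -> (key, value) waiting for predecessors
        self.state = {}     # destination copy of the data

    def receive(self, seq: int, key: str, value: str) -> list:
        """Accept one update; return the sequence numbers applied as a result."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate delivery, nothing to do
        self.pending[seq] = (key, value)
        applied = []
        # Apply as long as the next expected update has arrived.
        while self.next_seq in self.pending:
            k, v = self.pending.pop(self.next_seq)
            self.state[k] = v
            applied.append(self.next_seq)
            self.next_seq += 1
        return applied


dest = RollForwardApplier()
print(dest.receive(1, "x", "second"))  # held back: update 0 is still missing
print(dest.receive(0, "x", "first"))   # unblocks both updates, in order
print(dest.state)
```

The destination therefore never exposes a state that skips an earlier update, at the cost of stalling whenever a predecessor is delayed.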
  • The larger ecosystem
  • Big data: When will we stop talking about MapReduce? Why is “big data” built on such a lame storage model? ● Move computation to the data ● Evangelize RADOS classes ● librados case studies and proof points ● Build a general purpose compute and storage platform
  • The enterprise: How do we pay for all our toys? ● Support legacy and transitional interfaces ● iSCSI, NFS, pNFS, CIFS ● VMware, Hyper-V ● Identify the beachhead use-cases ● Only takes one use-case to get in the door ● Earn others later ● Single platform – shared storage resource ● Bottom-up: earn respect of engineers and admins ● Top-down: strong brand and compelling product
  • Why we can beat the old guard ● It is hard to compete with free and open source software ● Unbeatable value proposition ● Ultimately a more efficient development model ● It is hard to manufacture community ● Strong foundational architecture ● Native protocols, Linux kernel support ● Unencumbered by legacy protocols like NFS ● Move beyond traditional client/server model ● Ongoing paradigm shift ● Software defined infrastructure, data center
  • Thank you, and Welcome!