Building Tomorrow's Ceph
London Ceph Day Keynote
Sage Weil, Creator of Ceph, Founder & CTO, Inktank

1. Building Tomorrow's Ceph
   Sage Weil

2. Research beginnings

3. UCSC research grant
    ● “Petascale object storage”
    ● DOE: LANL, LLNL, Sandia
    ● Scalability
    ● Reliability
    ● Performance
      ● Raw IO bandwidth, metadata ops/sec
    ● HPC file system workloads
      ● Thousands of clients writing to same file, directory

4. Distributed metadata management
    ● Innovative design
      ● Subtree-based partitioning for locality, efficiency
      ● Dynamically adapt to current workload
      ● Embedded inodes
    ● Prototype simulator in Java (2004)
    ● First line of Ceph code
      ● Summer internship at LLNL
      ● High security national lab environment
      ● Could write anything, as long as it was OSS

5. The rest of Ceph
    ● RADOS – distributed object storage cluster (2005)
    ● EBOFS – local object storage (2004/2006)
    ● CRUSH – hashing for the real world (2005) (see the placement sketch below)
    ● Paxos monitors – cluster consensus (2006)
    → emphasis on consistent, reliable storage
    → scale by pushing intelligence to the edges
    → a different but compelling architecture

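The "hashing for the real world" bullet is the heart of the architecture: clients compute an object's location from its name and the current cluster map instead of asking a central lookup service. The sketch below illustrates only that idea; it is not the real CRUSH algorithm, it ignores the device hierarchy, weights, and failure domains, and the OSD names are made up.

    # Simplified illustration of hash-based placement (NOT the real CRUSH algorithm):
    # every client can compute an object's location from the object name and the
    # current device list, so no central lookup table is needed.
    import hashlib

    def place(object_name, devices, replicas=3):
        """Deterministically pick `replicas` distinct devices for an object."""
        chosen = []
        attempt = 0
        while len(chosen) < replicas and attempt < 10 * replicas:
            key = f"{object_name}:{attempt}".encode()
            idx = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % len(devices)
            if devices[idx] not in chosen:      # skip duplicates, retry with a new salt
                chosen.append(devices[idx])
            attempt += 1
        return chosen

    devices = [f"osd.{i}" for i in range(12)]   # hypothetical 12-OSD cluster
    print(place("rbd_data.1234.0000000000000001", devices))

Because placement is a pure function of the name and the device list, any client holding the map can locate data without a metadata round trip, which is what lets the data path scale out.
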
6. Industry black hole
    ● Many large storage vendors
      ● Proprietary solutions that don't scale well
    ● Few open source alternatives (2006)
      ● Very limited scale, or
      ● Limited community and architecture (Lustre)
      ● No enterprise feature sets (snapshots, quotas)
    ● PhD grads all built interesting systems...
      ● ...and then went to work for Netapp, DDN, EMC, Veritas.
      ● They want you, not your project

7. A different path
    ● Change the world with open source
      ● Do what Linux did to Solaris, Irix, Ultrix, etc.
      ● What could go wrong?
    ● License
      ● GPL, BSD...
      ● LGPL: share changes, okay to link to proprietary code
    ● Avoid unsavory practices
      ● Dual licensing
      ● Copyright assignment

8. Incubation

9. DreamHost!
    ● Move back to LA, continue hacking
    ● Hired a few developers
    ● Pure development
    ● No deliverables

10. Ambitious feature set
    ● Native Linux kernel client (2007-)
    ● Per-directory snapshots (2008)
    ● Recursive accounting (2008)
    ● Object classes (2009)
    ● librados (2009)
    ● radosgw (2009)
    ● strong authentication (2009)
    ● RBD: rados block device (2010)

11. The kernel client
    ● ceph-fuse was limited, not very fast
    ● Build native Linux kernel implementation
    ● Began attending Linux file system developer events (LSF)
      ● Early words of encouragement from ex-Lustre devs
      ● Engage Linux fs developer community as peer
    ● Initial attempt to merge was rejected by Linus
      ● Not sufficient evidence of user demand
      ● A few fans and would-be users chimed in...
    ● Eventually merged for v2.6.34 (early 2010)

12. Part of a larger ecosystem
    ● Ceph need not solve all problems as monolithic stack
    ● Replaced ebofs object file system with btrfs
      ● Avoid reinventing the wheel
      ● Same design goals
      ● Robust, well-supported, well optimized
      ● Kernel-level cache management
      ● Copy-on-write, checksumming, other goodness
    ● Contributed some early functionality
      ● Cloning files
      ● Async snapshots

13. Budding community
    ● #ceph on irc.oftc.net, ceph-devel@vger.kernel.org
    ● Many interested users
    ● A few developers
    ● Many fans
    ● Too unstable for any real deployments
    ● Still mostly focused on right architecture and technical solutions

14. Road to product
    ● DreamHost decides to build an S3-compatible object storage service with Ceph
    ● Stability
      ● Focus on core RADOS, RBD, radosgw
    ● Paying back some technical debt
      ● Build testing automation
      ● Code review!
    ● Expand engineering team

15. The reality
    ● Growing incoming commercial interest
      ● Early attempts from organizations large and small
      ● Difficult to engage with a web hosting company
      ● No means to support commercial deployments
    ● Project needed a company to back it
      ● Fund the engineering effort
      ● Build and test a product
      ● Support users
    ● Bryan built a framework to spin out of DreamHost

16. Launch

17. Do it right
    ● How do we build a strong open source company?
    ● How do we build a strong open source community?
    ● Models?
      ● RedHat, Cloudera, MySQL, Canonical, …
    ● Initial funding from DreamHost, Mark Shuttleworth

18. Goals
    ● A stable Ceph release for production deployment
      ● DreamObjects
    ● Lay foundation for widespread adoption
      ● Platform support (Ubuntu, Red Hat, SUSE)
      ● Documentation
      ● Build and test infrastructure
    ● Build a sales and support organization
    ● Expand engineering organization

19. Branding
    ● Early decision to engage professional agency
      ● MetaDesign
    ● Terms like
      ● “Brand core”
      ● “Design system”
    ● Project vs Company
      ● Shared / Separate / Shared core
      ● Inktank != Ceph
    ● Aspirational messaging: The Future of Storage

20. Slick graphics
    ● broken powerpoint template

21. Today: adoption

22. Traction
    ● Too many production deployments to count
      ● We don't know about most of them!
    ● Too many customers (for me) to count
    ● Growing partner list
      ● Lots of inbound
    ● Lots of press and buzz

23. Quality
    ● Increased adoption means increased demands on robust testing
      ● Across multiple platforms
      ● Include platforms we don't like
    ● Upgrades
      ● Rolling upgrades
      ● Inter-version compatibility
    ● Expanding user community + less noise about bugs = a good sign

24. Developer community
    ● Significant external contributors
    ● First-class feature contributions from contributors
    ● Non-Inktank participants in daily Inktank stand-ups
    ● External access to build/test lab infrastructure
    ● Common toolset
      ● Email (kernel.org)
      ● Github
      ● IRC (oftc.net)
      ● Linux distros

25. CDS: Ceph Developer Summit
    ● Community process for building project roadmap
    ● 100% online
      ● Google hangouts
      ● Wikis
      ● Etherpad
    ● First was this Spring, second is next week
    ● Great feedback, growing participation
    ● Indoctrinating our own developers to an open development model

26. The Future

27. Governance
    How do we strengthen the project community?
    ● 2014 is the year
    ● Might formally acknowledge my role as BDL
    ● Recognized project leads
      ● RBD, RGW, RADOS, CephFS
    ● Formalize processes around CDS, community roadmap
    ● External foundation?

28. Technical roadmap
    ● How do we reach new use-cases and users
    ● How do we better satisfy existing users
    ● How do we ensure Ceph can succeed in enough markets for Inktank to thrive
    ● Enough breadth to expand and grow the community
    ● Enough focus to do well

29. Tiering
    ● Client side caches are great, but only buy so much.
    ● Can we separate hot and cold data onto different storage devices?
      ● Cache pools: promote hot objects from an existing pool into a fast (e.g., FusionIO) pool
      ● Cold pools: demote cold data to a slow, archival pool (e.g., erasure coding)
    ● How do you identify what is hot and cold? (see the policy sketch below)
    ● Common in enterprise solutions; not found in open source scale-out systems
    → key topic at CDS next week

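The slide leaves "how do you identify what is hot and cold?" open. One common answer is a frequency-and-recency policy: promote an object after several hits inside a short window, demote it after a long idle period. The sketch below only illustrates that policy; the thresholds and class names are hypothetical and this is not the cache-tier logic Ceph itself uses.

    # Hypothetical hot/cold policy sketch: promote an object after it is accessed
    # `promote_after` times within `window` seconds; demote it after `idle_ttl`
    # seconds without access. Not Ceph's actual cache-tier implementation.
    import time
    from collections import defaultdict, deque

    class TieringPolicy:
        def __init__(self, promote_after=3, window=60.0, idle_ttl=3600.0):
            self.promote_after = promote_after
            self.window = window
            self.idle_ttl = idle_ttl
            self.hits = defaultdict(deque)   # object -> recent access timestamps
            self.hot = set()                 # objects currently in the fast pool

        def on_access(self, obj, now=None):
            now = time.time() if now is None else now
            q = self.hits[obj]
            q.append(now)
            while q and now - q[0] > self.window:    # drop hits outside the window
                q.popleft()
            if obj not in self.hot and len(q) >= self.promote_after:
                self.hot.add(obj)
                return "promote"                     # copy obj into the cache pool
            return "keep"

        def maintenance(self, now=None):
            now = time.time() if now is None else now
            cold = [o for o in self.hot
                    if not self.hits[o] or now - self.hits[o][-1] > self.idle_ttl]
            for o in cold:
                self.hot.discard(o)                  # flush/evict back to the base pool
            return cold
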
30. Erasure coding
    ● Replication for redundancy is flexible and fast
    ● For larger clusters, it can be expensive (see the arithmetic below)

                          Storage overhead   Repair traffic   MTTDL (days)
        3x replication          3x                1x            2.3 E10
        RS (10, 4)              1.4x              10x           3.3 E13
        LRC (10, 6, 5)          1.6x              5x            1.2 E15

    ● Erasure coded data is hard to modify, but ideal for cold or read-only objects
      ● Cold storage tiering
      ● Will be used directly by radosgw

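The storage overhead column follows directly from the code parameters: a scheme with k data chunks and m coding chunks stores (k + m) / k bytes per byte of user data, and repairing a single lost Reed-Solomon chunk must read about k surviving chunks, versus one copy for replication. A quick arithmetic check of the table (the LRC figures reflect its local parity groups):

    # Storage overhead = chunks stored per chunk of user data; repair reads
    # approximate the "Repair traffic" column of the table above.
    def overhead(k, m):
        """k data chunks plus m coding chunks."""
        return (k + m) / k

    print("3x replication:", 3.0)              # three full copies, repair reads 1 copy
    print("RS(10,4):      ", overhead(10, 4))  # 1.4x; repair reads ~10 chunks
    print("LRC(10,6,5):   ", overhead(10, 6))  # 1.6x; local groups cut repair to ~5 chunks
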
31. Multi-datacenter, geo-replication
    ● Ceph was originally designed for single DC clusters
      ● Synchronous replication
      ● Strong consistency
    ● Growing demand
      ● Enterprise: disaster recovery
      ● ISPs: replicating data across sites for locality
    ● Two strategies:
      ● use-case specific: radosgw, RBD
      ● low-level capability in RADOS

32. RGW: Multi-site and async replication
    ● Multi-site, multi-cluster
      ● Regions: east coast, west coast, etc.
      ● Zones: radosgw sub-cluster(s) within a region
      ● Can federate across same or multiple Ceph clusters
    ● Sync user and bucket metadata across regions
      ● Global bucket/user namespace, like S3
    ● Synchronize objects across zones
      ● Within the same region
      ● Across regions
    ● Admin control over which zones are master/slave

33. RBD: simple DR via snapshots
    ● Simple backup capability
      ● Based on block device snapshots
      ● Efficiently mirror changes between consecutive snapshots across clusters (sketch below)
    ● Now supported/orchestrated by OpenStack
    ● Good for coarse synchronization (e.g., hours)
    ● Not real-time

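The mechanism behind this kind of coarse DR is to diff two consecutive snapshots and ship only the changed blocks to the remote cluster. The sketch below is purely illustrative: it models an image as a dict of block number to bytes rather than RBD's actual format or tooling, but it shows why each sync costs in proportion to the amount of change, not to the image size.

    # Illustrative snapshot-diff mirroring: compute the blocks that changed between
    # two snapshots of an image and apply only those blocks to a remote copy.
    # Images are modeled as {block_number: bytes}; this is not RBD's real format.

    def snapshot_diff(snap_old, snap_new):
        """Return ({block: data} for added/modified blocks, set of deleted blocks)."""
        changed = {b: d for b, d in snap_new.items() if snap_old.get(b) != d}
        deleted = set(snap_old) - set(snap_new)
        return changed, deleted

    def apply_diff(remote_image, changed, deleted):
        """Roll the remote copy forward to match the newer snapshot."""
        remote_image.update(changed)
        for b in deleted:
            remote_image.pop(b, None)

    # Example: remote starts in sync with snap1, then receives only the delta to snap2.
    snap1 = {0: b"boot", 1: b"aaaa", 2: b"bbbb"}
    snap2 = {0: b"boot", 1: b"AAAA", 3: b"cccc"}
    remote = dict(snap1)
    apply_diff(remote, *snapshot_diff(snap1, snap2))
    assert remote == snap2
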
34. Async replication in RADOS
    ● One implementation to capture multiple use-cases
      ● RBD, CephFS, RGW, … RADOS
    ● A harder problem
      ● Scalable: 1000s of OSDs → 1000s of OSDs
      ● Point-in-time consistency
    ● Three challenges
      ● Infer a partial ordering of events in the cluster
      ● Maintain a stable timeline to stream from
        – either checkpoints or event stream
      ● Coordinated roll-forward at destination (toy sketch below)
        – do not apply any update until we know we have everything that happened before it

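The roll-forward rule, "do not apply any update until we know we have everything that happened before it", amounts to buffering out-of-order updates at the destination and applying only a contiguous prefix of the source's version sequence. The toy sketch below shows that buffer for a single totally ordered stream; the real problem in RADOS is harder because the ordering across thousands of OSDs is only partial.

    # Toy coordinated roll-forward: updates carry a source version number, may arrive
    # out of order, and are applied only once every earlier version has been seen.
    class RollForwardBuffer:
        def __init__(self):
            self.applied_version = 0     # highest version applied at the destination
            self.pending = {}            # version -> update payload

        def receive(self, version, update):
            self.pending[version] = update
            applied = []
            # Apply the contiguous prefix: stop at the first missing version.
            while self.applied_version + 1 in self.pending:
                self.applied_version += 1
                applied.append(self.pending.pop(self.applied_version))
            return applied               # updates that are now safe to apply

    buf = RollForwardBuffer()
    print(buf.receive(2, "write B"))     # [] -- version 1 has not arrived yet
    print(buf.receive(1, "write A"))     # ['write A', 'write B'] -- prefix complete
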
35. CephFS
    → This is where it all started – let's get there
    ● Today
      ● QA coverage and bug squashing continues
      ● NFS and CIFS now largely complete and robust
      ● Multi-MDS
      ● Directory fragmentation
      ● Snapshots
    ● Need
      ● QA investment
      ● Amazing community effort

36. The larger ecosystem

37. Big data
    When will we stop talking about MapReduce?
    Why is “big data” built on such a lame storage model?
    ● Move computation to the data
    ● Evangelize RADOS classes
    ● librados case studies and proof points (example below)
    ● Build a general purpose compute and storage platform

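librados has bindings in several languages, and the Python binding (python-rados) is a handy proof point for how thin the object interface is. A minimal sketch, assuming a reachable cluster, a readable /etc/ceph/ceph.conf, and an existing pool named "data" (the pool name is an assumption; adjust for your environment):

    # Minimal librados usage via the python-rados binding: connect, write an object,
    # read it back. Assumes /etc/ceph/ceph.conf and a pool named 'data' exist.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("data")       # pool name is an assumption
        try:
            ioctx.write_full("hello_object", b"hello from librados")
            print(ioctx.read("hello_object"))    # b'hello from librados'
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

Object classes go a step further by running code next to the object inside the OSD, which is the "move computation to the data" point above.
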
38. The enterprise
    How do we pay for all our toys?
    ● Support legacy and transitional interfaces
      ● iSCSI, NFS, pNFS, CIFS
      ● VMware, Hyper-V
    ● Identify the beachhead use-cases
      ● Only takes one use-case to get in the door
      ● Earn others later
    ● Single platform – shared storage resource
    ● Bottom-up: earn respect of engineers and admins
    ● Top-down: strong brand and compelling product

39. Why we can beat the old guard
    ● It is hard to compete with free and open source software
      ● Unbeatable value proposition
      ● Ultimately a more efficient development model
    ● It is hard to manufacture community
    ● Strong foundational architecture
    ● Native protocols, Linux kernel support
      ● Unencumbered by legacy protocols like NFS
      ● Move beyond traditional client/server model
    ● Ongoing paradigm shift
      ● Software defined infrastructure, data center

40. Thank you, and Welcome!
