Building Tomorrow's Ceph
Sage Weil
Research beginnings

● UCSC research grant
  ● “Petascale object storage”
  ● DOE: LANL, LLNL, Sandia
● Scalability
● Reliability
● Performance
  ● Raw IO bandwidth, metadata ops/sec
● HPC file system workloads
  ● Thousands of clients writing to same file, directory

Distributed metadata management
● Innovative design
  ● Subtree-based partitioning for locality, efficiency
  ● Dynamically adapt to current workload
  ● Embedded inodes
● Prototype simulator in Java (2004)
● First line of Ceph code
  ● Summer internship at LLNL
  ● High security national lab environment
  ● Could write anything, as long as it was OSS

The rest of Ceph
● RADOS – distributed object storage cluster (2005)
● EBOFS – local object storage (2004/2006)
● CRUSH – hashing for the real world (2005)
● Paxos monitors – cluster consensus (2006)
→ emphasis on consistent, reliable storage
→ scale by pushing intelligence to the edges
→ a different but compelling architecture

Industry black hole
● Many large storage vendors
  ● Proprietary solutions that don't scale well
● Few open source alternatives (2006)
  ● Limited community and architecture (Lustre)
  ● Very limited scale, or
  ● No enterprise feature sets (snapshots, quotas)
● PhD grads all built interesting systems...
  ● ...and then went to work for NetApp, DDN, EMC, Veritas
  ● They want you, not your project

A different path
● Change the world with open source
  ● Do what Linux did to Solaris, Irix, Ultrix, etc.
  ● What could go wrong?
● License
  ● GPL, BSD...
  ● LGPL: share changes, okay to link to proprietary code
● Avoid unsavory practices
  ● Dual licensing
  ● Copyright assignment

Incubation

DreamHost!
● Move back to LA, continue hacking
● Hired a few developers
● Pure development
● No deliverables

Ambitious feature set
● Native Linux kernel client (2007-)
● Per-directory snapshots (2008)
● Recursive accounting (2008), illustrated in the sketch below
● Object classes (2009)
● librados (2009)
● radosgw (2009)
● strong authentication (2009)
● RBD: rados block device (2010)

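One of the items above, recursive accounting, is easy to see from a client: CephFS exposes recursive directory statistics as virtual extended attributes. A minimal sketch, assuming a CephFS mount at the hypothetical path /mnt/ceph/projects and Python 3.3+ on Linux (for os.getxattr):

    import os

    # Hypothetical CephFS mount point; adjust to your deployment.
    path = "/mnt/ceph/projects"

    # CephFS publishes recursive accounting as virtual xattrs on directories:
    #   ceph.dir.rbytes   - total bytes stored under the directory
    #   ceph.dir.rfiles   - number of files under the directory
    #   ceph.dir.rentries - number of files and subdirectories under it
    for attr in ("ceph.dir.rbytes", "ceph.dir.rfiles", "ceph.dir.rentries"):
        raw = os.getxattr(path, attr)                  # bytes, e.g. b"1099511627776"
        print(attr, int(raw.decode().strip("\x00 ")))

The same values can be read from the shell with getfattr -n ceph.dir.rbytes on the directory.
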
The kernel client
● ceph-fuse was limited, not very fast
● Build native Linux kernel implementation
● Began attending Linux file system developer events (LSF)
  ● Early words of encouragement from ex-Lustre devs
  ● Engage Linux fs developer community as peer
● Initial merge attempts rejected by Linus
  ● Not sufficient evidence of user demand
  ● A few fans and would-be users chimed in...
  ● Eventually merged for v2.6.34 (early 2010)

Part of a larger ecosystem
● Ceph need not solve all problems as monolithic stack
● Replaced ebofs object file system with btrfs
  ● Avoid reinventing the wheel
  ● Robust, well-supported, well optimized
  ● Kernel-level cache management
  ● Same design goals
  ● Copy-on-write, checksumming, other goodness
● Contributed some early functionality
  ● Cloning files
  ● Async snapshots

Budding community
● #ceph on irc.oftc.net, ceph-devel@vger.kernel.org
● Many interested users
● A few developers
● Many fans
● Too unstable for any real deployments
● Still mostly focused on right architecture and technical solutions

Road to product
● DreamHost decides to build an S3-compatible object storage service with Ceph (see the S3 example below)
● Stability
  ● Focus on core RADOS, RBD, radosgw
● Paying back some technical debt
  ● Build testing automation
  ● Code review!
● Expand engineering team

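To make "S3-compatible" concrete: radosgw speaks the S3 protocol, so an ordinary S3 client library can talk to it unchanged. A minimal sketch using boto (v2), with a hypothetical gateway hostname and placeholder credentials:

    import boto
    import boto.s3.connection

    # Hypothetical radosgw endpoint and placeholder keys; substitute your own.
    conn = boto.connect_s3(
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
        host="objects.example.com",                    # the radosgw frontend
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket("demo-bucket")         # plain S3 calls, served by radosgw
    key = bucket.new_key("hello.txt")
    key.set_contents_from_string("stored in RADOS via radosgw")
    print([b.name for b in conn.get_all_buckets()])

The only Ceph-specific parts are the endpoint and the keys handed out by radosgw; the bucket and object calls are stock S3.
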
The reality
● Growing incoming commercial interest
  ● Early attempts from organizations large and small
  ● Difficult to engage with a web hosting company
  ● No means to support commercial deployments
● Project needed a company to back it
  ● Build and test a product
  ● Fund the engineering effort
  ● Support users
● Bryan built a framework to spin out of DreamHost

Launch

Do it right
● How do we build a strong open source company?
● How do we build a strong open source community?
● Models?
  ● RedHat, Cloudera, MySQL, Canonical, …
● Initial funding from DreamHost, Mark Shuttleworth

Goals
● A stable Ceph release for production deployment
  ● DreamObjects
● Lay foundation for widespread adoption
  ● Platform support (Ubuntu, Red Hat, SUSE)
  ● Documentation
  ● Build and test infrastructure
● Build a sales and support organization
● Expand engineering organization

Branding
● Early decision to engage professional agency
  ● MetaDesign
● Terms like
  ● “Brand core”
  ● “Design system”
● Project vs Company
  ● Shared / Separate / Shared core
  ● Inktank != Ceph
● Aspirational messaging: The Future of Storage

Slick graphics
● broken PowerPoint template


Today: adoption

Traction
● Too many production deployments to count
  ● We don't know about most of them!
● Too many customers (for me) to count
● Growing partner list
  ● Lots of inbound
● Lots of press and buzz

Quality
● Increased adoption means increased demands on robust testing
  ● Across multiple platforms
  ● Include platforms we don't like
● Upgrades
  ● Rolling upgrades
  ● Inter-version compatibility
● Expanding user community + less noise about bugs = a good sign

Developer community
● Significant external contributors
● First-class feature contributions from contributors
● Non-Inktank participants in daily Inktank stand-ups
● External access to build/test lab infrastructure
● Common toolset
  ● Email (kernel.org)
  ● Github
  ● IRC (oftc.net)
  ● Linux distros

CDS: Ceph Developer Summit
● Community process for building project roadmap
● 100% online
  ● Google hangouts
  ● Wikis
  ● Etherpad
● First was this Spring, second is next week
● Great feedback, growing participation
● Indoctrinating our own developers to an open development model

The Future

Governance
How do we strengthen the project community?
● 2014 is the year
● Might formally acknowledge my role as BDL
● Recognized project leads
  ● RBD, RGW, RADOS, CephFS
● Formalize processes around CDS, community roadmap
● External foundation?

Technical roadmap
● How do we reach new use-cases and users?
● How do we better satisfy existing users?
● How do we ensure Ceph can succeed in enough markets for Inktank to thrive?
  ● Enough breadth to expand and grow the community
  ● Enough focus to do well

Tiering
● Client-side caches are great, but only buy so much.
● Can we separate hot and cold data onto different storage devices?
  ● Cache pools: promote hot objects from an existing pool into a fast (e.g., FusionIO) pool
  ● Cold pools: demote cold data to a slow, archival pool (e.g., erasure coding)
● How do you identify what is hot and cold? (sketch below)
● Common in enterprise solutions; not found in open source scale-out systems
→ key topic at CDS next week

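For the hot/cold question flagged above, one naive framing is to track per-object access recency and frequency and promote or demote on thresholds. The sketch below is purely illustrative; the object names, thresholds, and aging rule are assumptions, not how Ceph's cache pools actually decide:

    import time
    from collections import defaultdict

    class HeatTracker:
        """Toy hot/cold classifier; thresholds and aging rule are made up."""

        def __init__(self, hot_threshold=10, idle_seconds=3600):
            self.hot_threshold = hot_threshold
            self.idle_seconds = idle_seconds
            self.access_count = defaultdict(int)
            self.last_access = {}

        def record_access(self, obj):
            self.access_count[obj] += 1
            self.last_access[obj] = time.time()

        def is_hot(self, obj):
            # Hot: touched recently and often -> candidate for the cache pool.
            last = self.last_access.get(obj, 0)
            recent = (time.time() - last) < self.idle_seconds
            return recent and self.access_count[obj] >= self.hot_threshold

        def cold_objects(self):
            # Cold: idle for a long time -> candidate for a slow/EC pool.
            now = time.time()
            return [o for o, t in self.last_access.items()
                    if now - t > self.idle_seconds]
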
Erasure coding
● Replication for redundancy is flexible and fast
● For larger clusters, it can be expensive (worked numbers below)

                     Storage overhead   Repair traffic   MTTDL (days)
    3x replication   3x                 1x               2.3 E10
    RS (10, 4)       1.4x               10x              3.3 E13
    LRC (10, 6, 5)   1.6x               5x               1.2 E15

● Erasure coded data is hard to modify, but ideal for cold or read-only objects
  ● Cold storage tiering
  ● Will be used directly by radosgw

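The storage-overhead column follows directly from the code parameters: a scheme with k data chunks and m coding chunks stores k+m chunks per k chunks of user data. A quick check of the table's figures (plain arithmetic, not Ceph code; reading the LRC parameters as 10 data + 6 parity chunks is an assumption):

    def overhead(data_chunks, coding_chunks):
        """Raw capacity consumed per byte of user data."""
        return (data_chunks + coding_chunks) / data_chunks

    # 3x replication is the degenerate case: 1 data chunk + 2 extra copies.
    print(overhead(1, 2))    # 3.0x
    print(overhead(10, 4))   # RS(10, 4): 14 chunks per 10 data chunks -> 1.4x
    print(overhead(10, 6))   # LRC(10, 6, 5): 16 chunks per 10 data chunks -> 1.6x
                             # the extra local parity is what cuts repair traffic
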
Multi-datacenter, geo-replication
● Ceph was originally designed for single DC clusters
  ● Synchronous replication
  ● Strong consistency
● Growing demand
  ● Enterprise: disaster recovery
  ● ISPs: replicating data across sites for locality
● Two strategies:
  ● use-case specific: radosgw, RBD
  ● low-level capability in RADOS

RGW: Multi-site and async replication
● Multi-site, multi-cluster
  ● Zones: radosgw sub-cluster(s) within a region
  ● Regions: east coast, west coast, etc.
  ● Can federate across same or multiple Ceph clusters
● Sync user and bucket metadata across regions
  ● Global bucket/user namespace, like S3
● Synchronize objects across zones
  ● Within the same region
  ● Across regions
  ● Admin control over which zones are master/slave

RBD: simple DR via snapshots
● Simple backup capability
  ● Based on block device snapshots
  ● Efficiently mirror changes between consecutive snapshots across clusters (sketch below)
● Now supported/orchestrated by OpenStack
● Good for coarse synchronization (e.g., hours)
  ● Not real-time

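A minimal sketch of the snapshot-diff idea using the python-rbd bindings, assuming a pool named rbd, an image named vm-disk, and two existing snapshots snap-old and snap-new (all hypothetical names). It collects the extents that changed between the two snapshots, which is the data a backup job would ship to the remote cluster:

    import rados
    import rbd

    changed = []                                   # (offset, length) extents that differ

    def record(offset, length, exists):
        # exists=False marks extents discarded/zeroed since the older snapshot
        changed.append((offset, length))

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("rbd")          # hypothetical pool name
        # Open the image at the newer snapshot and diff against the older one.
        image = rbd.Image(ioctx, "vm-disk", snapshot="snap-new")
        try:
            image.diff_iterate(0, image.size(), "snap-old", record)
        finally:
            image.close()
        print("changed extents:", changed)
        ioctx.close()
    finally:
        cluster.shutdown()

The rbd command line exposes the same delta as export-diff, which can be replayed on a remote cluster with import-diff.
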
Async replication in RADOS
● One implementation to capture multiple use-cases
  ● RBD, CephFS, RGW, … RADOS
● A harder problem
  ● Scalable: 1000s of OSDs → 1000s of OSDs
  ● Point-in-time consistency
● Three challenges
  ● Infer a partial ordering of events in the cluster
  ● Maintain a stable timeline to stream from
    – either checkpoints or event stream
  ● Coordinated roll-forward at destination (toy sketch below)
    – do not apply any update until we know we have everything that happened before it

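The roll-forward rule in the last bullet can be stated in a few lines. This is a toy illustration of the ordering constraint only, with a single monotonically increasing version number as a simplifying assumption; it is not Ceph's replication protocol:

    import heapq

    class RollForwardBuffer:
        """Apply updates strictly in version order; hold back anything with a gap."""

        def __init__(self, apply_fn):
            self.apply_fn = apply_fn       # applies one update at the destination
            self.next_version = 1          # lowest version not yet applied
            self.pending = []              # min-heap of (version, update) seen so far

        def receive(self, version, update):
            heapq.heappush(self.pending, (version, update))
            # Never apply an update until everything that happened before it
            # has arrived and been applied.
            while self.pending and self.pending[0][0] == self.next_version:
                _, ready = heapq.heappop(self.pending)
                self.apply_fn(ready)
                self.next_version += 1

    buf = RollForwardBuffer(apply_fn=print)
    buf.receive(2, "write B")   # buffered: version 1 has not arrived yet
    buf.receive(1, "write A")   # applies "write A", then "write B"
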
CephFS
→ This is where it all started – let's get there
● Today
  ● QA coverage and bug squashing continues
  ● NFS and CIFS now largely complete and robust
● Need
  ● Directory fragmentation
  ● Snapshots
  ● Multi-MDS
  ● QA investment
● Amazing community effort

The larger ecosystem
Big data
When will we stop talking about MapReduce?
Why is “big data” built on such a lame storage model?
● Move computation to the data
● Evangelize RADOS classes
● librados case studies and proof points (example below)
● Build a general purpose compute and storage platform

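As a small librados proof point: an application can drive the same cluster that backs RBD and radosgw directly, with no filesystem or gateway in between. A minimal sketch using the python-rados bindings, assuming /etc/ceph/ceph.conf is reachable and a pool named data exists (hypothetical):

    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("data")         # hypothetical pool name
        # The application talks to the OSDs directly: no filesystem,
        # no gateway, just named objects and their metadata.
        ioctx.write_full("greeting", b"move computation to the data")
        ioctx.set_xattr("greeting", "lang", b"en")
        print(ioctx.read("greeting"))
        ioctx.close()
    finally:
        cluster.shutdown()

RADOS object classes go one step further and run small methods next to the data on the OSDs, which is the "move computation to the data" point above.
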
The enterprise
How do we pay for all our toys?
● Support legacy and transitional interfaces
  ● iSCSI, NFS, pNFS, CIFS
  ● VMware, Hyper-V
● Identify the beachhead use-cases
  ● Only takes one use-case to get in the door
  ● Earn others later
● Single platform – shared storage resource
● Bottom-up: earn respect of engineers and admins
● Top-down: strong brand and compelling product

Why we can beat the old guard
● It is hard to compete with free and open source software
  ● Unbeatable value proposition
  ● Ultimately a more efficient development model
● It is hard to manufacture community
● Strong foundational architecture
● Native protocols, Linux kernel support
  ● Unencumbered by legacy protocols like NFS
  ● Move beyond traditional client/server model
● Ongoing paradigm shift
  ● Software defined infrastructure, data center

Thank you, and Welcome!

London Ceph Day Keynote: Building Tomorrow's Ceph
