ZFS & Zones:
Your Compute fell into
My Data!
Bryan Cantrill
SVP, Engineering
bryan@joyent.com
@bcantrill

Manta: a new internet-facing object storage facility that features compute by Bryan Cantrill


As the amount of unstructured data has greatly exceeded a single computer's ability to process it, data has become increasingly isolated from the compute elements. The resulting haul from stores of record (e.g., SAN, NAS, S3) to transient compute (e.g., Hadoop, EC2) creates needless mechanical work and human labor. Is there a better way? In this talk, we'll explore the coming convergence of data and compute in the cloud, focusing in particular on Joyent's Manta, a new internet-facing object storage facility that features compute. We will describe the design principles for Manta, the engineering challenges in building it, and more generally, the opportunities presented by the convergence of compute and data.

Published in: Technology


  1. ZFS & Zones: Your Compute fell into My Data!
     Bryan Cantrill, SVP, Engineering
     bryan@joyent.com, @bcantrill
  2. The filesystem: Some prehistory
     • When they were originally developed in the 1970s, filesystems were designed as an abstraction over a disk
     • Over time, it became increasingly expensive to make bigger disks, and reliability suffered
     • In the 1980s, both problems were solved by using many hard drives instead of just larger and larger drives: a redundant array of inexpensive disks (RAID)
     • Even though filesystems were still relatively young at the time, it was deemed too complicated to rewrite them to accommodate the (new) notion of many disks
     • This software problem was solved by introducing a new layer of software: the volume manager
  3. The volume management divide
     • Volume management abstracts many physical devices into single logical volumes, allowing filesystems to retain a one-to-one mapping with a device (a logical one)
     • This gave rise to a problematic divide:
        • The volume manager understands multiple disks, but nothing of the higher-level semantics of the filesystem
        • The filesystem understands the higher-level semantics of the data, but has no understanding of the physical devices
     • This divide became entrenched over the 1990s, and had devastating ramifications for reliability, performance and manageability
  4. Volume management deficiencies
     • Because the volume management layer had no notion of the transactional semantics of the filesystem, system failure induced excruciating file system checks
     • Worse, the system was left with no protection against many variants of device-level data corruption:
        • The only failure the volume manager can reasonably detect is media failure that results in incorrect data on disk
        • This doesn’t account for phantom reads (i.e., the wrong disk block is read from), phantom writes (i.e., the wrong disk block is written to) or driver pathologies (e.g., memory errors)
     • And because filesystems did not understand more than one device, device failure often meant filesystem failure
  5. Volume management deficiencies
     • Lacking visibility into the hardware layer, the filesystem could not effectively use the parallelism inherent in multiple disks, and could not effectively schedule I/O
     • Spindles were underutilized (leaving bandwidth and/or IOPS on the table) or overutilized (thrashing the device and yielding pathological performance)
     • Management was a nightmare: filesystems could not be expanded or shrunk, requiring every filesystem to know in advance its intended capacity
  6. The ZFS revolution
     • Starting in 2001, Sun began a revolutionary new software effort: to unify storage and eliminate the divide
     • In this model, filesystems would lose their one-to-one association with devices: many filesystems would be multiplexed on many devices
     • By starting with a clean sheet of paper, ZFS opened up vistas of innovation, and by its architecture was able to solve many otherwise intractable problems
     • Sun shipped ZFS in 2005, and used it as the foundation of its enterprise storage products starting in 2008
     • ZFS was open sourced in 2005; it remains the only open source enterprise-grade filesystem
  7. ZFS advantages
     • Copy-on-write design allows on-disk consistency to be always assured (eliminating file system check)
     • Copy-on-write design allows constant-time snapshots in unlimited quantity, and writable clones!
     • Filesystem architecture allows filesystems to be created instantly and expanded (or shrunk!) on the fly
     • Integrated volume management allows for intelligent device behavior with respect to disk failure and recovery
     • Adaptive replacement cache (ARC) allows for optimal use of DRAM, especially on high-DRAM systems
     • Support for dedicated log and cache devices allows for optimal use of flash-based SSDs
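The snapshot, clone and resize points above map onto a handful of `zfs` subcommands. The following is a minimal dry-run sketch, not taken from the talk: the pool and dataset names (`tank`, `tank/home`, `tank/test`) and the `zfs_do` wrapper are hypothetical, and using `quota` to stand in for "expanded or shrunk" is an assumption about what the slide has in mind. Commands are only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Dry-run walk-through: each zfs command is printed, not executed, unless RUN=1.
# All dataset names below are hypothetical.
zfs_do() {
    if [ "${RUN:-0}" = 1 ]; then zfs "$@"; else echo "zfs $*"; fi
}

zfs_do create tank/home                          # new filesystem, no pre-sizing needed
zfs_do set quota=10G tank/home                   # cap how much of the pool it may use...
zfs_do set quota=5G tank/home                    # ...and lower the cap later, online
zfs_do snapshot tank/home@before-upgrade         # point-in-time snapshot
zfs_do clone tank/home@before-upgrade tank/test  # writable clone of that snapshot
zfs_do rollback tank/home@before-upgrade         # return the filesystem to the snapshot
```

Because filesystems draw from a shared pool, there is no volume to pre-size; the quota is the only thing that changes when a filesystem is "resized" in this sketch.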
  8. ZFS at Joyent
     • Joyent was the earliest ZFS adopter: becoming (in 2005) the first production user of ZFS outside of Sun
     • ZFS is one of the four foundational technologies of Joyent’s SmartOS, our illumos derivative
        • The other three foundational technologies in SmartOS are DTrace, Zones and KVM
        • Search “fork yeah illumos” for the (uncensored) history of OpenSolaris, illumos, SmartOS and derivatives
     • Joyent has extended ZFS to provide better support for multi-tenant operation with I/O throttling
  9. ZFS as the basis for object storage?
     • As we began to think about building our own internet-facing object store in the fall of 2011, we naturally gravitated to ZFS...
     • We view ZFS as our most foundational differentiator...
     • Could we extend ZFS in some important way that would offer something interesting and compelling?
     • Short answer: meh
  10. Aside: Virtualization in the cloud
      • Operating a public cloud has significant technological and business challenges:
         • From a technological perspective, must deliver highly elastic infrastructure with acceptable quality of service across a broad class of users and applications
         • From a business perspective, must drive utilization as high as possible while still satisfying customer expectations for quality of service
      • These aspirations are in tension: multi-tenancy can significantly degrade quality of service
      • The key enabling technology for multi-tenancy is virtualization, but where in the stack to virtualize?
  11. Hardware-level virtualization?
      • The historical answer (since the 1960s) has been to virtualize at the level of the hardware:
         • A virtual machine is presented upon which each tenant runs an operating system of their choosing
         • There are as many operating systems as tenants
      • The historical motivation for hardware virtualization remains its advantage today: it can run entire legacy stacks unmodified
      • However, hardware virtualization exacts a heavy toll: operating systems are not designed to share resources like DRAM, CPU, I/O devices or the network
      • Hardware virtualization limits tenancy and inhibits performance!
  12. Platform-level virtualization?
      • Virtualizing at the application platform layer addresses the tenancy challenges of hardware virtualization…
      • ...but at the cost of dictating abstraction to the developer
      • This creates the “Google App Engine problem”: developers are in a straitjacket where toy programs are easy, but sophisticated apps are impossible
      • Virtualizing at the application platform layer poses many other challenges:
         • Security, resource containment, language specificity, environment-specific engineering costs
  13. Joyent’s solution: OS-level virtualization
      • Virtualizing at the OS level hits the sweet spot:
         • Single OS (single kernel) allows for efficient use of hardware resources, and therefore allows load factors to be high
         • Disjoint instances are securely compartmentalized by the operating system
         • Gives customers what appears to be a virtual machine (albeit a very fast one) on which to run higher-level software
         • Gives customers PaaS when the abstractions work for them, IaaS when they need more generality
      • OS-level virtualization allows for high levels of tenancy without dictating abstraction or sacrificing efficiency
      • Zones is a bullet-proof implementation of OS-level virtualization, and is the core abstraction in Joyent’s SmartOS
  14. Idea: ZFS + Zones?
  15. Manta: ZFS + Zones!
      • Building a sophisticated distributed system on top of ZFS and zones, we have built Manta, an internet-facing object storage system offering in situ compute
      • That is, the description of compute can be brought to where objects reside instead of having to backhaul objects to transient compute
      • The abstractions made available for computation are anything that can run on the OS...
      • ...and as a reminder, the OS (Unix) was built around the notion of ad hoc unstructured data processing, and allows for remarkably terse expressions of computation
  16. Aside: Unix
      • When Unix appeared in the early 1970s, it was not just a new system, but a new way of thinking about systems
      • Instead of a sealed monolith, the operating system was a collection of small, easily understood programs
      • First Edition Unix (1971) contained many programs that we still use today (ls, rm, cat, mv)
      • Its very name conveyed this minimalist aesthetic: Unix is a homophone of “eunuchs”, a castrated Multics
         “We were a bit oppressed by the big system mentality. Ken wanted to do something simple.” (Dennis Ritchie)
  17. Unix: Let there be light
      • In 1969, Doug McIlroy had the idea of connecting different components:
         “At the same time that Thompson and Ritchie were sketching out a file system, I was sketching out how to do data processing on the blackboard by connecting together cascades of processes”
      • This was the primordial pipe, but it took three years to persuade Thompson to adopt it:
         “And one day I came up with a syntax for the shell that went along with the piping, and Ken said, ‘I’m going to do it!’”
  18. Unix: ...and there was light
      “And the next morning we had this orgy of one-liners.” (Doug McIlroy)
  19. The Unix philosophy
      • The pipe, coupled with the small-system aesthetic, gave rise to the Unix philosophy, as articulated by Doug McIlroy:
         • Write programs that do one thing and do it well
         • Write programs to work together
         • Write programs that handle text streams, because that is a universal interface
      • Four decades later, this philosophy remains the single most important revolution in software systems thinking!
  20. Doug McIlroy v. Don Knuth: FIGHT!
      • In 1986, Jon Bentley posed the challenge that became the Epic Rap Battle of computer science history:
         “Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.”
      • Don Knuth’s solution: an elaborate program in WEB, a Pascal-like literate programming system of his own invention, using a purpose-built algorithm
      • Doug McIlroy’s solution shows the power of the Unix philosophy:
         tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
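To make each stage of McIlroy's pipeline concrete, here is the same one-liner wrapped in a shell function with one comment per stage. The function name `wordfreq` and the sample sentence are illustrative additions, not part of the talk.

```shell
#!/bin/sh
# wordfreq N: read text on stdin, print the N most frequent words with their counts.
wordfreq() {
    tr -cs 'A-Za-z' '\n' |   # turn every run of non-letters into a single newline
        tr 'A-Z' 'a-z' |     # fold everything to lower case
        sort |               # bring identical words together
        uniq -c |            # collapse each group into "count word"
        sort -rn |           # most frequent first
        sed "${1}q"          # quit after N lines
}

# Prints "the" (3 occurrences) and then "and" (2 occurrences);
# the width of the leading padding depends on the local uniq.
printf 'the cat and the dog and the bird\n' | wordfreq 2
```

Every stage reads and writes plain lines of text, which is why the stages can be rearranged or swapped without any of them knowing about the others.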
  21. Big Data: History repeats itself?
      • The original Google MapReduce paper (Dean et al., OSDI ’04) poses a problem disturbingly similar to Bentley’s challenge nearly two decades prior:
         “Count of URL Access Frequency: The map function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair”
      • But the solutions do not adhere to the Unix philosophy...
      • ...and nor do they make use of the substantial Unix foundation for data processing
         • e.g., Appendix A of the OSDI ’04 paper has a 71 line word count in C++, with nary a wc in sight
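For comparison, the URL access frequency problem quoted above can be sketched with the same Unix tools. This is an illustration only: the function name `url_counts` and the sample log lines are made up, and the sketch assumes Common Log Format, where the request path is the 7th whitespace-separated field.

```shell
#!/bin/sh
# url_counts: read access-log lines on stdin, print "count URL", highest count first.
# Assumption: Common Log Format, so the request path is field 7.
url_counts() {
    awk '{ print $7 }' |   # "map": emit one URL per request
        sort |             # group identical URLs together
        uniq -c |          # "reduce": total per URL
        sort -rn           # most requested first
}

url_counts <<'EOF'
10.0.0.1 - - [03/Jul/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [03/Jul/2013:10:00:01 +0000] "GET /about.html HTTP/1.1" 200 128
10.0.0.1 - - [03/Jul/2013:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 512
EOF
```

On a single machine `sort | uniq -c` plays the role of the shuffle and reduce steps; what the pipeline alone does not provide is distribution across many machines.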
  22. Manta: Unix for Big Data
      • Manta allows for an arbitrarily scalable variant of McIlroy’s solution to Bentley’s challenge:
         mfind -t o /bcantrill/public/v7/usr/man | \
             mjob create -o \
             -m "tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c" \
             -r "awk '{ x[\$2] += \$1 } END { for (w in x) { print x[w] \" \" w } }' | sort -rn | sed ${1}q"
      • This description is not only terse, it is high performing: data is left at rest, with the “map” phase doing heavy reduction of the data stream
      • As such, Manta, like Unix, is not merely syntactic sugar; it converges compute and data in a new way
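The division of labor between the `-m` and `-r` phases can be tried locally without Manta. The sketch below is a stand-in, not how Manta executes jobs: `map_phase`, `reduce_phase` and the two sample files are hypothetical, the map phase runs once per input file (as it would once per object), and the reduce phase merges the per-file counts.

```shell
#!/bin/sh
# Local stand-in for the job above: no Manta involved, just the same two phases.
map_phase() {      # $1 = one input file; emits "count word" for that file only
    tr -cs 'A-Za-z' '\n' < "$1" | tr 'A-Z' 'a-z' | sort | uniq -c
}

reduce_phase() {   # $1 = how many words to keep; merges all map outputs from stdin
    awk '{ x[$2] += $1 } END { for (w in x) { print x[w] " " w } }' |
        sort -rn | sed "${1}q"
}

dir=$(mktemp -d)
printf 'the pipe and the shell\n' > "$dir/a.txt"
printf 'the kernel and the pipe\n' > "$dir/b.txt"

# One map invocation per file, then a single reduce over the concatenated results.
for f in "$dir"/*.txt; do map_phase "$f"; done | reduce_phase 3
rm -rf "$dir"
```

Note that each map invocation already collapses its file to one line per distinct word, so only the small per-file summaries reach the reducer; that is the "heavy reduction of the data stream" the slide refers to.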
  23. Manta: CAP tradeoffs
      • Eventual consistency represents the wrong CAP tradeoffs for most; we prefer consistency over availability for writes (but still availability for reads)
      • Many more details: http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/
      • Celebrity endorsement:
  24. Manta: Other design principles
      • Hierarchical storage is an excellent idea (ht: Multics); Manta implements proper directories, delimited with a forward slash
      • Manta implements a snapshot/link hybrid dubbed a snaplink; can be used to effect versioning
      • Manta has full support for CORS headers
      • Manta uses SSH-based HTTP auth for client-side tooling (IETF draft-cavage-http-signatures-00)
      • Manta SDKs exist for node.js, Java, Ruby, Python
      • “npm install manta” for command line interface
  25. Manta and the future of big data
      • We believe compute/data convergence to be the future of big data: stores of record must support computation as a first-class, in situ operation
      • We believe that Unix is a natural way of expressing this computation, and that the OS is the right level at which to virtualize to support this securely
      • We believe that ZFS is the only sane storage underpinning for such a system
      • Manta will surely not be the only system to represent the confluence of these, but it is the first
      • We are actively retooling our software stack in terms of Manta: Manta is changing the way we develop software!
  26. Manta: More information
      • Product page: http://joyent.com/products/manta
      • node.js module: https://github.com/joyent/node-manta
      • Manta documentation: http://apidocs.joyent.com/manta/
      • IRC, e-mail, Twitter, etc.: #manta on freenode, manta@joyent.com, @mcavage, @dapsays, @yunongx, @joyent
