Zebras all the way down
The engineering challenges of the data path
Bryan Cantrill
CTO, Joyent
bryan@joyent.com
@bcantrill
The luxury of statelessness
• In service-oriented software systems, we love statelessness
• And for good reason: stateless components — like finite state
machines — lend systems many desirable properties!
• Stateless components can be easily made immutable, scalable,
re-deployable, restartable, upgradeable, etc. etc.
• Of course, persistent state still very much exists — we just
use separation of concerns to confine the management of state
to those services that do it explicitly and exclusively…
The data path
• The data path consists of the software, hardware, and firmware
components between a service endpoint that offers persistence
and the implementation of that persistence
• The data path always ends in non-volatile storage, which (for
now, anyway) means either flash or magnetic media
• The data path traverses many subsystems and components —
and nearly always is a distributed system itself
• We place great demands upon the data path…
The demands of the data path
• A data path that merely works much of the time is insufficient
• We (rightfully) expect perfection from the data path: we expect it
to be consistent, available and partition-tolerant!
• Of course, Brewer’s CAP theorem tells us that this isn’t actually
possible — we must make tradeoffs
• Even a well-engineered system can’t beat CAP — but a poorly
engineered one will be flailed by it, becoming pathologically
unavailable or inconsistent
• Zebras are the difference
Zebras?
• In American medical slang, a zebra is a rare and exotic
condition that can be conflated with more common ailments
• Medical students and residents are cautioned against
diagnosing them, to the point of aphorism: “when you hear
hoofbeats, think of horses not zebras”
• But — as anyone who has been afflicted by one will affirm —
zebras emphatically exist!
Zebras in the data path?
• Even though the data path runs on and ends with hardware, it
consists of many disjoint and unseen software components
• The paradox of software (especially that of the data path!) is
that software is both information and machine
• When software works correctly, it survives as information does:
namely, in perpetuity
• Especially where software is expensive to write and difficult to
fix, there is an overwhelming bias towards extant software
• Over time, the horses are found; only the zebras are left
Hunting zebra
• We must assume that unusual pathologies — especially in a
distributed system — will not be readily reproducible!
• When we are culturally afflicted with “bias for action”, it
becomes tempting to immediately change the system to fix it
• This is the wrong first motion: the choice between restoring
service and understanding the problem is often a false dichotomy!
• We must not change the system but rather observe it — we
must focus not on snap hypotheses, but rather initial questions
• The observability of the system is paramount!
Observability at Joyent
• Observability is an organizing principle at Joyent — it is a
primary reason that we run SmartOS, our illumos derivative
• Manta — our (open source, container-centric) object storage
service — has SmartOS and ZFS at its core
• Manta uses sharded PostgreSQL for metadata (+ ZooKeeper
for leader election), with services primarily in node.js
• We invested heavily in the observability and debuggability of
node.js — and it is a (the?) reason we still use node.js
Observability at Joyent (now Samsung!)
• Out of a desire to build their own cloud based on Manta and Triton
(our open source cloud management system), Samsung bought
Joyent in June 2016
• While Manta has been in production for several years,
Samsung’s level of scale has brought new-found challenges
• Good news: between several years of production + observability
(logging, DTrace, mdb) + hyperscale post-Samsung, we have
nailed many thorny problems in Manta
• Bad news: our stack — and that of every data path — has
components that we still struggle to observe and debug…
Zebra sanctuary
• Unfortunately, the data path is laced with proprietary software
that can’t be observed, audited, verified, or debugged
• This is the software that interacts so directly with the hardware
as to create the illusion of hardware to higher-level software
• This is firmware, and it runs so dark and deep in the data path
that much of it is impossible to see or catalogue
• Firmware that operates silently will also fail implicitly — it is
hardware failing with software’s failure modes
Zebras in the spindle
• Rotating magnetic media is a modern mechanical marvel
• With sealed enclosures and helium-based drives, densities
continue to increase — the disk will be with us for a long time!
• Disks are vulnerable to vibration, temperature, particulates,
asperities, wear, etc. — magnetic media will fail!
• But the disk knows this, and sophisticated on-head/on-controller
firmware steers around failed media…
• …leaving much nastier failure modes
Zebras in the spindle
• Disks can (emphatically!) read or write the wrong data
• Seeing this reality coming in the early 2000s, ZFS was designed
around total data path integrity via indirect checksums (sketched below)
• ZFS has discovered all manner of data corruption in storage
systems putatively too expensive to suffer such problems…
• And yet even ZFS oversimplified the failure modes of disks: in 15+
years of deploying ZFS, we have seen disks fail in much more
exotic ways than we thought possible
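To make the idea of indirect checksums concrete, below is a minimal sketch in C. It is illustrative only (a toy Fletcher-style checksum, invented names, an in-memory "disk"), not the actual ZFS implementation: a block's checksum lives in the block pointer that references it, so a disk that returns the wrong data is caught on read rather than trusted.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLKSZ 4096

/*
 * A "block pointer" carries the checksum of the block it references,
 * so the data is verified against an independently stored checksum --
 * the essence of indirect (parental) checksums.
 */
typedef struct blkptr {
    uint64_t bp_lba;    /* where the block lives on "disk" */
    uint64_t bp_cksum;  /* checksum of the block's contents */
} blkptr_t;

/* Fletcher-style checksum, standing in for ZFS's fletcher4/sha256. */
static uint64_t cksum(const uint8_t *buf, size_t len)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a += buf[i];
        b += a;
    }
    return ((b << 32) | (a & 0xffffffff));
}

static uint8_t disk[16][BLKSZ];     /* toy in-memory "disk" */

static void write_block(blkptr_t *bp, uint64_t lba, const uint8_t *buf)
{
    memcpy(disk[lba], buf, BLKSZ);
    bp->bp_lba = lba;
    bp->bp_cksum = cksum(buf, BLKSZ);  /* stored in the parent, not the block */
}

static int read_block(const blkptr_t *bp, uint8_t *buf)
{
    memcpy(buf, disk[bp->bp_lba], BLKSZ);
    if (cksum(buf, BLKSZ) != bp->bp_cksum)
        return (-1);    /* the disk returned the wrong data: caught here */
    return (0);
}

int main(void)
{
    uint8_t buf[BLKSZ] = "important data";
    blkptr_t bp;

    write_block(&bp, 3, buf);
    disk[3][0] ^= 0x01;     /* simulate corruption or a misdirected write */

    if (read_block(&bp, buf) != 0)
        printf("checksum mismatch: corruption detected on read\n");
    return (0);
}

Because the checksum is stored apart from the data it covers, corruption anywhere in the path (the media, the firmware, a misdirected write) shows up as a mismatch at read time.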
Zebras in the SSD
• Flash wears out so frequently and quickly that much of an SSD
is managing wear and mapping operations to functional flash (see
the sketch below)
• There are entire universes of system software in every SSD!
• SSDs have incredible variety in their operating envelopes —
and can accordingly fail in wildly divergent ways
• This can represent systemic risk in that many SSDs can fail in
the same way at the same time…
• Confession: We’ve been so concerned about a flashtastrophe
that we have always grossly over-engineered our own SSDs
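As a rough illustration of the bookkeeping an SSD must do, below is a toy flash translation layer in C. It is a sketch under simplifying assumptions (per-page wear counters, no garbage collection, invented names), not any vendor's firmware: flash can't be overwritten in place, so every write lands on a fresh, least-worn physical page and the logical-to-physical map is updated.

#include <stdio.h>
#include <stdint.h>

#define NLOGICAL   128
#define NPHYSICAL  160   /* overprovisioned: more physical pages than logical */

/*
 * Toy FTL state: each write of a logical page is redirected to a fresh
 * physical page, chosen by wear.
 */
static int      l2p[NLOGICAL];          /* logical page -> physical page */
static uint32_t erase_count[NPHYSICAL]; /* wear per physical page */
static int      in_use[NPHYSICAL];

/* Wear leveling: pick the free physical page with the least wear. */
static int alloc_physical(void)
{
    int best = -1;
    for (int p = 0; p < NPHYSICAL; p++) {
        if (!in_use[p] &&
            (best == -1 || erase_count[p] < erase_count[best]))
            best = p;
    }
    return (best);
}

static void ftl_write(int lpage)
{
    int newp = alloc_physical();
    int oldp = l2p[lpage];

    if (oldp != -1) {
        in_use[oldp] = 0;       /* the old copy becomes garbage... */
        erase_count[oldp]++;    /* ...and is (eventually) erased */
    }
    in_use[newp] = 1;
    l2p[lpage] = newp;          /* remap the logical page */
}

int main(void)
{
    for (int l = 0; l < NLOGICAL; l++)
        l2p[l] = -1;

    /* Hammer one logical page; the writes spread across the media. */
    for (int i = 0; i < 1000; i++)
        ftl_write(7);

    printf("logical page 7 now lives at physical page %d\n", l2p[7]);
    return (0);
}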
Zebras in the HBA
• The host bus adapter is responsible for brokering I/O from the
operating system to the physical devices
• This is more complicated than it might seem — and in particular,
HBA firmware is infamous for losing I/O under load
• From the perspective of system software this will be an I/O that
never returns — which means it will be timed out and retried
• While the system will maintain liveness, this will induce a
latency outlier — which can manifest itself far up the stack (e.g.,
TCP resets!)
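A toy simulation of that failure mode (the constants are illustrative, not measured): a command the HBA silently drops still "succeeds" after a timeout and retry, but the retried I/O becomes a latency outlier that dwarfs the mean, and that tail is what upstack software experiences.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative constants only. */
#define SVC_TIME_MS     1       /* a healthy I/O completes in ~1ms */
#define CMD_TIMEOUT_MS  30000   /* a lost command costs a full timeout */
#define LOSS_RATE       100000  /* say the HBA drops ~1 in 100,000 commands */

/* Simulate one I/O: usually fast, occasionally swallowed by the HBA. */
static long issue_io(void)
{
    long latency = 0;

    while (rand() % LOSS_RATE == 0)
        latency += CMD_TIMEOUT_MS;  /* never returns: time out, then retry */
    return (latency + SVC_TIME_MS);
}

int main(void)
{
    const int n = 1000000;
    long worst = 0, total = 0;

    for (int i = 0; i < n; i++) {
        long l = issue_io();
        total += l;
        if (l > worst)
            worst = l;
    }

    /*
     * Nothing ever fails outright, but the tail dwarfs the mean -- and
     * that outlier is what upstack software (and its TCP connections)
     * actually experiences.
     */
    printf("mean latency: %.3f ms, worst: %ld ms\n", (double)total / n, worst);
    return (0);
}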
Zebras in the DIMM
• DRAM stores each bit in a capacitor that must be periodically refreshed
• DRAM is susceptible to fatal failures (e.g., corrosion due to
humidity, temperature or other environmental factors)
• As the speed and density of DRAM have increased (and the
voltage has dropped), DRAM has become more susceptible to
transient bit failure not due to any hardware malfunction
• The “Firmware First” (!) model of error handling in x86 (and the
demise of CMCI) is leading to a silent epidemic of DIMM failure!
Zebras in the chassis
• Even the chassis itself is not immune from software failure
• For example, software and firmware control fan speed — and
failures in that software can result in fans stuck running at their
highest speed
• Fans are not designed to run at full power for extended periods
of time; they wear out or (worse) induce vibration in the chassis
• The effects of (say) vibration will be felt far from the source —
and again, may manifest only as latency, not as explicit failure
Zebras in the NIC
• Failure in the network interface card can be due to NIC firmware
failure or hardware failure (e.g., the optical transceiver)
• Networking failure should be entirely survivable by a distributed
system, but that doesn’t mean it’s without consequence!
• Use of the link aggregation control protocol (LACP) seems
tempting — but can require more sophisticated software in the
switch (i.e., MLAG)…
• …which itself can lead to new failure modes!
Zebras in the top-of-rack switch
• As complicated ecosystems of software and firmware in their own
right, top-of-rack switches are prone to software failure
• Failure in the top-of-rack (or worse, the L3 core) can have an
enormous blast radius in a distributed system…
• For example, a switch that drops its ARP tables can result in a
distributed system going massively split brain…
• Or a switch that gets stuck broadcasting traffic can easily DDOS
an entire distributed system — revealing that there is a single
point of failure after all!
Zebras all the way up
• These problems do not manifest themselves cleanly at the point
of origin for reasons both pragmatic and economic
• Hardware vendors don’t want gear shipped back for RCCA!
• Arguably, unreliable components allow (force?) upstack
software to discover its novel failure modes
• But that is an argument for debugging and resolving those
(additional) problems upstack, not for unreliable components!
Don’t fear the zebra
• The data path is not to be undertaken lightly
• Do not assume that testing and monitoring can substitute for
system understanding; enshrine observability
• Reward complete understanding, not merely resolution!
• As long as it’s unobservable, firmware is the enemy — and
trends toward sophisticated firmware are especially troubling!
• Open source software affords us a quality ratchet: we
shouldn’t spend our careers re-solving the same problems!
Further reading and viewing
• For an enlightening (and more positive) take on firmware, check
out the amazing videos of Micah Elizabeth Scott (@scanlime)
• For a snapshot of what we’re currently working on and thinking
about with respect to Manta/Triton, see the Joyent Requests for
Discussion (RFDs) — especially RFD 89 (“Project Tiresias”)
• For more on node.js debuggability, see Dave Pacheco’s talk on
“Industrial-grade node.js”
• Also, thank you to Amanda Lundberg of White Coat Captioning
for the superhuman real-time captioning!