4. Durability Defined
§ If written data is acknowledged, it must
be forever readable
§ If written data is read once [before it is
acknowledged], it must be forever
readable
5. Nothing is Forever
§ Hardware eventually fails
§ Software eventually (?) works
§ Durability is a matter of degree
§ What is good enough?
7. Performance is the Enemy
§ “The only good write is an O_SYNC write”
§ Write-behind, caching, background
compaction/migration can all lead to
hidden errors
§ fsync(2) can and should return errors, but
misses some (see
https://wiki.postgresql.org/wiki/Fsync_Errors)
§ PostgreSQL: Caring about durability since 1986
§ “commit intervals”?
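A minimal sketch of the write, fsync, check-the-error discipline this slide argues for, assuming a POSIX system. The helper name durable_write is hypothetical. It does not cover the cases where fsync itself fails to report an error.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path; return only after fsync has reported success.

    Hypothetical helper for illustration. Any OSError from write or fsync
    propagates to the caller, who must treat the data as NOT durable.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Flush file contents to stable storage. Do not retry on failure
        # and then assume success: the kernel may have dropped the error.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Also fsync the parent directory so the directory entry for a newly
    # created file is durable (POSIX assumption; not valid on Windows).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "record.bin")
    durable_write(target, b"acknowledge me only after this returns")
```

Only acknowledge the write to a client after durable_write returns without raising.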
8. Can’t trust a File System
“We analyze 11 applications, and find 60 vulnerabilities,
some of which result in severe consequences like
corruption or data loss.”
9. Can’t trust an SSD
‘Surprisingly, we find that 13 out of the 15 devices,
including the supposedly “enterprise-class” devices,
exhibit failure behavior contrary to our expectations’
10. Servers and Mayflies
§ Back in the day, when “the” computer
crashed, you just waited for repair
§ Now you remove or re-image the server –
with the drives
§ Local durability is really hard,
but no longer adequate
11. Replication
§ Backups? Not timely
§ Synchronous mirroring? Very expensive
§ Just use the network! Make copies! Go
forth and replicate!
§ Losing a disk or server no longer causes
lost data. Right? Who needs fsync?
12. Correlated Failures
§ AWS can lose a data center; so can you
§ Rack power problems are common
§ The smaller your cluster, the more
vulnerable it is
https://xkcd.com/1737
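A rough way to put a number on the cluster-size point, under a deliberately simple model that is an assumption and not from the talk: each replica picks its rack independently and uniformly at random. The function name is hypothetical.

```python
def prob_all_replicas_same_rack(racks: int, replicas: int) -> float:
    """Probability that every replica of one item lands in a single rack.

    Assumed model: each replica chooses a rack independently and uniformly.
    The first replica can go anywhere; each of the remaining replicas must
    match it, with probability 1/racks each.
    """
    if racks < 1 or replicas < 1:
        raise ValueError("racks and replicas must be positive")
    return (1.0 / racks) ** (replicas - 1)


if __name__ == "__main__":
    for racks in (2, 5, 20):
        p = prob_all_replicas_same_rack(racks, replicas=3)
        print(f"{racks} racks, 3 replicas: {p:.4f}")
```

Under this model, a single rack power failure can take out all three copies of an item far more often in a 2-rack cluster than in a 20-rack one. Rack-aware placement removes that risk, but it needs at least as many racks as replicas.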
14. CAP Theorem
§ You will have Partitioning.
§ You must choose between Availability
and Consistency.
§ Your users will hate your choice.
§ Availability can be improved with brute
force and $$$, by reducing partitioning.
§ Consistency requires consensus.
16. Jepsen breaks everything
“Use Zookeeper. It’s mature, well-designed, and battle-tested.”
“The etcd and Consul teams both take consistency seriously…”
Kyle Kingsbury, https://jepsen.io
17. Logs & Journals
§ Application first writes to log, then to
where the data “really lives”
§ FS writes to journal, then to where the
data “really lives”
§ Device writes to log, then to where the
data “really lives”
§ What if “the truth” “really lived” in the log?
§ The other places become read caches
18. Table and Stream Duality
§ “A table is just a cache of the latest value
for each key in a stream” – P. Helland
§ Logs are great for streaming data
§ What if the log itself is distributed and
allows many writers and readers?
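The log-as-truth idea from the last two slides can be sketched as follows. This is an in-memory toy with hypothetical class names (Log, TableCache). It is not durable or distributed and only shows the relationship between the log and the table.

```python
from typing import Dict, List, Optional, Tuple


class Log:
    """Append-only sequence of (key, value) records: the source of truth."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, str]] = []

    def append(self, key: str, value: str) -> int:
        """Append a record and return its offset in the log."""
        self._entries.append((key, value))
        return len(self._entries) - 1

    def read_from(self, offset: int) -> List[Tuple[str, str]]:
        """Return all records at or after the given offset."""
        return self._entries[offset:]


class TableCache:
    """A table holding the latest value per key, rebuilt by replaying the log."""

    def __init__(self, log: Log) -> None:
        self._log = log
        self._table: Dict[str, str] = {}
        self._next_offset = 0  # first log offset not yet applied

    def catch_up(self) -> None:
        """Apply any log records that have not been applied yet."""
        for key, value in self._log.read_from(self._next_offset):
            self._table[key] = value  # later records overwrite earlier ones
            self._next_offset += 1

    def get(self, key: str) -> Optional[str]:
        self.catch_up()
        return self._table.get(key)


if __name__ == "__main__":
    log = Log()
    log.append("a", "1")
    log.append("b", "2")
    log.append("a", "3")
    print(TableCache(log).get("a"))
```

Because the table is only a cache, it can be thrown away and rebuilt from the log at any time. Several independent readers can each keep their own table over the same log.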
20. Apache BookKeeper™
§ “A scalable, fault-tolerant, and
low-latency storage service optimized for
real-time workloads”
§ Guarantees:
§ “If an entry has been acknowledged, it must
be readable”
§ “If an entry has been read once, it must
always be readable”
21. BookKeeper Components
§ Client-side library
§ Distributed Ledger Abstraction
§ “Bookie” – very simple storage nodes
§ Bookies do NOT talk to each other
§ ZooKeeper for coordination, consensus,
cluster membership, and quorums
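Since bookies do not talk to each other, replication has to be driven from the client side. A toy sketch of that shape, with hypothetical classes (Bookie, LedgerWriter) that are not the real BookKeeper API. It assumes an entry is acknowledged once a configured number of storage nodes have accepted it, and it leaves out everything ZooKeeper handles (membership, metadata, fencing).

```python
from typing import Dict, List


class Bookie:
    """Hypothetical storage node: stores entries, knows nothing about peers."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.entries: Dict[int, bytes] = {}

    def add_entry(self, entry_id: int, data: bytes) -> bool:
        """Store an entry; return True only if it was accepted."""
        if not self.healthy:
            return False
        self.entries[entry_id] = data
        return True


class LedgerWriter:
    """Hypothetical client-side writer that fans each entry out to all nodes."""

    def __init__(self, bookies: List[Bookie], ack_quorum: int) -> None:
        if not 1 <= ack_quorum <= len(bookies):
            raise ValueError("ack_quorum must be between 1 and len(bookies)")
        self.bookies = bookies
        self.ack_quorum = ack_quorum
        self._next_id = 0

    def append(self, data: bytes) -> int:
        """Send the entry to every node; succeed only with enough acks."""
        entry_id = self._next_id
        acks = sum(1 for b in self.bookies if b.add_entry(entry_id, data))
        if acks < self.ack_quorum:
            # Not acknowledged: the caller must not treat the entry as durable.
            raise OSError(f"entry {entry_id}: {acks} acks, need {self.ack_quorum}")
        self._next_id += 1
        return entry_id


if __name__ == "__main__":
    nodes = [Bookie("b1"), Bookie("b2"), Bookie("b3", healthy=False)]
    writer = LedgerWriter(nodes, ack_quorum=2)
    print(writer.append(b"first entry"))
```

The storage nodes stay very simple because all replication logic lives in the client library. The hard parts, agreeing on which nodes hold a ledger and who may write to it, are what the coordination service is for.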
23. Planet Java
§ ZooKeeper and BookKeeper are both
from Planet Java
§ How about something more friendly to
Planet Linux?
§ Use etcd, and rewrite BookKeeper the way
ScyllaDB rewrote Cassandra?
24. Take-aways
§ Durability is Hard
§ Distributed Durability is Very Hard
§ Be Up-Front about your durability model
§ Logs as Truth & Streaming are the future
§ Apache BookKeeper is awesome
§ Don’t re-invent the wheel!
25. Q & A
Software Composable Infrastructure
for modern workloads
and commodity hardware.