Background
• WAS:
  – A key participant on the design team (10/2009-04/2010)
  – A vocal proponent of the final consensus architecture
• WAS NOT:
  – Part of the actual implementation
  – Involved in the FB chat backend
• IS NOT: a Facebook employee (since 08/2011)
Problem – Cont.
• Must have: cross-continent copies
  – Ideally concurrent updates across regions
  – At least disaster-recoverable
• No single points of failure for the entire service
  – FB has downtime on a few MySQL databases per day
  – No one cares
• Cannot, cannot, cannot lose data
Solved Problem
• Attachment store: Haystack
  – Stores FB photos
  – Optimized for immutable data
• Hiring the best programmers available
  – Choose the best design, not the best implementation
  – But get things done fast
Write Throughput
• Disk:
  – Need a log-structured container (sketch below)
  – Can store small messages inline
  – Can store the keyword index as well
  – What about read performance?
• Flash/memory:
  – Expensive
  – Only metadata
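A minimal sketch of the append-only write path such a log-structured container implies; the class and record framing are illustrative, not the actual Messages store:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative sketch: an append-only log that stores small messages inline.
// All writes are sequential, which is what buys the write throughput.
public class AppendOnlyLog {
    private final DataOutputStream out;

    public AppendOnlyLog(String path) throws IOException {
        // Open in append mode: the file is only ever written at its tail.
        this.out = new DataOutputStream(new FileOutputStream(path, true));
    }

    // Append one length-prefixed record (a small message stored inline).
    public void append(byte[] message) throws IOException {
        out.writeInt(message.length);
        out.write(message);
        out.flush(); // real systems batch and fsync per a group-commit policy
    }
}
```

The read-performance question in the last disk bullet is exactly what the next slide answers.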
LSM Trees
• High write throughput (sketch below)
• Recent data clustered
  – Nice! Fits a mailbox access pattern
• Inherently snapshotted
  – Backups/DR should be easy
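A toy illustration of the LSM idea, assuming nothing about HBase's actual implementation: writes land in a sorted in-memory buffer that is periodically frozen into an immutable segment, so the newest segments hold the most recent data, and immutable segments snapshot trivially.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy LSM tree: a sorted memtable absorbs writes; when full it is frozen
// into an immutable segment. Newest segments come first, so recent data
// (the mailbox working set) clusters together and is found quickly.
public class TinyLsm {
    private static final int FLUSH_THRESHOLD = 1024;
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<TreeMap<String, String>> segments = new ArrayList<>();

    public void put(String key, String value) {
        memtable.put(key, value);          // in-memory sort, no random disk I/O
        if (memtable.size() >= FLUSH_THRESHOLD) {
            segments.add(0, memtable);     // freeze as the newest immutable segment
            memtable = new TreeMap<>();    // immutable segments back up easily
        }
    }

    public String get(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (TreeMap<String, String> seg : segments) { // newest first
            v = seg.get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```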
Reads?
• Write-optimized => read penalty
• Cache the working set in the app server (sketch below)
  – At most one app server per user
  – All mailbox updates go via the application server
  – Serve reads directly from the cache
• Cold start
  – LSM tree clustering should make retrieving recent messages/threads fast
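A sketch of why that cache can be authoritative, with an invented `Store` facade standing in for the HBase-backed mailbox: since at most one app server owns a user and every update flows through it, reads never need to consult the storage tier on the hot path.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative write-through per-user mailbox cache in the app server.
public class MailboxCache {
    interface Store {                         // stand-in for the HBase-backed store
        List<String> loadRecent(long userId); // must return a mutable list
        void append(long userId, String msg); // durable write into the LSM store
    }

    private final Map<Long, List<String>> cache = new ConcurrentHashMap<>();
    private final Store store;

    public MailboxCache(Store store) { this.store = store; }

    public void addMessage(long userId, String msg) {
        store.append(userId, msg);            // durability first: cannot lose data
        cache.computeIfAbsent(userId, store::loadRecent).add(msg);
    }

    public List<String> read(long userId) {
        // Usually a pure memory hit; a cold start falls back to the store,
        // where LSM clustering makes fetching the recent thread cheap.
        return cache.computeIfAbsent(userId, store::loadRecent);
    }
}
```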
SPOF?
• A single HBase/HDFS cluster?
• NO!
  – Lots of 100-node clusters
  – HDFS NameNode HA
Cassandra vs. HBase (abridged)
• Tested them out (c. 2010)
  – HBase held up; (FB-internal) Cassandra didn't
• Tried to understand the internals
  – HBase held up; Cassandra didn't
• Really, really trusted HDFS
  – It had stored PBs of data for years with no loss
• Missing features in HBase/HDFS can be added
Disaster Recovery (HBase)
1. Ship the HLog to the remote data center in real time
2. Every day, update the remote snapshot
3. Reset the remote HLog
• No need to synchronize #2 and #3 perfectly
  – HLog replay is idempotent (sketch below)
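Why the imperfect synchronization is safe, as a toy model (the entry and cell types are invented for illustration): a log entry carries a timestamp, so applying it a second time during replay is a no-op, and the snapshot/log-reset race cannot corrupt state.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of idempotent HLog replay: an entry only takes effect if it is
// newer than what the store already holds, so replaying entries that the
// daily snapshot already covers changes nothing.
public class IdempotentReplay {
    record Entry(String key, long timestamp, String value) {}
    record Cell(long timestamp, String value) {}

    private final Map<String, Cell> store = new HashMap<>();

    public void apply(Entry e) {
        Cell cur = store.get(e.key());
        if (cur == null || e.timestamp() > cur.timestamp()) {
            store.put(e.key(), new Cell(e.timestamp(), e.value()));
        } // already applied (same or older timestamp): no-op
    }
}
```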
What about Flash?
• In HBase:
  – Store recent LSM tree segments in flash
  – Store the HBase block cache
  – Inefficient in Cassandra! (3x the LSM trees/cache)
• In the app server:
  – Page the user cache in/out of flash (sketch below)
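One way the app-server side could look, sketched with a file directory standing in for the flash device (sizes and paths are made up): entries evicted from RAM spill to flash instead of being dropped, and a later read pages them back in without touching HBase.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative two-tier user cache: hot entries in RAM, evictions paged
// out to flash, flash hits paged back in.
public class FlashBackedCache {
    private static final int RAM_ENTRIES = 10_000;
    private final Path flashDir;
    private final LinkedHashMap<Long, String> ram =
        new LinkedHashMap<Long, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
                if (size() <= RAM_ENTRIES) return false;
                spill(eldest.getKey(), eldest.getValue()); // page out to flash
                return true;
            }
        };

    public FlashBackedCache(Path flashDir) { this.flashDir = flashDir; }

    private void spill(long userId, String mailbox) {
        try {
            Files.writeString(flashDir.resolve(Long.toString(userId)), mailbox);
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    public String get(long userId) throws IOException {
        String v = ram.get(userId);
        if (v != null) return v;                  // RAM hit
        Path p = flashDir.resolve(Long.toString(userId));
        if (Files.exists(p)) {                    // flash hit: page back in
            v = Files.readString(p, StandardCharsets.UTF_8);
            ram.put(userId, v);
        }
        return v;                                 // null => go to HBase
    }
}
```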
Lingering Doubts
• Small components vs. big systems
  – Small components are better
  – Is HDFS too big?
    • Separate the DataNode, BlockManager, and NameNode
    • HBase doesn't need the NameNode
• Gave up on cross-DC concurrency
  – Partition users if required
  – A global user->DC registry needs to deal with partitions and conflict resolution
  – TBD
Cassandra: Flat Earth
• The world is hierarchical
  – PCI bus, rack, data center, region, continent, ...
  – The odds of partitioning differ at each level
vs.
• A symmetric hash ring spanning continents
  – The odds of partitioning are treated as constant
Cassandra – No Centralization
• The world has central (but HA) tiers:
  – DNS servers, core switches, the memcache tier, ...
• Cassandra: all servers are independent
  – No authoritative commit log or snapshot
  – "Do Repeat Your Reads" (DRYR) paradigm (sketch below)
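What DRYR means operationally, sketched with invented Replica/Versioned types: with no authoritative log, a read worth trusting has to be repeated across replicas, the newest timestamped version wins, and stale replicas are repaired on the way out.

```java
import java.util.List;

// Illustrative "repeat your reads" path: query every replica, keep the
// newest version, and write it back to the stragglers (read repair).
public class DryrRead {
    record Versioned(long timestamp, String value) {}

    interface Replica {
        Versioned get(String key);            // may be stale or null
        void put(String key, Versioned v);
    }

    public static Versioned read(String key, List<Replica> replicas) {
        Versioned newest = null;
        for (Replica r : replicas) {          // the "repeat" in DRYR
            Versioned v = r.get(key);
            if (v != null && (newest == null || v.timestamp() > newest.timestamp())) {
                newest = v;
            }
        }
        if (newest != null) {
            for (Replica r : replicas) {
                r.put(key, newest);           // read repair: converge the replicas
            }
        }
        return newest;
    }
}
```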
Philosophies have Consequences
• Consistent reads are expensive (worked out below)
  – N=3, R=2, W=2
  – Ugh: why are reads expensive in a write-optimized system?
• Is consistency foolproof?
  – Edge cases with failed writes
  – The internet is still debating
  – If science has bugs, then imagine code!
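The arithmetic behind that complaint is the standard quorum-intersection condition:

```latex
R + W > N \;\Rightarrow\; \text{every read quorum overlaps every write quorum}
\qquad
N = 3,\ R = 2,\ W = 2:\quad R + W = 4 > 3
```

So every consistent read must contact two of the three replicas, and the system pays a per-read cost even though its storage engine is optimized for writes.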
Distributed Storage vs. Database
• How do you recover a failed block or disk?
• Distributed storage (HDFS):
  – Simple: find the other replicas of that block
• Distributed database (Cassandra):
  – A ton of my databases lived on that drive
  – Hard: let's merge all the affected databases
Eventual Consistency
• The read-modify-write pattern is problematic (sketch below):
  1. Read the value
  2. Apply business logic
  3. Write the value
  A stale read leads to junk
• What about atomic increments?
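The failure mode in code, with an invented key-value facade standing in for an eventually consistent store: if the read in step 1 returns a stale value, the business logic operates on it and the write silently destroys someone else's update.

```java
// Illustrative lost-update bug under eventual consistency.
public class StaleReadModifyWrite {
    interface KV {                        // stand-in for an eventually consistent store
        String read(String key);          // may return a stale value
        void write(String key, String value);
    }

    // "Business logic": append a message id to a user's thread index.
    static void addToThread(KV store, String threadKey, String msgId) {
        String current = store.read(threadKey);   // 1. read (possibly stale!)
        String updated = (current == null)        // 2. apply business logic
                ? msgId
                : current + "," + msgId;
        store.write(threadKey, updated);          // 3. write back
        // If step 1 was stale, 'updated' silently drops message ids that
        // another writer already appended: junk in the mailbox index.
    }
}
```

Atomic increments dodge this only because the update is a commutative delta, which is where the next slide picks up.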
Conflict Resolution
• Easy to resolve conflicts in increments (sketch below)
• Imagine multi-row transactions
  – Pointless to resolve conflicts at the row level
• Solve conflicts at the highest possible layer
  – The transaction monitor
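Why increments merge cleanly, shown with a grow-only counter in the CRDT style (my illustration, not something from the talk): each replica tracks only its own contribution, so two divergent copies reconcile deterministically. No comparable trick exists for merging the rows of a half-applied multi-row transaction, which is why conflicts belong at the transaction-monitor layer.

```java
import java.util.HashMap;
import java.util.Map;

// Grow-only counter: increments conflict-resolve by construction.
public class MergeableCounter {
    private final Map<String, Long> perReplica = new HashMap<>();

    public void increment(String replicaId, long delta) {
        perReplica.merge(replicaId, delta, Long::sum);  // my own contribution only
    }

    public long value() {
        return perReplica.values().stream().mapToLong(Long::longValue).sum();
    }

    // Merging divergent copies: per replica, the larger count subsumes the
    // smaller, so the result is the same regardless of merge order.
    public void mergeFrom(MergeableCounter other) {
        other.perReplica.forEach((id, v) -> perReplica.merge(id, v, Math::max));
    }
}
```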
How did it work out?
• A ton of missing HBase/HDFS features were added:
  – Bloom filters, NameNode HA
  – Remote HLog shipping
  – Modified block placement policy
  – Sticky regions
  – Improved block cache
  – ...
• User -> AppServer mapping via ZooKeeper (sketch below)
• The app server worked out
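One plausible shape for that ZooKeeper mapping, using only the stock ZooKeeper client API; the path layout and shard scheme are my invention, not the production design. Ephemeral znodes make the "at most one app server per user" claim self-cleaning: if a server dies, its claims vanish with its session.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative user-shard -> app-server directory on top of ZooKeeper.
public class AppServerDirectory {
    private final ZooKeeper zk;

    public AppServerDirectory(ZooKeeper zk) { this.zk = zk; }

    // App-server startup: claim ownership of a user shard. The ephemeral
    // node disappears automatically if this server's session dies.
    public void claimShard(int shardId, String serverAddress)
            throws KeeperException, InterruptedException {
        zk.create("/appservers/shard-" + shardId,
                  serverAddress.getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
    }

    // Routing tier: find the single app server that owns this user.
    public String lookup(long userId, int numShards)
            throws KeeperException, InterruptedException {
        long shard = userId % numShards;
        byte[] data = zk.getData("/appservers/shard-" + shard, false, null);
        return new String(data, StandardCharsets.UTF_8);
    }
}
```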