
Messaging architecture @FB (Fifth Elephant Conference)


These are based on my participation in FB Messaging design in 2010. Software evolves fast - this may not reflect the state of the world today.



  1. Messaging Architecture @FB
     Joydeep Sen Sarma
  2. Background
     • WAS:
       – A key participant on the design team (10/2009-04/2010)
       – A vocal proponent of the final consensus architecture
     • WAS NOT:
       – Part of the actual implementation
       – Involved in the FB chat backend
     • IS NOT: a Facebook employee since 08/2011
  3. Problem
     • 1 billion users
     • Volume:
       – 25 messages/day × 4 KB (excluding attachments)
       – ~10 TB per day
     • Indexes/summaries:
       – Keyword/thread-id/label index
       – Label/thread message counts
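As a back-of-the-envelope check on those numbers (the active-sender fraction below is my assumption; the slides state only the totals):

```python
# Message-volume estimate from the slide's figures.
# The active-user fraction is an assumption, not stated on the slide;
# 1B users * 25 msgs * 4 KB would be ~100 TB/day, so the quoted 10 TB/day
# implies roughly a tenth of accounts sending messages on a given day.
users = 1_000_000_000            # 1 billion accounts
active_fraction = 0.1            # assumed: ~10% send messages on a given day
msgs_per_user_per_day = 25
msg_size = 4 * 1024              # 4 KB, attachments excluded

bytes_per_day = int(users * active_fraction) * msgs_per_user_per_day * msg_size
tb_per_day = bytes_per_day / 1e12
print(f"{tb_per_day:.1f} TB/day")   # ~10.2 TB/day
```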
  4. Problem (cont.)
     • Must have: a cross-continent copy
       – Ideally concurrent updates across regions
       – At a minimum, disaster-recoverable
     • No single point of failure for the entire service
       – FB has downtime on a few MySQL databases per day
       – No one cares
     • Cannot, cannot, cannot lose data
  5. Solved Problems
     • Attachment store: Haystack
       – Stores FB photos
       – Optimized for immutable data
     • Hiring the best programmers available
       – Choose the best design, not the best current implementation
       – But get things done fast
  6. Write Throughput
     • Disk:
       – Needs a log-structured container
       – Can store small messages inline
       – Can store the keyword index as well
       – What about read performance?
     • Flash/memory:
       – Expensive
       – Metadata only
  7. LSM Trees
     • High write throughput
     • Recent data is clustered
       – Nice! Fits a mailbox access pattern
     • Inherently snapshotted
       – Backups/DR should be easy
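A minimal sketch of the LSM idea the slides rely on: writes land in an in-memory buffer and are flushed as immutable sorted segments, so recent data sits in the newest, cheapest-to-search places. Names and thresholds are illustrative, not HBase's:

```python
import bisect

class TinyLSM:
    """Minimal LSM-tree sketch: a mutable in-memory buffer plus immutable
    sorted segments, newest first. Illustrative only: no WAL, no compaction."""

    def __init__(self, flush_threshold=4):
        self.memtable = {}          # mutable in-memory buffer
        self.segments = []          # immutable sorted segments, newest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Flush: write one new immutable sorted segment
            # (sequential I/O on disk -> high write throughput).
            self.segments.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:    # the most recent writes are cheapest
            return self.memtable[key]
        for seg in self.segments:   # then scan segments newest-first
            keys = [k for k, _ in seg]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return seg[i][1]
        return None
```

Because lookups check the newest data first, a mailbox workload (mostly recent messages) rarely has to touch old segments, which is the access-pattern fit the slide points out.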
  8. Reads?
     • Write-optimized => read penalty
     • Cache the working set in the app server
       – At most one app server per user
       – All mailbox updates go via the application server
       – Serve reads directly from the cache
     • Cold start:
       – LSM-tree clustering should make retrieving recent messages/threads fast
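A sketch of the per-user working-set cache described on this slide, assuming (as the slide does) that all updates for a user flow through the one app server that owns them; the class and method names are hypothetical:

```python
from collections import OrderedDict

class MailboxCache:
    """Per-user working-set cache sketch. Because every mailbox update is
    routed through the single app server that owns the user, the cache can
    be updated in place instead of being invalidated."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.mailboxes = OrderedDict()   # user_id -> recent messages (LRU order)

    def load(self, user_id, fetch_from_store):
        """Cold start: fetch recent messages from the store (e.g. the LSM tree)."""
        if user_id not in self.mailboxes:
            self.mailboxes[user_id] = fetch_from_store(user_id)
            if len(self.mailboxes) > self.capacity:
                self.mailboxes.popitem(last=False)   # evict least-recently used
        self.mailboxes.move_to_end(user_id)
        return self.mailboxes[user_id]

    def append(self, user_id, message, write_to_store):
        """Write path: persist first, then update the cache in place."""
        write_to_store(user_id, message)
        if user_id in self.mailboxes:
            self.mailboxes[user_id].append(message)
```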
  9. SPOF?
     • A single HBase/HDFS cluster?
     • No!
       – Lots of 100-node clusters
       – HDFS NameNode HA
  10. Cassandra vs. HBase (abridged)
     • Tested both (c. 2010)
       – HBase held up; (FB-internal) Cassandra didn't
     • Tried to understand the internals
       – HBase held up; Cassandra didn't
     • Really, really trusted HDFS
       – It had stored PBs of data for years with no loss
     • Missing features in HBase/HDFS can be added
  11. Disaster Recovery (HBase)
     1. Ship the HLog to a remote data center in real time
     2. Update the remote snapshot every day
     3. Reset the remote HLog
     • No need to synchronize #2 and #3 perfectly
       – HLog replay is idempotent
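Why imperfect synchronization of steps 2 and 3 is safe: if log entries carry sequence numbers and applying one is a plain overwrite, replaying an already-applied entry is a no-op. A toy illustration, not HBase's actual HLog format:

```python
def replay(hlog, table):
    """Replay log entries into a table snapshot. Each entry carries a
    sequence number, and applying one is a plain overwrite, so replaying
    the same log twice leaves the table in exactly the same state."""
    for seq, row, value in hlog:
        if row not in table or table[row][0] < seq:   # skip already-applied edits
            table[row] = (seq, value)
    return table

log = [(1, "u1", "hello"), (2, "u1", "hi"), (3, "u2", "yo")]
once = replay(log, {})
twice = replay(log, replay(log, {}))
assert once == twice   # replay is idempotent
```

So if the snapshot already contains some of the shipped edits, replaying the overlapping tail of the log does no harm.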
  12. Test!
     • Try to avoid writing a cache in Java
  13. What About Flash?
     • In HBase:
       – Store recent LSM-tree segments in flash
       – Store the HBase block cache in flash
       – Inefficient in Cassandra! (3× the LSM trees/cache)
     • In the app server:
       – Page the user cache in/out of flash
  14. Lingering Doubts
     • Small components vs. big systems
       – Small components are better
       – Is HDFS too big?
         • Separate the DataNode, BlockManager, and NameNode
         • HBase doesn't need the NameNode
     • Gave up on cross-DC concurrency
       – Partition users if required
       – A global user->DC registry needs to deal with partitions and conflict resolution – TBD
  15. Cassandra vs. HBase
  16. Cassandra: Flat Earth
     • The world is hierarchical
       – PCI bus, rack, data center, region, continent, …
       – The odds of partitioning differ at each level
     vs.
     • A symmetric hash ring spanning continents
       – The odds of partitioning are treated as constant
  17. Cassandra: No Centralization
     • The world has central (but HA) tiers:
       – DNS servers, core switches, the memcache tier, …
     • Cassandra: all servers are independent
       – No authoritative commit log or snapshot
       – The Do Repeat Your Reads (DRYR) paradigm
  18. Philosophies Have Consequences
     • Consistent reads are expensive
       – N=3, R=2, W=2
       – Ugh: why are reads expensive in a write-optimized system?
     • Is consistency foolproof?
       – Edge cases with failed writes
       – The Internet is still debating
       – If science has bugs, then imagine code!
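A toy quorum illustration of why N=3, R=2, W=2 makes reads expensive: every read must contact R replicas and take the newest version, because only the R + W > N overlap guarantees one of them saw the latest write. This is a sketch, not Cassandra's implementation:

```python
import itertools

N, R, W = 3, 2, 2        # replication factor, read quorum, write quorum

replicas = [{} for _ in range(N)]
clock = itertools.count(1)   # stand-in for write timestamps

def quorum_write(key, value):
    """Write to W replicas with a timestamp (here, simply the first W
    replicas are the ones that 'respond')."""
    ts = next(clock)
    for replica in replicas[:W]:
        replica[key] = (ts, value)

def quorum_read(key):
    """Read from R replicas and return the newest version. R + W > N
    guarantees the read set overlaps the last write set, but every read
    now costs R round trips: the read penalty the slide complains about."""
    versions = [r[key] for r in replicas[-R:] if key in r]
    return max(versions)[1] if versions else None
```

Here writes land on replicas {0, 1} and reads consult replicas {1, 2}; the overlap at replica 1 is what makes the read consistent.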
  19. Distributed Storage vs. Database
     • How do you recover a failed block or disk?
     • Distributed storage (HDFS):
       – Simple: find the other replicas of that block
     • Distributed database (Cassandra):
       – A ton of my databases lived on that drive
       – Hard: merge all of the affected databases
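The HDFS-side recovery the slide calls simple can be sketched as re-replication from surviving copies; the function and data-structure names are illustrative, not HDFS's:

```python
def recover_failed_node(block_map, failed_node, live_nodes, target_replication=3):
    """HDFS-style block recovery sketch. For each block that lost a replica
    on the failed node, copy it from any surviving replica to a live node
    until the target replication factor is restored.

    block_map: block_id -> set of nodes currently holding a replica."""
    for block, nodes in block_map.items():
        nodes.discard(failed_node)                 # drop the lost replica
        while len(nodes) < target_replication:
            source = next(iter(nodes))             # any surviving replica
            target = next(n for n in live_nodes if n not in nodes)
            # (a real system would stream block data source -> target here)
            nodes.add(target)
    return block_map
```

No per-record merging is needed: the block either has a surviving replica to copy, or it doesn't. That is the contrast with the database case, where one drive holds fragments of many logical databases.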
  20. Eventual Consistency
     • The read-modify-write pattern is problematic:
       1. Read the value
       2. Apply business logic
       3. Write the value
       A stale read leads to junk
     • What about atomic increments?
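The stale-read failure mode in concrete form, using a hypothetical unread-message counter:

```python
# Two clients run read-modify-write against an eventually consistent store.
# Both read the same stale value, so one increment is silently lost.
store = {"unread_count": 0}

def increment_via_rmw(snapshot):
    # 1. read (possibly stale)  2. apply business logic  3. write back
    return snapshot["unread_count"] + 1

a = increment_via_rmw(store)   # client A reads 0, computes 1
b = increment_via_rmw(store)   # client B also reads 0, computes 1
store["unread_count"] = a      # A writes 1
store["unread_count"] = b      # B overwrites with 1: A's update is lost
print(store["unread_count"])   # 1, not the expected 2
```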
  21. Conflict Resolution
     • Conflicts between increments are easy to resolve
     • Imagine multi-row transactions
       – Pointless to resolve conflicts at the row level
     • Resolve conflicts at the highest possible layer
       – The transaction monitor
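One reason increments are easy to reconcile: per-node counts merge commutatively (the G-counter idea). A sketch, not any particular system's implementation:

```python
def merge_counters(a, b):
    """Resolve a conflict between two replicas of a per-node increment
    counter: take the per-node maximum. Increments commute, so merging in
    any order gives the same answer. Arbitrary multi-row transactions have
    no such merge rule, which is why the slide pushes their conflict
    resolution up to a higher layer (a transaction monitor)."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in set(a) | set(b)}

replica1 = {"dc-east": 3, "dc-west": 1}
replica2 = {"dc-east": 2, "dc-west": 4}
merged = merge_counters(replica1, replica2)
total = sum(merged.values())    # 3 + 4 = 7
```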
  22. How Did It Work Out?
     • A ton of missing HBase/HDFS features were added:
       – Bloom filters, NameNode HA
       – Remote HLog shipping
       – A modified block-placement policy
       – Sticky regions
       – An improved block cache
       – …
     • User -> app server mapping via ZooKeeper
     • The app server worked out