Messaging architecture @FB (Fifth Elephant Conference)

These are based on my participation in FB Messaging design in 2010. Software evolves fast - this may not reflect the state of the world today.

Transcript of "Messaging architecture @FB (Fifth Elephant Conference)"

  1. Messaging Architecture @FB - Joydeep Sen Sarma
  2. Background
     • WAS:
       – a key participant on the design team (10/2009-04/2010)
       – a vocal proponent of the final consensus architecture
     • WAS NOT:
       – part of the actual implementation
       – involved in the FB chat backend
     • IS NOT: a Facebook employee (since 08/2011)
  3. Problem
     • 1 Billion Users
     • Volume:
       – 25 messages/day * 4 KB (excluding attachments)
       – 10 TB per day
     • Indexes/Summaries:
       – Keyword/Thread-id/Label index
       – Label/Thread message counts
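     As a back-of-envelope check on the volume numbers (the count of users
     actually sending messages on a given day is my assumption; it is not on
     the slide):

       25 messages/day * 4 KB ~= 100 KB per sending user per day
       ~10^8 sending users * 100 KB/day ~= 10 TB per day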
  4. Problem – Cont.
     • Must have: cross-continent copy
       – Ideally concurrent updates across regions
       – At least disaster recoverable
     • No single point of failure for the entire service
       – FB has downtime on a few MySQL databases/day
       – No one cares
     • Cannot, cannot, cannot lose data
  5. Solved Problem
     • Attachment store
       – Haystack
       – Stores FB Photos
       – Optimized for immutable data
     • Hiring the best programmers available
       – Choose the best design, not the best implementation
       – But get things done fast
  6. Write Throughput
     • Disk:
       – Need a log-structured container
       – Can store small messages inline
       – Can store the keyword index as well
       – What about read performance?
     • Flash/Memory:
       – Expensive
       – Only for metadata
  7. LSM Trees
     • High write throughput
     • Recent data clustered
       – Nice! Fits a mailbox access pattern
     • Inherently snapshotted
       – Backups/DR should be easy
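     A minimal sketch of the LSM pattern the design leans on: writes land in a
     sorted in-memory buffer that is flushed as immutable sorted segments, so
     recent data stays clustered and old segments double as snapshots. This is
     an illustration only, not HBase's actual implementation.

       // Toy LSM tree: an in-memory sorted buffer (memtable) is flushed to an
       // immutable sorted segment once it passes a threshold. Reads check the
       // memtable first, then segments from newest to oldest.
       import java.util.ArrayDeque;
       import java.util.Deque;
       import java.util.TreeMap;

       public class ToyLsm {
           private static final int FLUSH_THRESHOLD = 4;

           private TreeMap<String, String> memtable = new TreeMap<>();
           private final Deque<TreeMap<String, String>> segments = new ArrayDeque<>();

           public void put(String key, String value) {
               memtable.put(key, value);            // sequential, write-optimized
               if (memtable.size() >= FLUSH_THRESHOLD) {
                   segments.addFirst(memtable);     // segment is immutable from here on
                   memtable = new TreeMap<>();
               }
           }

           public String get(String key) {
               String v = memtable.get(key);
               if (v != null) return v;
               for (TreeMap<String, String> seg : segments) {  // newest first
                   v = seg.get(key);
                   if (v != null) return v;         // the read penalty: many segments
               }
               return null;
           }
       }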
  8. Reads?
     • Write-optimized => read penalty
     • Cache the working set in the App Server
       – At most one App Server per user
       – All mailbox updates go via the App Server
       – Serve directly from cache
     • Cold start
       – LSM tree clustering should make retrieving recent messages/threads fast
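     A rough sketch of the cache contract implied above: because at most one
     App Server owns a user and every mailbox update flows through it, the
     in-process cache can serve reads directly and only cold users touch the
     backing store. MailboxStore and all names below are hypothetical, not the
     production interface.

       // Write-through per-user mailbox cache on the owning App Server.
       import java.util.ArrayList;
       import java.util.List;
       import java.util.Map;
       import java.util.concurrent.ConcurrentHashMap;

       public class MailboxCache {
           public interface MailboxStore {
               void append(String userId, String message);     // durable write (e.g. HBase)
               List<String> loadRecent(String userId, int n);  // cold-start read
           }

           private final MailboxStore store;
           private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

           public MailboxCache(MailboxStore store) {
               this.store = store;
           }

           public void deliver(String userId, String message) {
               store.append(userId, message);        // durable write first
               List<String> mailbox = cache.get(userId);
               if (mailbox != null) {                // keep warm caches authoritative;
                   mailbox.add(message);             // cold users reload on next read
               }
           }

           public List<String> recentMessages(String userId, int n) {
               // Cold start: LSM clustering keeps this backing read cheap for recent data.
               List<String> mailbox = cache.computeIfAbsent(
                       userId, u -> new ArrayList<>(store.loadRecent(u, n)));
               int from = Math.max(0, mailbox.size() - n);
               return new ArrayList<>(mailbox.subList(from, mailbox.size()));
           }
       }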
  9. SPOF?
     • A single HBase/HDFS cluster?
     • NO!
       – Lots of 100-node clusters
       – HDFS NameNode HA
  10. Cassandra vs. HBase (abridged)
      • Tested it out (c. 2010)
        – HBase held up, (FB-internal) Cassandra didn't
      • Tried to understand the internals
        – HBase held up, Cassandra didn't
      • Really, really trusted HDFS
        – Stored PBs of data for years with no loss
      • Missing features in HBase/HDFS can be added
  11. Disaster Recovery (HBase)
      1. Ship the HLog to the remote data center in real time
      2. Every day, update the remote snapshot
      3. Reset the remote HLog
      • No need to synchronize #2 and #3 perfectly
        – HLog replay is idempotent
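      A sketch of why steps 2 and 3 need no tight coordination: if every edit
      carries a sequence number, re-applying edits that the snapshot already
      covers is a no-op, so replay stays idempotent. Illustration only.

        // Idempotent log replay: edits at or below the highest sequence number
        // already applied for a row are silently skipped.
        import java.util.HashMap;
        import java.util.Map;

        public class ReplicaReplay {
            static final class Edit {
                final String row; final long seq; final String value;
                Edit(String row, long seq, String value) {
                    this.row = row; this.seq = seq; this.value = value;
                }
            }

            private final Map<String, Long> appliedSeq = new HashMap<>();
            private final Map<String, String> data = new HashMap<>();

            public void apply(Edit e) {
                long seen = appliedSeq.getOrDefault(e.row, -1L);
                if (e.seq <= seen) return;   // already covered by the snapshot or an earlier replay
                data.put(e.row, e.value);
                appliedSeq.put(e.row, e.seq);
            }
        }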
  12. Test!
      • Try to avoid writing a cache in Java
  13. What about Flash?
      • In HBase:
        – Store recent LSM tree segments in Flash
        – Store the HBase block cache
        – Inefficient in Cassandra! (3x LSM trees/cache)
      • In the App Server:
        – Page the user cache in/out of Flash
  14. Lingering Doubts
      • Small components vs. big systems
        – Small components are better
        – Is HDFS too big?
          • Separate DataNode, BlockManager, NameNode
          • HBase doesn't need the NameNode
      • Gave up on cross-DC concurrency
        – Partition users if required
        – A global user->DC registry needs to deal with partitions and conflict resolution – TBD
  15. Cassandra vs. HBase
  16. Cassandra: Flat Earth
      • The world is hierarchical
        – PCI Bus, Rack, Data Center, Region, Continent ...
        – Odds of partitioning differ
      • vs. Cassandra: a symmetric hash ring spanning continents
        – Odds of partitioning considered constant
  17. Cassandra – No Centralization
      • The world has central (but HA) tiers:
        – DNS servers, Core-Switches, Memcache-Tier, ...
      • Cassandra: all servers independent
        – No authoritative commit log or snapshot
        – Do Repeat Your Reads (DRYR) paradigm
  18. Philosophies have Consequences
      • Consistent reads are expensive
        – N=3, R=2, W=2
        – Ugh: why are reads expensive in a write-optimized system?
      • Is consistency foolproof?
        – Edge cases with failed writes
        – The Internet is still debating
        – If Science has bugs, then imagine Code!
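      The read cost falls straight out of the quorum arithmetic: a read quorum
      of R replicas is guaranteed to overlap the latest write quorum of W
      replicas only when

        R + W > N    (here 2 + 2 > 3 with N = 3 replicas)

      so every consistent read has to wait on at least two replicas, which is
      why reads end up expensive in a system that was meant to be
      write-optimized.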
  19. Distributed Storage vs. Database
      • How to recover a failed block or disk?
      • Distributed storage (HDFS):
        – Simple: find other replicas of that block
      • Distributed database (Cassandra):
        – A ton of my databases lived on that drive
        – Hard: let's merge all the affected databases
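      A sketch of the HDFS-style recovery path described above: block placement
      is tracked centrally, so when a node or disk dies each lost block is
      simply re-copied from any surviving replica to another node. Names below
      are illustrative, not HDFS's actual code.

        // For every block that lived on the failed node, pick a surviving
        // replica as the copy source and any live node not already holding
        // the block as the target.
        import java.util.List;
        import java.util.Map;

        public class BlockRecovery {
            // blockId -> nodes currently holding a replica of that block
            public static void recoverNode(String failedNode,
                                           Map<Long, List<String>> blockLocations,
                                           List<String> liveNodes) {
                for (Map.Entry<Long, List<String>> e : blockLocations.entrySet()) {
                    List<String> holders = e.getValue();
                    if (!holders.remove(failedNode)) continue;  // block was not on the failed node
                    if (holders.isEmpty()) continue;            // no surviving replica to copy from
                    String source = holders.get(0);
                    String target = liveNodes.stream()
                            .filter(n -> !holders.contains(n))
                            .findFirst()
                            .orElse(null);
                    if (target != null) {
                        // In real HDFS the NameNode schedules this copy; here we just record it.
                        System.out.println("re-replicate block " + e.getKey()
                                + ": " + source + " -> " + target);
                        holders.add(target);
                    }
                }
            }
        }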
  20. Eventual Consistency
      • The read-modify-write pattern is problematic:
        1. Read value
        2. Apply business logic
        3. Write value
        A stale read leads to junk
      • What about atomic increments?
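      A minimal sketch of the contrast, using the stock HBase client API (the
      table and column names are made up): the first block is the racy
      read-modify-write, the second pushes the increment to the server so no
      stale read can corrupt the counter.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.*;
        import org.apache.hadoop.hbase.util.Bytes;

        public class UnreadCounter {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                try (Connection conn = ConnectionFactory.createConnection(conf);
                     Table table = conn.getTable(TableName.valueOf("mailbox"))) {
                    byte[] row = Bytes.toBytes("user123");
                    byte[] cf  = Bytes.toBytes("meta");
                    byte[] col = Bytes.toBytes("unread");

                    // 1-2-3 pattern from the slide: racy under eventual consistency,
                    // because a stale read here silently overwrites someone else's update.
                    Result r = table.get(new Get(row).addColumn(cf, col));
                    long unread = r.isEmpty() ? 0 : Bytes.toLong(r.getValue(cf, col));
                    table.put(new Put(row).addColumn(cf, col, Bytes.toBytes(unread + 1)));

                    // Atomic server-side increment: no client-side read-modify-write at all.
                    long newValue = table.incrementColumnValue(row, cf, col, 1L);
                    System.out.println("unread = " + newValue);
                }
            }
        }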
  21. Conflict Resolution
      • Easy to resolve conflicts in increments
      • Imagine multi-row transactions
        – Pointless to resolve conflicts at the row level
      • Solve conflicts at the highest possible layer
        – Transaction Monitor
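      One way to see why increments reconcile so easily: if each replica tracks
      its own contribution, merging two divergent copies is just an
      element-wise maximum. The grow-only counter below illustrates that idea;
      it is not something from the talk.

        import java.util.HashMap;
        import java.util.Map;

        public class GrowOnlyCounter {
            // replicaId -> total amount contributed by that replica
            private final Map<String, Long> perReplica = new HashMap<>();

            public void increment(String replicaId, long delta) {
                perReplica.merge(replicaId, delta, Long::sum);
            }

            public long value() {
                return perReplica.values().stream().mapToLong(Long::longValue).sum();
            }

            // Conflict resolution: element-wise max of the two per-replica maps.
            public void merge(GrowOnlyCounter other) {
                other.perReplica.forEach((id, v) -> perReplica.merge(id, v, Math::max));
            }
        }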
  22. How did it work out?
      • A ton of missing HBase/HDFS features were added
        – Bloom filters, NameNode HA
        – Remote HLog shipping
        – Modified block placement policy
        – Sticky regions
        – Improved block cache
        – ...
      • User -> App Server routing via ZooKeeper
      • The App Server worked out
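      A sketch of what user -> App Server lookup via ZooKeeper can look like;
      the znode layout and contents below are assumptions for illustration, not
      the production scheme.

        import java.nio.charset.StandardCharsets;
        import org.apache.zookeeper.ZooKeeper;
        import org.apache.zookeeper.data.Stat;

        public class AppServerLookup {
            public static void main(String[] args) throws Exception {
                // Connect to a ZooKeeper ensemble (30s session timeout, no-op watcher).
                ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {});
                try {
                    String userId = "user123";
                    // Hypothetical layout: /messaging/users/<userId> holds "host:port"
                    // of the App Server currently owning that user's mailbox cache.
                    byte[] data = zk.getData("/messaging/users/" + userId, false, new Stat());
                    System.out.println(userId + " -> " + new String(data, StandardCharsets.UTF_8));
                } finally {
                    zk.close();
                }
            }
        }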