
2011 06-30-hadoop-summit v5

Slides from presentation at Hadoop Summit 2011 on Facebook's Data Freeway system


  1. Data Freeway: Scaling Out to Realtime
     Eric Hwang, Sam Rash
     {ehwang,rash}@fb.com
  2. Agenda
     - Data at Facebook
     - Data Freeway System Overview
     - Realtime Requirements
     - Realtime Components
       - Calligraphus/Scribe
       - HDFS use case and modifications
       - Calligraphus: a ZooKeeper use case
       - ptail
       - Puma
     - Future Work
  3. Big Data, Big Applications / Data at Facebook
     - Lots of data
       - More than 500 million active users
       - 50 million users update their statuses at least once each day
       - More than 1 billion photos uploaded each month
       - More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
       - Data rate: over 7 GB/second
     - Numerous products can leverage the data
       - Revenue related: ads targeting
       - Product/user growth related: AYML, PYMK, etc.
       - Engineering/operations related: automatic debugging
       - Puma: streaming queries
  4. Data Freeway System Diagram
  5. Realtime Requirements
     - Scalability: 10-15 GB/second
     - Reliability: no single point of failure
     - Data loss SLA: 0.01%
       - Loss due to hardware: at most 1 out of 10,000 machines can lose data
     - Delay of less than 10 seconds for 99% of data
       - Typically we see about 2 seconds
     - Easy to use: as simple as 'tail -f /var/log/my-log-file'
  6. Scribe
     - Scalable distributed logging framework
     - Very easy to use:
       - scribe_log(string category, string message)
     - Mechanics:
       - Runs on every machine at Facebook
       - Built on top of Thrift
       - Collects log data into a set of destinations
       - Buffers data on local disk if the network is down
     - History:
       - 2007: started at Facebook
       - October 2008: open-sourced
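A minimal sketch of what the scribe_log() call looks like from Java, assuming Thrift stubs (a scribe.Client class and a LogEntry struct) generated from Scribe's open-source IDL; the port 1463 is the conventional Scribe port and the category/message values are illustrative.

    import java.util.Collections;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class ScribeLogExample {
      public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 1463);           // local scribe daemon (assumed port)
        TFramedTransport transport = new TFramedTransport(socket); // scribe speaks framed Thrift
        TBinaryProtocol protocol = new TBinaryProtocol(transport);
        scribe.Client client = new scribe.Client(protocol);        // Thrift-generated stub (assumed name)

        transport.open();
        // Equivalent of scribe_log(category, message): one LogEntry per message.
        LogEntry entry = new LogEntry("my_category", "hello from the app tier");
        client.Log(Collections.singletonList(entry));
        transport.close();
      }
    }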
  7. Calligraphus
     - What
       - Scribe-compatible server written in Java
       - Emphasis on a modular, testable code base and on performance
     - Why?
       - Extract a simpler design from the existing Scribe architecture
       - Cleaner integration with the Hadoop ecosystem
         - HDFS, ZooKeeper, HBase, Hive
     - History
       - In production since November 2010
       - ZooKeeper integration since March 2011
  8. HDFS: a different use case
     - Message hub
       - Add concurrent reader support and sync
       - Writers plus concurrent readers form a pub/sub model
  9. HDFS: add sync
     - Sync
       - Implemented in 0.20 (HDFS-200)
         - Partial chunks are flushed
         - Blocks are persisted
       - Provides durability
       - Lowers write-to-read latency
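A sketch of the write side in the spirit of HDFS-200 on 0.20: the writer calls sync() after each record so partial chunks become visible to readers. The sync() method shown is the 0.20-era API (later renamed hflush()); the path is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncingWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/data-freeway/cat1/current")); // illustrative path
        for (int i = 0; i < 100; i++) {
          out.write(("event " + i + "\n").getBytes("UTF-8"));
          // Flush partial chunks to the datanodes so concurrent readers can see
          // them within seconds instead of waiting for the block to close.
          out.sync();
        }
        out.close();
      }
    }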
  10. HDFS: concurrent reads overview
     - Stock Hadoop 0.20 does not allow reading the block that is being written
     - Realtime applications need to read that block to achieve < 10 s latency
  11. HDFS: concurrent reads implementation
     - DFSClient asks the Namenode for blocks and locations
     - DFSClient asks the Datanode for the length of the block being written
     - Opens the last block
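A sketch of a tailing reader over the writer example above, assuming the concurrent-read changes described on this slide (the client can learn the length of the still-open last block); with stock 0.20 the reader would only see closed blocks. The path and polling interval are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TailingReader {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data-freeway/cat1/current"); // illustrative path
        long offset = 0;
        byte[] buf = new byte[64 * 1024];
        while (true) {
          // Re-open to pick up the current visible length, including the block
          // that is still being written.
          FSDataInputStream in = fs.open(path);
          in.seek(offset);
          int n;
          while ((n = in.read(buf)) > 0) {
            System.out.write(buf, 0, n);
            offset += n;
          }
          in.close();
          Thread.sleep(1000); // poll roughly once a second
        }
      }
    }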
  12. HDFS: checksum problem
     - Issue: data and checksum updates are not atomic for the last chunk
     - 0.20-append fix:
       - Detect when data is out of sync with its checksum using a visible length
       - Recompute the checksum on the fly
     - 0.22 fix:
       - Last chunk's data and checksum are kept in memory for reads
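A minimal illustration (not the HDFS code itself) of the 0.20-append idea: when the stored checksum may lag the data, recompute it over only the bytes up to the visible length rather than trusting the stale on-disk value. The 512-byte chunk size and CRC32 mirror HDFS's per-chunk checksums; the method name is made up for the example.

    import java.util.zip.CRC32;

    public class VisibleLengthChecksum {
      static long checksumOfVisible(byte[] lastChunk, int visibleLength) {
        CRC32 crc = new CRC32();
        crc.update(lastChunk, 0, visibleLength); // ignore bytes past the visible length
        return crc.getValue();
      }

      public static void main(String[] args) {
        byte[] chunk = new byte[512]; // a partially filled last chunk
        int visible = 200;            // bytes a reader is allowed to see
        System.out.println(checksumOfVisible(chunk, visible));
      }
    }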
  13. Calligraphus: Log Writer
     - How do we persist to HDFS?
     - [Diagram: Scribe categories (Category 1-3) flowing into Calligraphus servers, which write to HDFS]
  14. Calligraphus (Simple)
     - Every server writes every category it receives, so:
       number of categories x number of servers = total number of directories
     - [Diagram: Categories 1-3 fanning out across all Calligraphus servers into HDFS]
  15. Calligraphus (Stream Consolidation)
     - Routers forward each category to a designated writer, so:
       number of categories = total number of directories
     - [Diagram: Categories 1-3 -> router tier -> writer tier -> HDFS, coordinated through ZooKeeper]
  16. ZooKeeper: Distributed Map
     - Design
       - ZooKeeper paths as tasks (e.g. /root/<category>/<bucket>)
       - Canonical ZooKeeper leader elections under each bucket for bucket ownership
       - Independent load management: leaders can release tasks
       - Reader-side caches
       - Frequent sync with the policy DB
     - [Diagram: buckets 1-5 under each of categories A-D beneath the root node]
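A hedged sketch of per-bucket ownership via a standard ZooKeeper leader election (ephemeral sequential znodes), following the /root/<category>/<bucket> layout on this slide. The znode names, connect handling, and watch-free check are illustrative, not Calligraphus's actual implementation.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class BucketElection {
      // Assumes /root/<category>/<bucket> already exists.
      public static boolean tryOwnBucket(ZooKeeper zk, String category, int bucket,
                                         String serverId) throws Exception {
        String dir = "/root/" + category + "/" + bucket;
        // Each candidate writer creates an ephemeral sequential child; the lowest
        // sequence number owns the bucket until its session dies or it releases it.
        String me = zk.create(dir + "/candidate-", serverId.getBytes("UTF-8"),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        List<String> children = zk.getChildren(dir, false);
        Collections.sort(children);
        String winner = dir + "/" + children.get(0);
        return winner.equals(me);
      }
    }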
  17. ZooKeeper: Distributed Map
     - Realtime properties
       - Highly available
       - No centralized control
       - Fast mapping lookups
       - Quick failover for writer failures
       - Adapts to new categories and changing throughput
  18. Distributed Map: Performance Summary
     - Bootstrap (~3000 categories)
       - Full election participation in 30 seconds
       - All election winners identified in 5-10 seconds
       - Stable mapping converges in about three minutes
     - Election or failure response is usually < 1 second
       - Worst case bounded in the tens of seconds
  19. Canonical Realtime Application
     - Examples
       - Realtime search indexing
       - Site integrity: spam detection
       - Streaming metrics
  20. Parallel Tailer
     - Why?
       - Access data in 10 seconds or less
       - Data stream interface
     - Command-line tool to tail the log
       - Easy to use: ptail -f cat1
       - Supports checkpoints: ptail -cp XXX cat1
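A sketch of a consumer that launches the ptail command shown above and reads the resulting stream from stdout. The -f flag and category name come from the slide; the downstream processing is illustrative.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class PtailConsumer {
      public static void main(String[] args) throws Exception {
        Process ptail = new ProcessBuilder("ptail", "-f", "cat1").start();
        BufferedReader lines =
            new BufferedReader(new InputStreamReader(ptail.getInputStream(), "UTF-8"));
        String line;
        while ((line = lines.readLine()) != null) {
          // Hand each log line to the application (indexing, counting, etc.).
          System.out.println(line);
        }
      }
    }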
  21. Canonical Realtime ptail Application
  22. Puma Overview
     - Realtime analytics platform
     - Metrics
       - Count, sum, unique count, average, percentile
     - Uses ptail checkpointing for accurate calculations in the case of failure
     - Puma nodes are sharded by keys in the input stream
     - HBase for persistence
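A hedged sketch of the aggregation pattern described above: a Puma-like node consumes its shard of the stream, increments counters per key, and persists them to HBase. It uses the HBase 0.90-era client API; the table and column names, and the checkpointing comment, are assumptions for illustration, not Puma's actual code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PumaLikeCounter {
      private final HTable table;

      public PumaLikeCounter() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        table = new HTable(conf, "puma_metrics");   // assumed table name
      }

      // Called for each event in this node's shard of the ptail stream.
      public void count(String key, long delta) throws Exception {
        Increment inc = new Increment(Bytes.toBytes(key));
        inc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("count"), delta);
        table.increment(inc);
        // A real node would batch increments and record the ptail checkpoint
        // together with the flush so counts stay exact across failures.
      }
    }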
  23. Puma Write Path
  24. Puma Read Path
  25. Summary - Data Freeway
     - Highlights:
       - Scalable: 4-5 GB/second
       - Reliable: no single point of failure; < 0.01% data loss with hardware failures
       - Realtime: delay < 10 seconds (typically about 2 seconds)
     - Open source
       - Scribe, HDFS
       - Calligraphus / Continuous Copier / Loader / ptail (pending)
     - Applications
       - Realtime analytics
       - Search/Feed
       - Spam detection / ads click prediction (in the future)
  26. Future Work
     - Puma
       - Enhance functionality: add application-level transactions on HBase
       - Streaming SQL interface
     - Seekable compression format
       - For large categories, the files are 400-500 MB
       - Need an efficient way to get to the end of the stream
       - Simple Seekable Format
         - Container with compressed/uncompressed stream offsets
         - Contains data segments which are independent virtual files
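A hedged sketch of the "container with compressed/uncompressed stream offsets" idea: keep an index of segment boundaries so a reader can binary-search for the uncompressed position it wants (e.g. near the end of the stream) and start decompressing at the matching compressed offset. The record layout and class names are illustrative, not the actual Simple Seekable Format.

    import java.util.ArrayList;
    import java.util.List;

    public class SeekIndex {
      // One entry per independently compressed segment (a "virtual file").
      static final class Entry {
        final long uncompressedOffset;
        final long compressedOffset;
        Entry(long u, long c) { uncompressedOffset = u; compressedOffset = c; }
      }

      private final List<Entry> entries = new ArrayList<Entry>();

      void add(long uncompressedOffset, long compressedOffset) {
        entries.add(new Entry(uncompressedOffset, compressedOffset));
      }

      // Returns the compressed offset of the segment containing the target
      // uncompressed position, via binary search over the index.
      long seek(long targetUncompressed) {
        int lo = 0, hi = entries.size() - 1, best = 0;
        while (lo <= hi) {
          int mid = (lo + hi) >>> 1;
          if (entries.get(mid).uncompressedOffset <= targetUncompressed) {
            best = mid;
            lo = mid + 1;
          } else {
            hi = mid - 1;
          }
        }
        return entries.get(best).compressedOffset;
      }
    }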
  27. Fin
     - Questions?
