2011 06-30-hadoop-summit v5


Slides from presentation at Hadoop Summit 2011 on Facebook's Data Freeway system

  • Sam (35s) -Who am I? -Worked in batch computing and distributed systems for over 10 years at both research-related facilities and in the internet industry -presently work at facebook; have worked on HDFS, scribe, calligraphus, and realtime metrics Eric (?s)
  • -We’ll give you a little context of how data fits into facebook -also what pieces of datafreeway enable realtime computations (20s)
  • -we have a lot of data at facebook -some of these stats are probably old (6 months old) -accurate one: we handle about 250 TB of data per day with scribe-hdfs, and probably close to 300 TB total -we have both batch and realtime uses of data -we’ll focus on the components of datafreeway that enable realtime use (35s)
  • Entry points: A1. tfb www php code (flib/core/scribe/client.php, or use Nectar) A2. binary/script that sends data to scribe directly (fbcode/scribe/if/scribe.thrift with TFramedProtocol to localhost:1456) A* : all data comes in the form category X message Policy System -manages quotas Realtime B1. raw Data Streams: ptail B2. Realtime Analytics: flagship realtime app; ptail -> Puma/HBase Batch B3. Reporting and Batch Processing: Hadoop/Hive Cluster (Daily tables and Current Tables) (1m30s)
  • Realtime requirements -Scalable: also ease of operations -Reliable: data loss SLA, meaning bounded loss due to hardware failure -Fast: data latency SLA, again in the face of slow and failing hardware -Easy to use: ptail -f -time (1m30s)
  • -scribe is reliable: handles buffering when scribeh is down -really easy to use due to being on every host + language agnostic thrift api -major reason we are doing 7 GB/s (40s)
  • What -in a nutshell, Calligraphus is a Scribe-compatible server that logs to HDFS Why -decided on rewriting just part of Scribe for scribeh; Java made more sense -libhdfs and JNI have had not only memory leaks but also improper error return codes; e.g. ‘exists’ returns -1 both when the file is not there and on error Status -hope to open source soon (1m)
  • -this is a different use of HDFS (not a batch system) -think of each entity as a publisher of data (say they are web hosts) -the consumer of a ptail output stream is a subscriber -writer tags what it publishes with category 1, reader asks for all messages tagged with category 1 -this way, you get both a low-latency pub-sub system and reliable persistence on HDFS! (45-60s)
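The publisher/subscriber reading of HDFS described in this note can be sketched in a few lines. This is only an illustrative in-memory model, not the Data Freeway implementation: the class `CategoryLog`, its methods, and the sample messages are all hypothetical, and a plain Python list stands in for the files a writer appends to on HDFS.

```python
from collections import defaultdict


class CategoryLog:
    """Append-only message log keyed by category (a stand-in for HDFS files)."""

    def __init__(self):
        self._messages = defaultdict(list)

    def publish(self, category, message):
        # Writers only ever append; reading never removes a message.
        self._messages[category].append(message)

    def read_from(self, category, offset):
        """Return (messages at or after offset, offset to resume from)."""
        batch = self._messages[category][offset:]
        return batch, offset + len(batch)


log = CategoryLog()
log.publish("category1", "web01: page view")
log.publish("category2", "web02: click")
log.publish("category1", "web03: page view")

# A subscriber asks for everything tagged category1, then resumes later.
batch, offset = log.read_from("category1", 0)
log.publish("category1", "web01: logout")
later, offset = log.read_from("category1", offset)
```

Because the log is persistent and readers track their own offset, a subscriber that falls behind or restarts can pick up where it left off, which is the property the note attributes to using HDFS as the hub.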
  • -how do we get to HDFS as a message hub? -we first need to make sure clients have a guarantee that writes are persisted and available for read -the fsync call only happens on the first sync() call; the blocks need to be persisted only once, so we don’t hit the Namenode on every client sync call -return from sync means: blocks persisted and every datanode has written the block to disk (may be in OS buffers though) -typically sync is very expensive: experiments with the HBase write-ahead log show a 50% drop in throughput with frequent sync calls (1m)
  • -with sync, next step is to make blocks being written available for read; -recall that one of our realtime requirements is 10s latency for reading data -typical block size is 512MB -4 GB/s is split out across 1000s of files; compression -without concurrent reads, data would not be visible for 1 hour (when we roll files) -concurrent reads essential to satisfy our realtime requirements (1m)
  • -updated FSNameSystem so that the Namenode will return the targets for any file that is currently being written -updated DFSClient so that if it tries to read the block being written, it gets the length from the datanode -can then start reading the data up to the length at the time the datanode was queried -note: one problem here is that the visible length per datanode can vary, so if a datanode goes down, it’s possible to get a different datanode with a shorter length; especially true in the case of lease recovery (truncation) -solved in 0.22 with visible length (sort of) (1m15s)
  • -the issue here is that data and metadata can be out of sync for the last 512-byte chunk -define visible length as the length of data for which metadata exists, and track it in memory -if we read more data than that, we know we will get a CRC error and can therefore compute the CRC on the fly to send to the client -note: we only do the on-the-fly CRC in the strictest of situations so we don’t regress and miss disk errors -trunk solution is more elegant and was done afterwards; possible to back-port (1m)
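The visible-length idea in this note can be illustrated with a small sketch. It is not the HDFS code: the function name `chunk_checksums`, the list-of-CRCs representation of the metadata file, and the use of `zlib.crc32` are assumptions made for illustration. The sketch trusts a stored checksum only when the bytes being read end exactly where that checksum's coverage ends, and recomputes the CRC otherwise.

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by one checksum, as in the note above


def chunk_checksums(data, stored_crcs, visible_length):
    """Return one CRC per chunk of `data` to send along to a reader.

    stored_crcs[i] is the checksum recorded for chunk i at the moment the
    file was `visible_length` bytes long. A chunk that reaches past (or stops
    short of) what its stored checksum covers gets a freshly computed CRC.
    """
    crcs = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        index = start // CHUNK_SIZE
        covered_end = min(start + CHUNK_SIZE, visible_length)
        if index < len(stored_crcs) and start + len(chunk) == covered_end:
            crcs.append(stored_crcs[index])      # metadata matches the data
        else:
            crcs.append(zlib.crc32(chunk))       # on-the-fly CRC
    return crcs


# 700 bytes are on disk, but checksums were last written at 600 bytes.
on_disk = b"a" * 512 + b"b" * 188
stored = [zlib.crc32(on_disk[:512]), zlib.crc32(on_disk[512:600])]

within_visible = chunk_checksums(on_disk[:600], stored, visible_length=600)
past_visible = chunk_checksums(on_disk, stored, visible_length=600)
```

Only the final, partially covered chunk is recomputed here; full chunks keep their stored checksums, so corruption on disk in those chunks would still be detected by the reader.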
  • Calligraphus’ role in the Datafreeway pipeline is to take incoming data streams and persist them to HDFS. Many thousands of client hosts generate log data for a number of categories, and that data is delivered to randomly selected Calligraphus servers for load balancing. Since the ratio of clients to servers is very high, we can safely assume every server will receive roughly every category. The question is how we should persist this to disk. In our system, every stream written to HDFS needs to correspond to a directory to which data is appended in the form of files. Downstream components follow each directory as an independent stream, so to maintain proper stream semantics we should avoid having multiple writers for one directory (a.k.a. stream).
  • A really simple solution is just to have every writer write each category stream independently. The problem with this independent-writer approach is that it is resource-inefficient and unscalable: the number of output streams is approximately the product of the number of categories and the number of servers. Scaling to more machines or categories takes its toll on the Namenode (a resource bottleneck) as well as on downstream components that need to read these streams on a per-category basis.
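To make the scaling argument concrete, here is a back-of-the-envelope sketch. The category count of about 3000 is taken from the performance summary slide in this deck; the server count of 100 and the function names are hypothetical, chosen only to show the shape of the arithmetic.

```python
def independent_streams(num_categories, num_servers):
    # Every server writes its own directory for every category it receives.
    return num_categories * num_servers


def consolidated_streams(buckets_per_category):
    # One writer owns each (category, bucket) pair, so server count drops out.
    return sum(buckets_per_category.values())


NUM_CATEGORIES = 3000   # from the slides
NUM_SERVERS = 100       # hypothetical

simple = independent_streams(NUM_CATEGORIES, NUM_SERVERS)
consolidated = consolidated_streams(
    {f"category{i}": 1 for i in range(NUM_CATEGORIES)}
)
```

With these assumed numbers the independent approach yields 300,000 directories against 3,000 when each category has a single owner, and adding servers grows only the first figure.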
  • This is the approach that we took. A more suitable solution is to consolidate data streams to reduce the number of output streams. We do this by breaking the Calligraphus servers into two logical components: a router and a writer. We then add an intermediate shuffle phase between the router and writer tiers before writing to HDFS in order to consolidate streams. Calligraphus writers are assigned category streams, or portions of category streams (for large streams), to be written. Based on this assignment, routers direct data streams to the appropriate writers. This drastically reduces the number of output data streams and minimizes all of the problems we mentioned in the last slide. ZooKeeper is the core component in facilitating router-writer interactions, serving as a distributed map for routers and as a stream assignment platform for writers.
  • ZooKeeper can be viewed as a type of lightweight distributed file system. You can put and get data from a hierarchical namespace of nodes that can be accessed like file system paths. The paths that we define consist of a category/bucket pair, and these serve as the root nodes for leader elections. We use buckets in addition to categories in order to partition category streams that are too large for one writer to handle. We run leader elections under each of these paths to determine task assignment. Since election winners may not always have enough capacity, a winner can choose to reject leadership either immediately or shed it in the future if load changes. Aggressive reader-side caches minimize ZooKeeper network IO for queries; this is OK because mappings do not change frequently after stabilizing. Policy db sync propagates new categories into the maps and adjusts numbers of buckets (on the fly).
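The per-bucket election described in this note can be modelled without a ZooKeeper cluster. The sketch below is an in-memory simulation of the usual ephemeral-sequential-node election pattern, in which the live candidate with the lowest sequence number is the leader. The class `Election`, the helper `election_path`, and the writer names are hypothetical; the path layout follows the `/root/<category>/<bucket>` example on the slides.

```python
import itertools


class Election:
    """In-memory stand-in for one election path such as /root/<category>/<bucket>."""

    def __init__(self, path):
        self.path = path
        self._sequence = itertools.count()
        self._candidates = {}  # sequence number -> writer id

    def join(self, writer_id):
        # Mimics creating an ephemeral sequential node under the path.
        seq = next(self._sequence)
        self._candidates[seq] = writer_id
        return seq

    def leave(self, seq):
        # Mimics the ephemeral node vanishing: the writer failed, rejected
        # leadership, or shed the bucket because its load changed.
        self._candidates.pop(seq, None)

    def leader(self):
        # The live candidate with the lowest sequence number owns the bucket.
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


def election_path(category, bucket):
    return f"/root/{category}/{bucket}"


election = Election(election_path("category1", 0))
seq_a = election.join("writer-a")
seq_b = election.join("writer-b")
first = election.leader()
election.leave(seq_a)          # writer-a fails or sheds the bucket
second = election.leader()
```

A router would look up `leader()` for the bucket a message maps to and cache the answer; since each bucket has its own election, no single coordinator decides all assignments.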
  • High availability is an inherent property of ZooKeeper: we can lose a few ZooKeeper nodes and still keep going. No centralized authority (independent elections, writers independently manage load) => no single point of failure. Fast map lookups are a result of client caching. Failover is another property of using ZooKeeper’s ephemeral nodes for leader elections. We adapt to changing conditions by adding new election root nodes for new categories or by adjusting the number of buckets for a category.
  • We can enter about 3000 elections in just under 30 seconds. Routers can run elections to find leaders on all buckets in about 5-10 seconds. Reaching a stable configuration takes a bit more time, since we have a slow-start phase to balance load and writers need a little time after each bucket acquisition to determine their load, but this is OK for our use case since in the long run the mapping is stable. If no mapping exists for some data stream, we buffer the data until it is defined. Later on during operation, we can handle incremental changes to the mapping very quickly: we can respond to election events or failures in less than a second.
  • -Servers log data to NFS filer in a way that consolidates the data into one or more files -tailer app then reads the data -examples: all need data in a timely fashion (30s)
  • 2 key points Hides the fact we have many HDFS instances: user can specify a category and get a stream Checkpointing -high value of generalization of checkpoint mechanism (used in puma) (45s)
  • Configure each server’s local scribed to write to scribeh. The client uses the ptail app to get an aggregated log stream in realtime (25s)
  • -the goal of Puma is to provide a configurable realtime analytics platform. -customers can setup pipelines here similar to with Hive, but get the data in realtime -puma is in fact a canonical streaming app in how it uses ptail -we also leverage HBase for persistence (30s)
  • Write flow: -ptail delivers log lines at up to 600,000 per second -Driver includes a parser and processor that filters lines and sends appropriate lines to the Aggregation Store -Aggregation Store updates any metrics based on the parsed entry -if the Driver sees a checkpoint line, it instead passes it to the checkpoint handler -the checkpoint handler decides if it should flush; if so, it tells Aggregation to persist changes to Storage and then writes checkpoint data to Storage Notes: Storage is an interface; have memory and HBase implementations; can change to other forms such as MySQL (1m15s)
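The write flow in this note can be sketched as a small driver loop. Everything concrete here is an assumption made for illustration: the tab-separated `key<TAB>payload` line format, the `CHECKPOINT <token>` marker, the class names, and the choice of a simple count as the metric. Only the order of operations (aggregate, then on a checkpoint line persist metrics followed by the checkpoint) comes from the note.

```python
from collections import Counter


class MemoryStorage:
    """In-memory Storage; the note mentions memory and HBase implementations."""

    def __init__(self):
        self.metrics = Counter()
        self.checkpoint = None

    def persist(self, deltas, checkpoint):
        self.metrics.update(deltas)      # first the aggregated changes
        self.checkpoint = checkpoint     # then the checkpoint data


class Driver:
    def __init__(self, storage, flush_every=1):
        self.storage = storage
        self.pending = Counter()         # aggregation store: unflushed counts
        self.flush_every = flush_every
        self.checkpoints_seen = 0

    def handle(self, line):
        if line.startswith("CHECKPOINT "):
            self._on_checkpoint(line.split(" ", 1)[1])
            return
        fields = line.split("\t")
        if len(fields) != 2:             # filter lines that do not parse
            return
        key, _payload = fields
        self.pending[key] += 1

    def _on_checkpoint(self, checkpoint):
        # The checkpoint handler decides whether this checkpoint flushes.
        self.checkpoints_seen += 1
        if self.checkpoints_seen % self.flush_every == 0:
            self.storage.persist(self.pending, checkpoint)
            self.pending = Counter()


storage = MemoryStorage()
driver = Driver(storage)
for line in ["like\tuser1", "like\tuser2", "garbage",
             "share\tuser1", "CHECKPOINT cp-1", "like\tuser3"]:
    driver.handle(line)
```

After this run the storage holds the counts up to `cp-1`, while the line that arrived after the checkpoint is still pending in memory and would be replayed from the checkpoint if the process restarted.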
  • Read path: -client makes a request to a Thrift server -server proxies the request to the Store implementation, which queries HBase Performance: -elapsed time typically 200-300 ms for 30-day queries -99th percentile, cross-country, < 500 ms for 30-day queries (1m)
  • -scribe & calligraphus get data into the system -HDFS at the core -ptail provides data out -puma is our emerging streaming analytics platform (20s)
  • -puma: basically implemented a shared write-ahead log in an HBase table for our specific use case -using LZO or GZIP, we can’t seek to the end of the uncompressed stream -solution: provide a basic container structure with information about compressed/uncompressed blocks -each 1M block contains a header with a list of compressed/uncompressed offsets in the stream (1m)
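The seekable-container idea in the last note can be sketched as follows. This is a simplified illustration and not the format described in the talk: it keeps the offset index as a separate in-memory list instead of per-block headers, uses `zlib` in place of LZO or GZIP, and the function names are hypothetical. The point it shows is that independently compressed blocks plus a table of (uncompressed offset, compressed offset) pairs let a reader jump near the end of the stream without decompressing everything before it.

```python
import bisect
import zlib

BLOCK_SIZE = 1 << 20  # 1 MiB of uncompressed data per block, per the note


def write_container(data, block_size=BLOCK_SIZE):
    """Compress `data` block by block; return (blob, index).

    index holds one (uncompressed_offset, compressed_offset) pair per block.
    Blocks are compressed independently so any block can be decoded alone.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        index.append((start, len(blob)))
        blob += zlib.compress(data[start:start + block_size])
    return bytes(blob), index


def read_from(blob, index, offset):
    """Return the uncompressed bytes from `offset` to the end of the stream."""
    if not index:
        return b""
    # Find the last block whose uncompressed start is <= offset.
    i = bisect.bisect_right([u for u, _ in index], offset) - 1
    out = bytearray()
    for j in range(i, len(index)):
        c_start = index[j][1]
        c_end = index[j + 1][1] if j + 1 < len(index) else len(blob)
        out += zlib.decompress(blob[c_start:c_end])
    return bytes(out[offset - index[i][0]:])


data = bytes(range(256)) * 20                  # 5120 bytes of sample data
blob, index = write_container(data, block_size=1024)
tail = read_from(blob, index, 3000)            # decodes only blocks 2..4
```

A tailing reader that restarts from a checkpointed uncompressed offset would use the same lookup to skip every block that ends before that offset.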

    1. Data Freeway: Scaling Out to Realtime
         • Eric Hwang, Sam Rash
         • {ehwang,rash}@fb.com
    2. Agenda
         • Data at Facebook
         • Data Freeway System Overview
         • Realtime Requirements
         • Realtime Components
             • Calligraphus/Scribe
             • HDFS use case and modifications
             • Calligraphus: a Zookeeper use case
             • ptail
             • Puma
         • Future Work
    3. Big Data, Big Applications / Data at Facebook
         • Lots of data
             • more than 500 million active users
             • 50 million users update their statuses at least once each day
             • More than 1 billion photos uploaded each month
             • More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
             • Data rate: over 7 GB / second
         • Numerous products can leverage the data
             • Revenue related: Ads Targeting
             • Product/User Growth related: AYML, PYMK, etc
             • Engineering/Operation related: Automatic Debugging
             • Puma: streaming queries
    4. Data Freeway System Diagram
    5. Realtime Requirements
         • Scalability: 10-15 GBytes/second
         • Reliability: No single point of failure
         • Data loss SLA: 0.01%
             • loss due to hardware: means at most 1 out of 10,000 machines can lose data
         • Delay of less than 10 sec for 99% of data
             • Typically we see 2s
         • Easy to use: as simple as ‘tail -f /var/log/my-log-file’
    6. Scribe
         • Scalable distributed logging framework
         • Very easy to use:
             • scribe_log(string category, string message)
         • Mechanics:
             • Runs on every machine at Facebook
             • Built on top of Thrift
             • Collect the log data into a bunch of destinations
             • Buffer data on local disk if network is down
         • History:
             • 2007: Started at Facebook
             • 2008 Oct: Open-sourced
    7. Calligraphus
         • What
             • Scribe-compatible server written in Java
             • emphasis on modular, testable code-base, and performance
         • Why?
             • extract simpler design from existing Scribe architecture
             • cleaner integration with Hadoop ecosystem
                 • HDFS, Zookeeper, HBase, Hive
         • History
             • In production since November 2010
             • Zookeeper integration since March 2011
    8. HDFS: a different use case
         • message hub
             • add concurrent reader support and sync
             • writers + concurrent readers a form of pub/sub model
    9. HDFS: add Sync
         • Sync
             • implement in 0.20 (HDFS-200)
                 • partial chunks are flushed
                 • blocks are persisted
             • provides durability
             • lowers write-to-read latency
    10. HDFS: Concurrent Reads Overview
         • Without changes, stock Hadoop 0.20 does not allow access to the block being written
         • Need to read the block being written for realtime apps in order to achieve < 10s latency
    11. HDFS: Concurrent Reads Implementation
         • DFSClient asks Namenode for blocks and locations
         • DFSClient asks Datanode for length of block being written
         • opens last block
    12. HDFS: Checksum Problem
         • Issue: data and checksum updates are not atomic for last chunk
         • 0.20-append fix:
             • detect when data is out of sync with checksum using a visible length
             • recompute checksum on the fly
         • 0.22 fix
             • last chunk data and checksum kept in memory for reads
    13. Calligraphus: Log Writer
         • [Diagram labels: Scribe categories (Category 1, 2, 3); Calligraphus Servers (three servers); HDFS]
         • How to persist to HDFS?
    14. Calligraphus (Simple)
         • [Diagram labels: Scribe categories (Category 1, 2, 3); Calligraphus Servers (three servers); HDFS]
         • Number of categories x Number of servers = Total number of directories
    15. Calligraphus (Stream Consolidation)
         • [Diagram labels: Scribe categories (Category 1, 2, 3); Calligraphus Servers (three routers, three writers); ZooKeeper; HDFS]
         • Number of categories = Total number of directories
    16. ZooKeeper: Distributed Map
         • Design
             • ZooKeeper paths as tasks (e.g. /root/<category>/<bucket>)
             • Canonical ZooKeeper leader elections under each bucket for bucket ownership
             • Independent load management: leaders can release tasks
             • Reader-side caches
             • Frequent sync with policy db
         • [Diagram labels: Root; categories A, B, C, D; buckets 1-5 listed for each category]
    17. ZooKeeper: Distributed Map
         • Real-time Properties
             • Highly available
             • No centralized control
             • Fast mapping lookups
             • Quick failover for writer failures
             • Adapts to new categories and changing throughput
    18. Distributed Map: Performance Summary
         • Bootstrap (~3000 categories)
             • Full election participation in 30 seconds
             • Identify all election winners in 5-10 seconds
             • Stable mapping converges in about three minutes
         • Election or failure response usually <1 second
             • Worst case bounded in tens of seconds
    19. Canonical Realtime Application
         • Examples
             • Realtime search indexing
             • Site integrity: spam detection
             • Streaming metrics
    20. Parallel Tailer
         • Why?
             • Access data in 10 seconds or less
             • Data stream interface
         • Command-line tool to tail the log
             • Easy to use: ptail -f cat1
             • Support checkpoint: ptail -cp XXX cat1
    21. Canonical Realtime ptail Application
    22. Puma Overview
         • realtime analytics platform
         • metrics
             • count, sum, unique count, average, percentile
         • uses ptail checkpointing for accurate calculations in the case of failure
         • Puma nodes are sharded by keys in the input stream
         • HBase for persistence
    23. Puma Write Path
    24. Puma Read Path
    25. Summary - Data Freeway
         • Highlights:
             • Scalable: 4G-5G Bytes/Second
             • Reliable: No single-point of failure; < 0.01% data loss with hardware failures
             • Realtime: delay < 10 sec (typically 2s)
         • Open-Source
             • Scribe, HDFS
             • Calligraphus/Continuous Copier/Loader/ptail (pending)
         • Applications
             • Realtime Analytics
             • Search/Feed
             • Spam Detection/Ads Click Prediction (in the future)
    26. Future Work
         • Puma
             • Enhance functionality: add application-level transactions on Hbase
             • Streaming SQL interface
         • Seekable Compression format
             • for large categories, the files are 400-500 MB
             • need an efficient way to get to the end of the stream
             • Simple Seekable Format
                 • container with compressed/uncompressed stream offsets
                 • contains data segments which are independent virtual files
    27. Fin
         • Questions?