Good morning everyone. My name’s Zheng Shao. Today I am going to talk about real-time analytics at Facebook.
This is the agenda of the talk. We will start with why we need real-time analytics, then get into the details of how we implemented it, and finish with future work and comparisons with other systems.
First of all, what is real-time analytics, and why do we want to do it?
This is the main use case for our analytics. We have a product called Facebook Insights, which allows website owners, advertisers, Facebook application developers, and Facebook page owners to view time series of impression/click/action counters, the same counters broken down by demographics like gender and age, as well as unique user counters and heavy hitters like the most popular URLs. The major challenges of building the backend of the Insights product are twofold. On one hand, we have a huge amount of data coming from both Facebook and non-Facebook websites. On the other hand, customers of the Insights product really want low-latency summaries, so that they can immediately know how popular a new article or a new game is.
We did have an existing, complete data warehouse solution at Facebook to handle the Insights workload. In short, log streams were generated by HTTP servers and transferred to NFS via a log collection framework called Scribe, all within seconds, and then copied/loaded into Hadoop. Summaries were generated by daily pipeline jobs and eventually loaded into MySQL for serving. Specifically, we have a 3000-node Hadoop cluster to handle the scalability issue. The Copier/Loader are map-reduce jobs, which handle machine failures automatically. And the pipeline jobs are written in Hive, which has a SQL-like syntax. This gives pretty good scalability until we hit the data center power limit. But the latency is terrible.
We had 2 ideas on how to improve the latency. The first one is small-batch processing. Instead of using a batch of 1 day, we can produce much smaller batches. The question is how to reduce per-batch overhead, so that tiny batches like 1 min or less make sense. The second one is stream processing. We can aggregate the data as soon as it arrives. This will produce near real-time results. The question is how to make the system reliable against hardware failures. It turns out the per-batch overhead of Map-Reduce is so high that it’s not practical to have even 5-minute batches on our Hadoop cluster, so we finally decided to go with stream processing.
The rest of the talk will focus on two key systems that we built for real-time analytics. The first one, Data Freeway, is a scalable data stream framework on top of Scribe and HDFS. The second one, Puma, is a reliable stream aggregation engine on top of HBase.
This was our old data stream framework. It has several layers of data transport. The first transport, from clients to the mid-tier, reduces the fanout from tens of thousands to hundreds. The second transport shuffles the data based on log categories, so that one log category goes to a single writer. Then the log data gets written into NFS, where it is consumed by the batch copier as well as by unix tail/fopen. In short, it’s a simple push/RPC-based logging system. Scribe was open-sourced in 2008, when we had 100 log categories. It quickly got adopted by a lot of other companies. The routing is driven by static configuration, which is flexible but has two problems: 1. it is not scalable, because we need to maintain a config for each box among the writers, and a single writer per category is not scalable; 2. the writers are single points of failure.
We came up with Data Freeway in 2011. Right now it’s handling 9GB/sec of data at peak with 10 sec end-to-end latency, and has over 2500 log categories. It contains 4 major components. The first one is Scribe. It’s used only at the client, and is responsible for sending out data via RPCs. The second one is called Calligraphus. It uses Zookeeper to manage the ownership of categories, shuffles the data, and writes to HDFS. The third one is called Continuous Copier, which continuously copies files from one HDFS to another as the files grow. The fourth one is called PTail, which tails multiple directories on HDFS in parallel and writes to stdout. Right now we ptail directly from the HDFS written by Calligraphus, but we plan to tail from the HDFS written by Continuous Copier in the future. Let’s get into the details of these components.
Calligraphus is responsible for getting log data from RPC and writing it to the File System. Each log category is represented by 1 or more FS directories. Each directory is an ordered list of files, with the date in the file name. The files can be compressed. This is a very simple protocol for storing log data. Probably the simplest that I can think of. The most interesting feature of Calligraphus is the bucketing support. We have application buckets, which are application-defined shards. These are used for sharded log consumers. Most of the big log consumers are sharded because their log stream is too big. We also support infrastructure buckets, which allow a single application bucket to have a throughput anywhere from several bytes per second to several gigabytes per second. Each infrastructure bucket is a directory, so big streams can go to multiple directories at the same time. Calligraphus has pretty high performance. We call File System sync every 7 seconds, which is the major source of data latency right now. The network throughput can easily saturate a 1Gbit NIC, and we are planning to use 10Gbit NICs some time soon.
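As a rough illustration of the directory layout just described, the sketch below maps a log line to a directory by category, application bucket, and infrastructure bucket. The path scheme, the function names, and the use of CRC32 for sharding are all assumptions made for illustration; the talk does not specify any of them.

```python
import zlib


def app_bucket(shard_key: str, num_app_buckets: int) -> int:
    """Application bucket: a stable, application-defined shard of the stream.

    CRC32 is used here only because it is deterministic across processes.
    """
    return zlib.crc32(shard_key.encode("utf-8")) % num_app_buckets


def directory_for(category: str, app_bucket_id: int, infra_bucket_id: int) -> str:
    """Hypothetical path layout: one directory per infrastructure bucket."""
    return f"/{category}/app_{app_bucket_id:04d}/infra_{infra_bucket_id:04d}"


def file_name(date: str, sequence: int) -> str:
    """Files in a directory are ordered, and the date is part of the name."""
    return f"{date}_{sequence:06d}.log"
```

With zero-padded names like these, sorting the file names lexicographically recovers the order in which the files were written, which is what makes a directory usable as an ordered list of files.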
Continuous Copier is for continuous data transfer from one File System to another. Compared with the batch-based map-reduce copier, it provides much lower latency as well as smooth network usage. Right now it’s implemented as a long-running map-only job, but it can easily be moved to any simple job scheduling system other than map-reduce. Right now it uses lock files in HDFS for coordination among different nodes, and we plan to move to Zookeeper very soon. The peak throughput of Continuous Copier in production is about 3GB/sec compressed right now.
The last component in Data Freeway is PTail, which transfers data from a File System to an output stream. The key feature of PTail is the checkpoint. A PTail checkpoint contains the current files and the file offsets in each of the directories. This makes it possible for PTail to roll back to an earlier checkpoint, and reproduce the data stream without any data loss/duplicates at the boundary.
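The checkpoint idea can be sketched in a few lines. This is a minimal illustration, not PTail itself: the file system is an in-memory dictionary, offsets are counted in lines rather than bytes, directories are read one after another instead of in parallel, and all names are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

# directory -> {file name -> lines}; stands in for a real file system.
FileSystem = Dict[str, Dict[str, List[str]]]
# directory -> (current file name, offset in that file, counted in lines here)
Checkpoint = Dict[str, Tuple[str, int]]


def tail(fs: FileSystem, checkpoint: Checkpoint) -> Iterator[Tuple[str, Checkpoint]]:
    """Yield (line, checkpoint taken right after that line), resuming from `checkpoint`."""
    position = dict(checkpoint)
    for directory in sorted(fs):
        start_file, start_offset = position.get(directory, ("", 0))
        for name in sorted(fs[directory]):
            if name < start_file:
                continue  # this file was fully consumed before the checkpoint
            offset = start_offset if name == start_file else 0
            for line in fs[directory][name][offset:]:
                offset += 1
                position[directory] = (name, offset)
                yield line, dict(position)
```

Because every emitted line comes with the checkpoint that follows it, a consumer that stored a checkpoint can restart `tail` from it and see exactly the lines that came after, with nothing lost or repeated at the boundary.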
To wrap up Data Freeway, we support 2 channels for data transfer. Push via RPC has lower latency and very low code complexity, but it can have some loss/dups when the network has a problem, and it is less robust with respect to machine failures. Pull via FS has a longer latency, but it does not have any loss/dups, and it is robust to machine failures. The problem is that the code of the File System, especially HDFS, can be pretty complex, and we still need to identify and fix some bugs there. Data Freeway consists of 4 components that allow data transfer between these 2 channels.
This is the simplified architecture of a typical stream aggregation engine. Log streams get aggregated on a set of machines. The summaries are usually saved to storage for persistence. Online serving gets summaries either from the aggregations directly or from the storage. Usually the write throughput is much higher than the read throughput, because analytics data is viewed only by, for example, the owners of the website. In our environment, we have on the order of 1M log lines per second. For each of the log lines, we need to do multiple group-by operations, like by age or by gender. The first key in the group-by is always time/date-related, which means the summaries become static after some time. We also need to support complex aggregations like unique counts and heavy hitters.
Let’s look at our storage choices first. We considered using either MySQL or HBase as our storage engine. HBase is much easier to manage in a distributed environment, which was the major reason that we chose HBase. It also has better write efficiency as well as columnar support. Its read efficiency is inferior because HBase’s cache uses memory less efficiently.
The first architecture that we came up with is called Puma2. We run Puma2 on a set of machines, and use PTail to provide parallel data streams. For each log line, Puma2 issues “increment” operations to HBase. Note that Puma2 servers are all symmetric, which means the same row in HBase can be incremented by multiple Puma2 servers at the same time. HBase can do a single increment operation on multiple columns of the same row, so we can use a single increment operation in HBase to handle multiple Group-Bys. Puma2 went into production in March 2011 and is handling 600K log lines on 100 boxes (Puma2 + HBase).
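To make the "one increment, several Group-Bys" idea concrete, here is a minimal sketch. `FakeCounterTable` is an in-memory stand-in, not the HBase client API, and the row-key format and log-line field names (`hour`, `adid`, `gender`, `age_bucket`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict


class FakeCounterTable:
    """In-memory stand-in for a counter table; not the HBase client API."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def increment(self, row: str, deltas: Dict[str, int]) -> None:
        """One call bumps several columns of the same row."""
        for column, delta in deltas.items():
            self.rows[row][column] += delta


def handle_log_line(table: FakeCounterTable, line: Dict[str, str]) -> None:
    # The time-related key comes first in the row key (assumed format).
    row = f"{line['hour']}|{line['adid']}"
    # Each column is one Group-By breakdown; all are updated in a single call.
    table.increment(row, {
        "total": 1,
        f"gender:{line['gender']}": 1,
        f"age:{line['age_bucket']}": 1,
    })
```

Since the servers are symmetric, any number of processes could call `handle_log_line` against the same row; the counters add up as long as the store applies each increment atomically.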
Here are the pros and cons of the Puma2 architecture. The good thing about Puma2 is that it is extremely simple and easy to maintain. The root reason is that Puma2 servers are symmetric and almost stateless. The only state is the PTail checkpoint, which is saved to HBase periodically. As a result, we can easily add more boxes, or reboot a box if it goes down. However, Puma2 also has its problems. First of all, the HBase increment operation is expensive because it’s a read-and-write, and the read is expensive. It’s also not possible to support aggregations other than counts, because that would need a lot of customized code in HBase. We did a hacky implementation of “most frequent elements” using multiple layers of “frequent element tables”. Finally, Puma2 can produce small data duplicates because “increments” and checkpoint writes are not in a single transaction.
We made some small improvements to Puma2. On the Puma2 service, an obvious idea is to batch the increment requests to reduce the load on HBase. However, it didn’t work well because of the long-tail distribution of Group-By keys. It also made the data less accurate because we cannot save checkpoints in the middle of a batch. On the HBase side, we first optimized the “increment” operation by reducing the number of locks. Another big efficiency improvement came from short-circuited reads from HBase directly to HDFS block files on the disk, instead of going via the DataNode daemon. We also improved HBase reliability under high load. All in all, we are still not happy with Puma2, especially when we try to support unique counters. So we switched to a new architecture called Puma3.
The biggest difference between Puma2 and Puma3 is that in Puma3, we do aggregations in the memory of the Puma3 process instead of in HBase. Local memory operations are much faster, so we can achieve a much higher throughput. In order to do in-memory aggregations, we made Puma3 sharded by aggregation key. That means the input PTail data stream has to be sharded as well. That is supported by the application bucketing feature of Calligraphus. Each shard of Puma3 is basically a hashmap in memory. Each entry of the hashmap is a pair of an aggregation key and a user-defined aggregation, which can be count, sum, avg, or anything else. We use HBase as persistent storage but usually don’t read from it.
The write workflow for Puma3 is pretty simple. Basically, for each log line, we extract the columns for the key and the value. We use the key to look up the in-memory hashmap, and call the user-defined aggregation with the value. Note that, since the log streams are sharded by aggregation key, the same aggregation key won’t appear in more than 1 Puma3 process. This is the key to making Puma3 work.
We checkpoint the state of the Puma3 process into HBase every 5 minutes. Basically, we save all the modified hashmap entries as well as the PTail checkpoint. That means if Puma3 crashes and restarts, it can load the state from HBase via sequential reads, which are pretty fast in HBase. In order to save memory, we also remove hashmap entries from memory once the time window for the aggregation has passed, because we are not going to receive new log lines for that time window again.
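The write and checkpoint workflows can be sketched together as follows. This is an illustration under stated assumptions, not Puma3's code: the persistent store is a plain dictionary standing in for HBase, the aggregation is any function of (accumulator, value), and the class and method names are hypothetical.

```python
from typing import Any, Callable, Dict, Hashable, Set, Tuple

Key = Tuple[int, Hashable]  # (time window, rest of the group-by key)


class ShardSketch:
    def __init__(self, aggregate: Callable[[Any, Any], Any], initial: Any) -> None:
        self.aggregate = aggregate      # user-defined aggregation
        self.initial = initial
        self.state: Dict[Key, Any] = {}
        self.dirty: Set[Key] = set()    # entries modified since the last checkpoint

    def write(self, key: Key, value: Any) -> None:
        self.state[key] = self.aggregate(self.state.get(key, self.initial), value)
        self.dirty.add(key)

    def checkpoint(self, store: Dict[str, Any], ptail_checkpoint: Any) -> None:
        """Save modified entries together with the input position."""
        entries = store.setdefault("entries", {})
        for key in self.dirty:
            entries[key] = self.state[key]
        store["ptail_checkpoint"] = ptail_checkpoint
        self.dirty.clear()

    def recover(self, store: Dict[str, Any]) -> Any:
        """Reload state after a restart; returns the position to resume tailing from."""
        self.state = dict(store.get("entries", {}))
        self.dirty.clear()
        return store.get("ptail_checkpoint")

    def evict_before(self, window: int) -> None:
        """Free memory for closed time windows, keeping anything not yet saved."""
        for key in [k for k in self.state if k[0] < window and k not in self.dirty]:
            del self.state[key]
```

Saving the aggregates and the input position in the same checkpoint is what lets a restarted process resume tailing from a position that matches its reloaded state.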
There are 2 choices for the read workflow. If we want to read uncommitted aggregations, which usually lag by 10 seconds, we serve directly from the in-memory hashmap. We go to HBase only on a miss, which will only happen if the time window of the aggregation has passed. If we want to read committed data, Puma3 will read from HBase and serve. Note that an uncommitted aggregation result can decrease in value if the Puma3 process dies before making the next checkpoint. We plan to have a cache layer between serving and Puma3 to make sure numbers don’t decrease.
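The two read modes amount to a choice of where to look first. A minimal sketch, with plain dictionaries standing in for the in-memory hashmap and for HBase, and with hypothetical function names:

```python
from typing import Any, Dict, Hashable, Optional


def read_uncommitted(memory: Dict[Hashable, Any],
                     store: Dict[Hashable, Any],
                     key: Hashable) -> Optional[Any]:
    """Serve from memory; fall back to the persistent store on a miss."""
    if key in memory:
        return memory[key]
    return store.get(key)


def read_committed(store: Dict[Hashable, Any], key: Hashable) -> Optional[Any]:
    """Serve only what the last checkpoint made durable."""
    return store.get(key)
```

If the process dies between checkpoints, the in-memory value is lost and the next uncommitted read falls back to the older stored value, which is how a served number can go down.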
Puma3 also supports joining with a static table in HBase. The join key has to be the row key in the static HBase table. It’s implemented as a simple distributed hash lookup in a user-defined function. We have found that local cache improves the throughput of the udf a lot.
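A lookup join with a local cache can be sketched like this. The `fetch` callable stands in for a remote read of the static table by row key; it and the other names are hypothetical, and the cache is Python's standard `functools.lru_cache` rather than whatever Puma3 uses.

```python
from functools import lru_cache
from typing import Callable, Optional


def make_cached_lookup(fetch: Callable[[str], Optional[str]],
                       max_entries: int = 100_000) -> Callable[[str], Optional[str]]:
    """Wrap a remote lookup by row key with a bounded local LRU cache."""

    @lru_cache(maxsize=max_entries)
    def lookup(row_key: str) -> Optional[str]:
        return fetch(row_key)

    return lookup
```

Caching is safe here only because the joined table is static; a table that changes would need some form of invalidation or expiry.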
Comparing Puma2 and Puma3, we found that Puma3 is much better in write throughput. We only need 25% of the boxes to handle the same workload. The main reason is that Puma3 only writes to HBase instead of issuing read-and-write increments, and HBase is really good at write throughput. At the same time, Puma3 needs a lot of memory. Basically, all aggregations that can still change need to be stored in memory to sustain the log stream write throughput. Right now we use 60GB of memory per box for the hashmap. In the future, we may use SSDs, which can easily scale to 10x more space per box.
With Puma3, we can easily support these special aggregations, with some approximation. For unique counts, we have implemented a simple adaptive sampling algorithm that samples more aggressively as the number of unique items increases. We can also easily implement the standard Bloom filter for counting. For the most frequent items, we plan to implement the classic lossy counting algorithm and the probabilistic lossy counting algorithm.
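For readers unfamiliar with adaptive sampling, here is a sketch of one well-known textbook form of it. Whether Puma3 uses exactly this variant is not stated in the talk, and the class name, the SHA-1 based hash, and the default capacity are assumptions.

```python
import hashlib


class AdaptiveSampler:
    """Approximate distinct count in bounded memory (textbook adaptive sampling)."""

    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        self.depth = 0       # keep only hashes divisible by 2**depth
        self.sample = set()  # hashes of the sampled distinct items

    @staticmethod
    def _hash(item: str) -> int:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item: str) -> None:
        h = self._hash(item)
        if h % (1 << self.depth) != 0:
            return  # item falls outside the current sample
        self.sample.add(h)
        # Too many distinct items: halve the sampling rate and re-filter.
        while len(self.sample) > self.capacity:
            self.depth += 1
            self.sample = {x for x in self.sample if x % (1 << self.depth) == 0}

    def estimate(self) -> int:
        # Each kept hash represents about 2**depth distinct items.
        return len(self.sample) * (1 << self.depth)
```

While the number of distinct items stays below `capacity` the count is exact (barring hash collisions); beyond that, memory stays bounded and the estimate becomes approximate.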
The most important feature of Puma that distinguishes it from other stream processing projects is the language. We have built a SQL-like query language that allows us to define the input stream, the output table, as well as the query itself. Note that the query contains user-defined functions for joins as well as aggregations. Puma3 is right now in the pre-production stage. We plan to push it out to production as soon as we have verified all the summaries against Puma2 and Hive.
Here is a list of things we plan to do next. First is simple scheduling for Puma3. We just need very simple scheduling because the workload is continuous. Most likely we will reuse some existing framework. Second is mass adoption inside the company. We plan to migrate most daily reporting queries from Hive, as long as the query is simple enough to be supported by Puma. This will reduce the latency as well as improve the efficiency, because of the savings in compression/decompression. The third one is open source. Right now, the biggest bottleneck is Java Thrift, which has diverged between Facebook and open source. We plan to open-source the projects one by one, starting with Calligraphus.
There are lots of similar systems in academia as well as other companies.
Instead of comparing them one by one, I will end the presentation with a summary of the key differences. Data Freeway is a scalable data stream framework with 9GB/sec throughput and 10 sec latency. It supports both Push/RPC-based and Pull/File System-based channels. We have components that support arbitrary combinations of channels to adapt to the use case. Puma is a reliable stream aggregation engine. It has good support for time-window-based Group By as well as table-stream Lookup Join. It has a query language, which makes Puma to Realtime-MR what Hive is to MR. Puma has no support, and no plan to support, sliding windows and stream joins, because those are very hard problems that we don’t see in our environment.
Hic 2011 realtime_analytics_at_facebook
Real-time Analytics at Facebook: Data Freeway and Puma
Zheng Shao
12/2/2011
Agenda
1 Analytics and Real-time
2 Data Freeway
3 Puma
4 Future Works
Facebook Insights
• Use cases
  ▪ Websites/Ads/Apps/Pages
  ▪ Time series
  ▪ Demographic break-downs
  ▪ Unique counts/heavy hitters
• Major challenges
  ▪ Scalability
  ▪ Latency
Analytics based on Hadoop/Hive
[Diagram: HTTP → (seconds) → Scribe → (seconds) → NFS → Copier/Loader (hourly) → Hive on Hadoop → Pipeline Jobs (daily) → MySQL]
• 3000-node Hadoop cluster
• Copier/Loader: Map-Reduce hides machine failures
• Pipeline Jobs: Hive allows SQL-like syntax
• Good scalability, but poor latency! 24 – 48 hours.
How to Get Lower Latency?
• Small-batch Processing
  ▪ Run Map-reduce/Hive every hour, every 15 min, every 5 min, …
  ▪ How do we reduce per-batch overhead?
• Stream Processing
  ▪ Aggregate the data as soon as it arrives
  ▪ How to solve the reliability problem?
Decisions
• Stream Processing wins!
• Data Freeway
  ▪ Scalable Data Stream Framework
• Puma
  ▪ Reliable Stream Aggregation Engine
Scribe
[Diagram: Scribe Clients → Scribe Mid-Tier → Scribe Writers → NFS; NFS → Batch Copier → HDFS; NFS → tail/fopen → Log Consumer]
• Simple push/RPC-based logging system
• Open-sourced in 2008. 100 log categories at that time.
• Routing driven by static configuration.
Data Freeway
[Diagram: Scribe Clients → Calligraphus Mid-tier → Calligraphus Writers → HDFS → PTail → Log Consumer; HDFS → Continuous Copier → HDFS → PTail (in the plan); Zookeeper alongside Calligraphus]
• 9GB/sec at peak, 10 sec latency, 2500 log categories
Calligraphus
• RPC → File System
  ▪ Each log category is represented by 1 or more FS directories
  ▪ Each directory is an ordered list of files
• Bucketing support
  ▪ Application buckets are application-defined shards.
  ▪ Infrastructure buckets allow log streams from x B/s to x GB/s
• Performance
  ▪ Latency: Call sync every 7 seconds
  ▪ Throughput: Easily saturate 1Gbit NIC
Continuous Copier
• File System → File System
• Low latency and smooth network usage
• Deployment
  ▪ Implemented as long-running map-only job
  ▪ Can move to any simple job scheduler
• Coordination
  ▪ Use lock files on HDFS for now
  ▪ Plan to move to Zookeeper
PTail
[Diagram: files in several directories, with a checkpoint marking the position in each]
• File System → Stream (→ RPC)
• Reliability
  ▪ Checkpoints inserted into the data stream
  ▪ Can roll back to tail from any data checkpoints
  ▪ No data loss/duplicates
Overview
[Diagram: Log Stream → Aggregations → Storage; Serving reads from Aggregations or Storage]
• ~ 1M log lines per second, but light read
• Multiple Group-By operations per log line
• The first key in Group By is always time/date-related
• Complex aggregations: Unique user count, most frequent elements
MySQL and HBase: one page
                    MySQL                        HBase
Parallel            Manual sharding              Automatic load balancing
Fail-over           Manual master/slave switch   Automatic
Read efficiency     High                         Low
Write efficiency    Medium                       High
Columnar support    No                           Yes
Puma2 Architecture
[Diagram: PTail → Puma2 → HBase → Serving]
• PTail provides parallel data streams
• For each log line, Puma2 issues “increment” operations to HBase. Puma2 is symmetric (no sharding).
• HBase: single increment on multiple columns
Puma2: Pros and Cons
• Pros
  ▪ Puma2 code is very simple.
  ▪ Puma2 service is very easy to maintain.
• Cons
  ▪ “Increment” operation is expensive.
  ▪ Does not support complex aggregations.
  ▪ Hacky implementation of “most frequent elements”.
  ▪ Can cause small data duplicates.
Improvements in Puma2
• Puma2
  ▪ Batching of requests. Didn’t work well because of long-tail distribution.
• HBase
  ▪ “Increment” operation optimized by reducing locks.
  ▪ HBase region/HDFS file locality; short-circuited read.
  ▪ Reliability improvements under high load.
• Still not good enough!
Puma3 Architecture
[Diagram: PTail → Puma3 → HBase; Serving reads from Puma3]
• Puma3 is sharded by aggregation key.
• Each shard is a hashmap in memory.
• Each entry in hashmap is a pair of an aggregation key and a user-defined aggregation.
• HBase as persistent key-value storage.
Puma3 Architecture
• Write workflow
  ▪ For each log line, extract the columns for key and value.
  ▪ Look up in the hashmap and call user-defined aggregation
Puma3 Architecture
• Checkpoint workflow
  ▪ Every 5 min, save modified hashmap entries, PTail checkpoint to HBase
  ▪ On startup (after node failure), load from HBase
  ▪ Get rid of items in memory once the time window has passed
Puma3 Architecture
• Read workflow
  ▪ Read uncommitted: directly serve from the in-memory hashmap; load from HBase on miss.
  ▪ Read committed: read from HBase and serve.
Puma3 Architecture
• Join
  ▪ Static join table in HBase.
  ▪ Distributed hash lookup in user-defined function (udf).
  ▪ Local cache improves the throughput of the udf a lot.
Puma2 / Puma3 comparison
• Puma3 is much better in write throughput
  ▪ Use 25% of the boxes to handle the same load.
  ▪ HBase is really good at write throughput.
• Puma3 needs a lot of memory
  ▪ Use 60GB of memory per box for the hashmap
  ▪ SSD can scale to 10x per box.
Puma3 Special Aggregations
• Unique Counts Calculation
  ▪ Adaptive sampling
  ▪ Bloom filter (in the plan)
• Most frequent item (in the plan)
  ▪ Lossy counting
  ▪ Probabilistic lossy counting
PQL – Puma Query Language
• CREATE INPUT TABLE t ('time', 'adid', 'userid');
• CREATE VIEW v AS
    SELECT *, udf.age(userid)
    FROM t
    WHERE udf.age(userid) > 21
• CREATE HBASE TABLE h …
• CREATE LOGICAL TABLE l …
• CREATE AGGREGATION 'abc'
    INSERT INTO l (a, b, c)
    SELECT
      udf.hour(time),
      adid,
      age,
      count(1),
      udf.count_distinc(userid)
    FROM v
    GROUP BY
      udf.hour(time),
      adid,
      age;
Future Works
• Scheduler Support
  ▪ Just need simple scheduling because the work load is continuous
• Mass adoption
  ▪ Migrate most daily reporting queries from Hive
• Open Source
  ▪ Biggest bottleneck: Java Thrift dependency
  ▪ Will come one by one
Similar Systems
• STREAM from Stanford
• Flume from Cloudera
• S4 from Yahoo
• Rainbird/Storm from Twitter
• Kafka from LinkedIn
Key differences
• Scalable Data Streams
  ▪ 9 GB/sec with < 10 sec of latency
  ▪ Both Push/RPC-based and Pull/File System-based
  ▪ Components to support arbitrary combination of channels
• Reliable Stream Aggregations
  ▪ Good support for Time-based Group By, Table-Stream Lookup Join
  ▪ Query Language: Puma : Realtime-MR = Hive : MR
  ▪ No support for sliding window, stream joins
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.