Big Data Real Time Analytics - A Facebook Case Study



Building Your Own Facebook Real Time Analytics System with Cassandra and GigaSpaces.

Facebook's real time analytics system is a good reference for those looking to build their own real time analytics system for big data.

The first part covers the lessons from Facebook's experience and the reason they chose HBase over Cassandra.

In the second part of the session, we learn how to build our own Real Time Analytics system, achieve better performance, gain real business insights and analytics on our big data, and make deployment and scaling significantly simpler using the new version of Cassandra and GigaSpaces Cloudify.

Published in: Technology, Business
  • MySQL DB Counters: Have a row with a key and a counter. Results in lots of database activity. Stats are kept at a day-bucket granularity; every day at midnight the stats roll over. When the rollover period is reached, this produces a burst of writes to the database, which caused a lot of lock contention. They tried to spread the work by taking time zones into account, and tried to shard things differently. The high write rate led to lock contention, it was easy to overload the databases, they had to constantly monitor the databases, and they had to rethink their sharding strategy. The solution was not well tailored to the problem.
    In-Memory Counters: If you are worried about bottlenecks in IO, then throw it all in memory. No scale issues: counters are stored in memory, so writes are fast and the counters are easy to shard. They felt in-memory counters, for reasons not explained, weren't as accurate as other approaches, and even a 1% failure rate would be unacceptable. Analytics drive money, so the counters have to be highly accurate. They didn't implement this system; it was a thought experiment, and the accuracy issue caused them to move on.
    MapReduce: Used Hadoop/Hive for the previous solution. Flexible, easy to get running, and can handle IO, both massive writes and reads. You don't have to know how you will query ahead of time; the data can be stored and then queried. But it is not realtime, has many dependencies and lots of points of failure, and is a complicated system: not dependable enough to hit realtime goals.
    Cassandra: HBase seemed a better solution based on availability and the write rate. The write rate was the huge bottleneck being solved.
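As an illustration of the in-memory counter idea they considered (a hypothetical sketch, not Facebook's code): counters live in memory, so increments are cheap, and a stable hash of the key makes the counters easy to shard.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of sharded in-memory counters: each key gets a
// lock-free LongAdder, and a stable hash decides which shard owns the key.
public class ShardedCounters {
    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final int shards;

    public ShardedCounters(int shards) {
        this.shards = shards;
    }

    // Shard selection: stable hash of the key modulo the shard count.
    public int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards);
    }

    public void increment(String key) {
        counters.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    public long get(String key) {
        LongAdder adder = counters.get(key);
        return adder == null ? 0 : adder.sum();
    }
}
```

Speed is not the issue with this approach; as the notes say, the accuracy guarantees under failure were the reason Facebook moved on.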
  • The Winner: HBase + Scribe + Ptail + Puma. At a high level: HBase stores data across distributed machines. A tailing architecture is used: new events are stored in log files, and the logs are tailed. A system rolls the events up and writes them into storage, and a UI pulls the data out and displays it to users.
    Data Flow: A user clicks Like on a web page, which fires an AJAX request to Facebook. The request is written to a log file using Scribe, which handles issues like file rollover. Scribe is built on the same HDFS file store Hadoop is built on. Log lines are written extremely lean; the more compact the log lines, the more can be stored in memory.
    Ptail: Data is read from the log files using Ptail, an internal tool built to aggregate data from multiple Scribe stores. It tails the log files and pulls data out. Ptail data is separated into three streams so they can eventually be sent to their own clusters in different datacenters: plugin impressions, news feed impressions, and actions (plugin + news feed).
    Puma: Batches data to lessen the impact of hot keys. Even though HBase can handle a lot of writes per second, they still want to batch data: a hot article will generate a lot of impressions and news feed impressions, causing huge data skews and IO issues. The more batching the better. They batch for 1.5 seconds on average; they would like to batch longer, but they have so many URLs that they run out of memory when creating a hashtable. They wait for the last flush to complete before starting a new batch, to avoid lock-contention issues.
    UI Renders Data: Frontends are all written in PHP. The backend is written in Java, and Thrift is used as the messaging format so PHP programs can query Java services. Caching solutions are used to make the web pages display more quickly. Performance varies by statistic: a counter can come back quickly, while finding the top URL in a domain can take longer, ranging from 0.5 to a few seconds. The more and longer data is cached, the less realtime it is.
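Puma's batching behavior described above can be approximated with a small sketch (class and method names are invented; Puma itself is internal to Facebook): counts accumulate in a hash table, a flush roughly every 1.5 seconds hands the batch to storage, and because the flush runs to completion before the next batch starts, the "wait for the last flush" rule falls out naturally.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of Puma-style micro-batching (names hypothetical):
// per-URL counts accumulate in a hash table and are flushed roughly
// every batchMillis. Flushing synchronously means a new batch cannot
// start until the previous one is fully handed off.
public class MicroBatcher {
    private final long batchMillis;
    private final Map<String, Long> batch = new HashMap<>();
    private long batchStart = System.currentTimeMillis();

    public MicroBatcher(long batchMillis) {
        this.batchMillis = batchMillis;
    }

    public void record(String url) {
        batch.merge(url, 1L, Long::sum);
        if (System.currentTimeMillis() - batchStart >= batchMillis) {
            flush();
        }
    }

    // Returns the completed batch; a real system would hand this to
    // storage (e.g. batched HBase increments) instead of the caller.
    Map<String, Long> flush() {
        Map<String, Long> out = new HashMap<>(batch);
        batch.clear();
        batchStart = System.currentTimeMillis();
        return out;
    }
}
```

Note how the batch-size/memory trade-off from the notes appears here: the longer the window, the larger the hash table grows before a flush empties it.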
Different caching TTLs are set in memcache. MapReduce: The data is then sent to MapReduce servers so it can be queried via Hive. This also serves as a backup plan, as the data can be recovered from Hive; raw logs are removed after a period of time.
HBase: A distributed column store and a database interface to Hadoop. Facebook has people working internally on HBase. Unlike a relational database, you don't create mappings between tables and you don't create indexes; the only index is the primary row key. From the row key you can have millions of sparse columns of storage. It's very flexible: you don't have to specify the schema, you just define column families to which you can add keys at any time. A key feature for scalability and reliability is the WAL (write-ahead log), a log of the operations that are supposed to occur. Based on the key, data is sharded to a region server and written to the WAL first. Data is put into memory, and at some point in time, or if enough data has accumulated, the data is flushed to disk. If the machine goes down, you can recreate the data from the WAL, so there's no permanent data loss. Using a combination of the log and in-memory storage, they can handle an extremely high rate of IO reliably. HBase handles failure detection and automatically routes around failures. Currently, HBase resharding is done manually; automatic hot-spot detection and resharding is on the roadmap for HBase, but it's not there yet. Every Tuesday someone looks at the keys and decides what changes to make in the sharding plan.
Schema: Store a bunch of counters on a per-URL basis. The row key, which is the only lookup key, is the MD5 hash of the reversed domain. Selecting the proper key structure helps with scanning and sharding. A problem they have is sharding data properly onto different machines; using an MD5 hash makes it easy to say this range goes here and that range goes there. For URLs they do something similar, plus they add an ID on top of that.
Every URL in Facebook is represented by a unique ID, which is used to help with sharding. A reverse domain, com.facebook/ for example, is used so that the data is clustered together. HBase is really good at scanning clustered data, so if they store the data clustered together they can efficiently calculate stats across domains. Think of every row as a URL and every cell as a counter; you are able to set different TTLs (time to live) for each cell. So if keeping an hourly count, there's no reason to keep that around for every URL forever, so they set a TTL of two weeks. TTLs are typically set on a per-column-family basis.
Per server they can handle 10,000 writes per second. Checkpointing is used to prevent data loss when reading data from log files: tailers save log stream checkpoints in HBase, which are replayed on startup so no data is lost. This is useful for detecting click fraud, though fraud detection isn't built in.
Tailer Hot Spots: In a distributed system there's a chance one part of the system can be hotter than another. One example is region servers that get hot because more keys are being directed their way. One tailer can lag behind another, too. If one tailer is an hour behind and the others are up to date, what numbers do you display in the UI? For example, impressions have a much higher volume than actions, so CTR rates would look way higher in the last hour. The solution is to figure out the least up-to-date tailer and use that when querying metrics.
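The row-key scheme in the notes (an MD5 hash over the reversed domain) can be illustrated as follows. The exact key format Facebook uses isn't published, so treat this as a sketch of the idea rather than their implementation: reversing the domain clusters related pages for efficient scans, while the hash prefix makes key ranges easy to split across shards.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the row-key idea from the notes (format is illustrative):
// reverse the domain so one site's pages cluster together in scans,
// and prefix with an MD5 hash so ranges spread evenly across shards.
public class RowKeys {
    public static String reverseDomain(String domain) {
        String[] parts = domain.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    public static String rowKey(String domain) {
        try {
            String reversed = reverseDomain(domain);
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(reversed.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));   // unsigned hex per byte
            }
            return hex + ":" + reversed;   // hash prefix decides the shard range
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // MD5 is always present on the JDK
        }
    }
}
```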
  • A Potential for Improvement: There are lots of areas in which you can see potential improvements if the assumptions are changed. As a contrast to Facebook's working system:
    We can simplify the design. If memory can be treated as transactional (and it can), we can use events without transforming them as they proceed along our analytics workflow. This makes our design much simpler to implement and test, and performance improves as well.
    We can strengthen the design. With a polling semantic, such systems are brittle, relying on systems that pull data in order to generate realtime analytics. We should be able to reduce the fragility of the system, even while making it faster.
    We can strengthen the implementation. Batching subsystems impose limits that shouldn't exist. For example, one concern in Facebook's implementation is the use of an in-memory hash table that stores intermediate data; the in-memory aspect isn't a concern until you realize that the batch sizes are chosen partly to make sure this hash table doesn't overflow available space.
    We can allow deployments to change databases based on their requirements. There's nothing wrong with HBase, but it has specific characteristics that aren't appropriate for all enterprises. We can design a system that can be deployed on various platforms, and we can migrate the underlying long-term data store to a different database if needed.
    We can consolidate the analytics system so that management is easier and unified. While there are system management standards like SNMP that allow management events to be presented in the same way no matter the source, having so many different pieces means that managing the system requires an encompassing understanding, which makes maintenance and scaling more difficult.
What we want to do, then, is create a general model for an application that can accomplish the same goals as Facebook's realtime analytics system, while leveraging the capabilities that in-memory data grids offer where available, potentially offering improvements in scalability, manageability, latency, platform neutrality, and simplicity, all while increasing ease of data access. That sounds like quite a tall order, but it's doable. The key is to remember that, at heart, realtime analytics represent an event system. Facebook's entire architecture is designed to funnel events through various channels, such that event updates can be managed safely and sequentially. They receive a massive set of events that "look like" marbles, which they line up in single file; they then sort the marbles by color, you might say, and for each color they create a bundle of sticks; the sticks are lit on fire, and when the heat goes past a certain temperature, steam is generated, which turns a turbine. It's a real-life Rube Goldberg machine, which is admirable in that it works, but much of it is unnecessary if the assumptions about memory ("unreliable") and the database ("HBase is the only target that counts") are changed. Looking at the analogy from the previous paragraph, there's no need to change a marble into anything. The marble is enough.
  • Value: write/read scaling through partitioning; performance through memory speed; reliability through replication and redundancy.
  • Value: A data grid like GigaSpaces comes with a rich set of APIs that provide not only the means to store data fast and reliably, but also to access and query the data just as you would with a database. Specifically, GigaSpaces supports both JPA and a Document API, and ways to mix and match between those APIs. Unlike Scribe and log-based systems, we can now look at the data as it comes in, not only once it is stored in the database. The latter makes it possible to partition data based on time: the first day in memory and the rest through the database, etc.
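To make the time-based partitioning concrete, here is a minimal, hypothetical sketch (the interface and class names are invented for illustration and are not the GigaSpaces API): queries over recent data are answered from the in-memory grid, while older ranges fall through to the long-term database.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative time-tiered reads (all names hypothetical): data newer
// than the hot window lives in the in-memory grid; anything older is
// served from the long-term database.
public class TimeTieredReader {
    public interface Store {
        long countFor(String url, Instant from, Instant to);
    }

    private final Store grid;       // hot tier, e.g. the last 24 hours
    private final Store database;   // cold tier, everything older
    private final Duration hotWindow;

    public TimeTieredReader(Store grid, Store database, Duration hotWindow) {
        this.grid = grid;
        this.database = database;
        this.hotWindow = hotWindow;
    }

    // Pick the tier based on how far back the query reaches.
    public Store tierFor(Instant queryStart, Instant now) {
        boolean hot = queryStart.isAfter(now.minus(hotWindow));
        return hot ? grid : database;
    }
}
```

A real deployment would also need queries that straddle the boundary to merge results from both tiers; that is omitted here for brevity.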
  • Collocating the processing with the data provides the biggest gain in terms of scalability and performance, as we reduce the number of network hops as well as serialization overhead. We also reduce the number of moving parts, which in itself simplifies our runtime architecture and our ability to scale. The other benefit is that we decentralize the Puma services from the Facebook example and thus make the entire architecture significantly more scalable.
  • The snippet of code shows the part of the code that generates the statistical information as the events come in. The template defines the filter for the events: in the example, we filter for any event of type Data that has a false value in its "processed" attribute. For every event that matches this filter, the eventListener method will be called with the appropriate data object.
  • Value gained: avoid lock-in to a specific NoSQL API; performance (reduced network hops and serialization overhead); simplicity (fewer moving parts); scalability without compromising on consistency (strict consistency at the front, eventual consistency for the long-term data); JPA and other standard APIs.
  • Content-based routing, workflow.
  • Big Data Real Time Analytics - A Facebook Case Study

    1. 1. Real Time Analytics for Big Data Lessons from Facebook..
    2. 2. The Real Time Boom.. ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved Google Real Time Web Analytics Google Real Time Search Facebook Real Time Social Analytics Twitter paid tweet analytics SaaS Real Time User Tracking New Real Time Analytics Startups..
    3. 3. Analytics @ Twitter
    4. 4. Note the Time dimension
    5. 5. The data resolution & processing models
    6. 6. Traditional analytics applications <ul><li>Scale-up Database </li></ul><ul><ul><li>Use traditional SQL database </li></ul></ul><ul><ul><li>Use stored procedure for event driven reports </li></ul></ul><ul><ul><li>Use flash memory disks to reduce disk I/O </li></ul></ul><ul><ul><li>Use read only replica to scale-out read queries </li></ul></ul><ul><li>Limitations </li></ul><ul><ul><li>Doesn’t scale on write </li></ul></ul><ul><ul><li>Extremely expensive (HW + SW) </li></ul></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    7. 7. CEP – Complex Event Processing <ul><li>Process the data as it comes </li></ul><ul><li>Maintain a window of the data in-memory </li></ul><ul><li>Pros: </li></ul><ul><ul><li>Extremely low-latency </li></ul></ul><ul><ul><li>Relatively low-cost </li></ul></ul><ul><li>Cons </li></ul><ul><ul><li>Hard to scale (Mostly limited to scale-up) </li></ul></ul><ul><ul><li>Not agile - Queries must be pre-generated </li></ul></ul><ul><ul><li>Fairly complex </li></ul></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    8. 8. In Memory Data Grid <ul><li>Distributed in-memory database </li></ul><ul><li>Scale out </li></ul><ul><li>Pros </li></ul><ul><ul><li>Scale on write/read </li></ul></ul><ul><ul><li>Fits event-driven (CEP style) and ad-hoc query models </li></ul></ul><ul><li>Cons </li></ul><ul><ul><li>Cost of memory vs disk </li></ul></ul><ul><ul><li>Memory capacity is limited </li></ul></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    9. 9. NoSQL <ul><li>Use distributed database </li></ul><ul><ul><li>HBase, Cassandra, MongoDB </li></ul></ul><ul><li>Pros </li></ul><ul><ul><li>Scale on write/read </li></ul></ul><ul><ul><li>Elastic </li></ul></ul><ul><li>Cons </li></ul><ul><ul><li>Read latency </li></ul></ul><ul><ul><li>Consistency tradeoffs are hard </li></ul></ul><ul><ul><li>Maturity – fairly young technology </li></ul></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    10. 10. Hadoop MapReduce <ul><li>Distributed batch processing </li></ul><ul><li>Pros </li></ul><ul><ul><li>Designed to process massive amounts of data </li></ul></ul><ul><ul><li>Mature </li></ul></ul><ul><ul><li>Low cost </li></ul></ul><ul><li>Cons </li></ul><ul><ul><li>Not real-time </li></ul></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    11. 11. Hadoop Map/Reduce – Reality check.. ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    12. 12. So what’s the bottom line? ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    13. 13. Facebook Real-time Analytics System ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    14. 14. Goals <ul><li>Show why plugins are valuable. </li></ul><ul><ul><li>What value is your business deriving from it? </li></ul></ul><ul><li>Make the data more actionable. </li></ul><ul><ul><li>Help users take action to make their content more valuable. </li></ul></ul><ul><ul><li>How many people see a plugin, how many people take action on it, and how many are converted to traffic back on your site. </li></ul></ul><ul><li>Make the data more timely. </li></ul><ul><ul><li>Went from a 48-hour turnaround to 30 seconds. </li></ul></ul><ul><ul><li>Multiple points of failure were removed to meet this goal. </li></ul></ul><ul><li>Handle massive load </li></ul><ul><ul><li>20 billion events per day (200,000 events per second) </li></ul></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    15. 15. The actual analytics.. <ul><li>Like button analytics </li></ul><ul><li>Comments box analytics </li></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    16. 16. Technology Evaluation <ul><li>MySQL DB Counters </li></ul><ul><li>In-Memory Counters </li></ul><ul><li>MapReduce </li></ul><ul><li>Cassandra </li></ul><ul><li>HBase </li></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    17. 17. The solution.. PTail Scribe Puma HBase HDFS Real Time Long Term Batch 1.5 Sec 10,000 write/sec per server FACEBOOK Log FACEBOOK Log FACEBOOK Log
    18. 18. Checking the assumptions.. ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    19. 19. Facebook Analytics.Next.. <ul><li>What if.. </li></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved <ul><ul><li>We can rely on memory as a reliable store? </li></ul></ul><ul><ul><li>We can’t decide on a particular NoSQL database? </li></ul></ul><ul><ul><li>We need to package the solution as a product? </li></ul></ul>
    20. 20. Step 1: Use memory.. <ul><li>Instead of treating memory as a cache, why not treat it as a primary data store? </li></ul><ul><ul><li>Facebook keeps 80% of its data in Memory (Stanford research) </li></ul></ul><ul><ul><li>RAM is 100-1000x faster than Disk (Random seek) </li></ul></ul><ul><ul><ul><li>Disk - 5-10ms </li></ul></ul></ul><ul><ul><ul><li>RAM – ~0.001 msec </li></ul></ul></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved Events Memory Grid Data Grid Data Grid Data Grid FACEBOOK FACEBOOK FACEBOOK
    21. 21. Step 1: Use memory.. <ul><li>Reliability is achieved through redundancy and replication </li></ul><ul><li>One Data. Any API </li></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved Events Any API Data Grid FACEBOOK FACEBOOK FACEBOOK
    22. 22. Step 2 – Collocate <ul><li>Putting the code together with the data. </li></ul>Events Processing Grid Data Grid Data Grid Data Grid FACEBOOK FACEBOOK FACEBOOK
    23. 23. Step 2 – Collocate <ul><li>Putting the code together with the data. </li></ul>Events Processing Grid Data Grid Data Grid Data Grid FACEBOOK FACEBOOK FACEBOOK
    @EventDriven @Polling
    public class SimpleListener {
        @EventTemplate
        Data unprocessedData() {
            Data template = new Data();
            template.setProcessed(false);
            return template;
        }
        @SpaceDataEvent
        public Data eventListener(Data event) {
            // process Data here
        }
    }
    24. 24. Step 3 – Write behind to SQL/NoSQL Events Processing Grid Open Long Term persistency Write Behind FACEBOOK FACEBOOK FACEBOOK Data Grid Data Grid Data Grid
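Step 3's write-behind pattern can be sketched in a few lines (all names here are hypothetical; real data grids such as GigaSpaces ship their own write-behind machinery): the update is acknowledged at memory speed, while a background drain moves it to the long-term SQL/NoSQL store.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal write-behind sketch (names hypothetical): updates are queued
// synchronously at memory speed, then drained asynchronously to the
// long-term store so the backend catches up in the background.
public class WriteBehind {
    public interface LongTermStore {
        void persist(String key, long value);
    }

    private final BlockingQueue<Map.Entry<String, Long>> queue = new LinkedBlockingQueue<>();
    private final LongTermStore store;

    public WriteBehind(LongTermStore store) {
        this.store = store;
    }

    public void update(String key, long value) {
        queue.offer(Map.entry(key, value));   // memory-speed acknowledgment
    }

    // In a real deployment a background thread would call this in a loop.
    public int drainOnce() {
        int n = 0;
        Map.Entry<String, Long> e;
        while ((e = queue.poll()) != null) {
            store.persist(e.getKey(), e.getValue());
            n++;
        }
        return n;
    }
}
```

The design choice this illustrates is the one on the slide: because the grid itself is replicated, the database no longer sits on the write path, and it can be swapped (RDBMS or NoSQL) behind the LongTermStore boundary.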
    25. 25. Economic Data Scaling <ul><li>Combine memory and disk </li></ul><ul><ul><li>Memory is 100x-1000x lower cost than disk for high data-access rates (Stanford research) </li></ul></ul><ul><ul><li>Disk is lower cost for high capacity at lower access rates. </li></ul></ul><ul><ul><li>Solution: </li></ul></ul><ul><ul><ul><li>Memory - short-term data </li></ul></ul></ul><ul><ul><ul><li>Disk - long-term data </li></ul></ul></ul><ul><ul><li>Only ~16G is required to store the log in memory ( 500b messages at 10k/h ) at a cost of ~$32/month per server. </li></ul></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved Memory Disk
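The slide's "~16G" figure can be sanity-checked with back-of-envelope arithmetic. Its "10k/h" annotation is ambiguous; assuming it refers to the ~10,000 writes/sec per-server rate quoted earlier, with ~500-byte records retained for roughly a one-hour window (the rate, window, and record size are assumptions, not published numbers), the math lands close:

```java
// Back-of-envelope check of the "~16G" figure, assuming ~500-byte records
// arriving at ~10,000 writes/sec per server, retained for ~1 hour.
public class MemorySizing {
    public static double gigabytesForWindow(int bytesPerMsg, int msgsPerSec, int seconds) {
        return (double) bytesPerMsg * msgsPerSec * seconds / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // 500 B * 10,000/sec * 3600 sec = 18 * 10^9 bytes, i.e. about 16.8 GB,
        // which is in the same ballpark as the slide's ~16G.
        System.out.printf("%.1f GB%n", gigabytesForWindow(500, 10_000, 3600));
    }
}
```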
    26. 26. Economic Scaling <ul><li>Automation - reduce operational cost </li></ul><ul><li>Elastic Scaling – reduce over provisioning cost </li></ul><ul><li>Cloud portability (JClouds) – choose the right cloud for the job </li></ul><ul><li>Cloud bursting – scavenge extra capacity when needed </li></ul>® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    27. 27. Putting it all together Analytic Application Event Sources Write behind <ul><li>- In Memory Data Grid </li></ul><ul><li>- RT Processing Grid </li></ul><ul><li>Light Event Processing </li></ul><ul><li>Map-reduce </li></ul><ul><li>Event driven </li></ul><ul><li>Execute code with data </li></ul><ul><li>Transactional </li></ul><ul><li>Secured </li></ul><ul><li>Elastic </li></ul><ul><li>NoSQL DB </li></ul><ul><li>Low cost storage </li></ul><ul><li>Write/Read scalability </li></ul><ul><li>Dynamic scaling </li></ul><ul><li>Raw Data and aggregated Data </li></ul>Generate Patterns
    28. 28. Putting it all together Analytic Application Event Sources Write behind <ul><li>- In Memory Data Grid </li></ul><ul><li>- RT Processing Grid </li></ul><ul><li>Light Event Processing </li></ul><ul><li>Map-reduce </li></ul><ul><li>Event driven </li></ul><ul><li>Execute code with data </li></ul><ul><li>Transactional </li></ul><ul><li>Secured </li></ul><ul><li>Elastic </li></ul><ul><li>NoSQL DB </li></ul><ul><li>Low cost storage </li></ul><ul><li>Write/Read scalability </li></ul><ul><li>Dynamic scaling </li></ul><ul><li>Raw Data and aggregated Data </li></ul>Generate Patterns Real Time Map/Reduce
    Script script = new StaticScript("groovy", "println hi; return 0");
    Query q = em.createNativeQuery("execute ?");
    q.setParameter(1, script);
    Integer result = q.getSingleResult();
    29. 29. 5x better performance per server! <ul><li>Hardware – Linux </li></ul><ul><ul><li>HP DL380 G6 servers - each has: </li></ul></ul><ul><ul><li>2 Intel quad-core Xeon X5560 processors (2.8 GHz Nehalem) </li></ul></ul><ul><ul><li>32 GB RAM (4 GB per core) </li></ul></ul><ul><ul><li>6 * 146 GB 15K RPM SAS disks </li></ul></ul><ul><ul><li>Red Hat 5.2 </li></ul></ul>Event injector Up to 128 threads GigaSpaces/ (Other Msg Server) App Services Up to 128 threads Other Giga 50,000 write/sec per server
    30. 30. Live demo Inter Day Activity (Real Time) Monthly Trend Analysis
    31. 31. 5 Big Data Predictions ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    32. 32. Summary Big Data Development Made Simple: Focus on your business logic; use a Big Data platform to deal with scalability, performance, continuous availability, etc. It's Open: Use any stack and avoid lock-in: any database (RDBMS or NoSQL), any cloud, common APIs and frameworks. All While Minimizing Cost: Use memory and disk for optimum cost/performance; built-in automation and management reduces operational costs; elasticity reduces over-provisioning cost.
    33. 33. Further reading.. ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    34. 34. Thank YOU! @natishalom