Analytics for the Real-Time Web


Published in: Technology, Business

  1. 1. Analytics for the Real-Time Web <ul><ul><ul><ul><li>Maxim Grinev, Maria Grineva and Martin Hentschel </li></ul></ul></ul></ul>
  2. 2. Outline <ul><li>Real-time Web and its new requirements </li></ul><ul><li>Our system: Triggy </li></ul><ul><ul><li>Short overview of Cassandra </li></ul></ul><ul><ul><li>Triggy internal mechanisms </li></ul></ul><ul><li>Similar systems: </li></ul><ul><ul><li>Yahoo!'s S4 </li></ul></ul><ul><ul><li>Google's Percolator </li></ul></ul><ul><li>Applications </li></ul>
  3. 3. Real-Time Web <ul><li>Web 2.0 + mobile devices = Real-Time Web </li></ul><ul><li>People share what they do now, discuss breaking news on Twitter, share their current locations on Foursquare... </li></ul>
  4. 4. Analytics for the Real-Time Web: new requirements <ul><li>MapReduce is the state of the art for processing Web 2.0 data </li></ul><ul><li>Batch processing (MapReduce) is now too slow </li></ul><ul><li>New requirements: </li></ul><ul><ul><ul><li>real-time processing: aggregate values incrementally, as new data arrives </li></ul></ul></ul><ul><ul><ul><li>database-intensive: aggregate values are stored in a database that is constantly being updated </li></ul></ul></ul>
  5. 5. Our System: Triggy <ul><li>Based on Cassandra, a distributed key-value store </li></ul><ul><li>Provides a programming model similar to MapReduce, adapted to push-style processing </li></ul><ul><li>Extends Cassandra with </li></ul><ul><ul><li>push-style procedures - to propagate data to computations immediately; </li></ul></ul><ul><ul><li>serialization - to ensure serialized access to aggregate values (counters) </li></ul></ul><ul><li>Easily scalable </li></ul>
  6. 6. Cassandra Overview: Data Model <ul><li>Data Model: key-value </li></ul><ul><li>Extends basic key-value with 2 levels of nesting </li></ul><ul><li>Super column - used when the second level of nesting is present </li></ul><ul><li>Column family ~ table; </li></ul><ul><li>key-value pair ~ record </li></ul><ul><li>Keys are stored in sorted order </li></ul>
  7. 7. Cassandra Overview: Incremental Scalability <ul><li>Incremental scalability requires a mechanism to dynamically partition data over the nodes </li></ul><ul><li>Data is partitioned by key using consistent hashing </li></ul><ul><li>Advantage of consistent hashing: the departure or arrival of a node affects only its immediate neighbors; other nodes remain unaffected </li></ul>
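The neighbor-only effect of consistent hashing can be illustrated with a minimal sketch. This is not Cassandra's partitioner: the `HashRing` class, the MD5-based `ring_position` function, and the node and key names are all assumptions chosen for illustration. Each key is owned by the first node at or after its position on the ring.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a position on the ring (hypothetical choice: MD5)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: one position per node, no virtual nodes."""

    def __init__(self, nodes=()):
        self._positions = []  # sorted ring positions of all nodes
        self._owners = {}     # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owners[pos] = node

    def remove_node(self, node: str) -> None:
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._owners[pos]

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._owners[self._positions[idx % len(self._positions)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

# Only keys taken over by the new node change owner; all others stay put.
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` ends up on `node-d`; keys owned by the other nodes' unaffected ranges keep their owner.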
  8. 8. Cassandra Overview Log-Structured Storage <ul><li>Optimized for write-intensive workloads (log-structured storage) </li></ul>
  9. 9. Triggy Programming Model <ul><li>Modified MapReduce to support push-style processing </li></ul><ul><li>Modified only the reduce function: reduce* </li></ul><ul><li>reduce* incrementally applies a new input value to an already existing aggregate value </li></ul><ul><li>Standard MapReduce signatures: Map(k1, v1) -> list(k2, v2); Reduce(k2, list(v2)) -> (k2, v3) </li></ul>
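A minimal sketch of the push-style model, assuming reduce* has the shape (k2, existing aggregate, v2) -> (k2, new aggregate). That signature is inferred from the description of reduce* on this slide, not taken from Triggy's code. The `push` driver, the hashtag example, and the plain dictionary standing in for the key-value store are hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, Tuple


def map_hashtags(tweet_id: str, text: str) -> Iterable[Tuple[str, int]]:
    """Map(k1, v1) -> list(k2, v2): emit (hashtag, 1) for each hashtag."""
    return [(tag.lower(), 1) for tag in re.findall(r"#(\w+)", text)]


def reduce_star_sum(key: str, aggregate: int, value: int) -> Tuple[str, int]:
    """reduce*: fold one new value into the existing aggregate for the key."""
    return key, aggregate + value


def push(store: Dict[str, int], key: str, value: str,
         map_fn: Callable, reduce_star: Callable, initial: int = 0) -> None:
    """Process one input record as it arrives, updating aggregates in place."""
    for k2, v2 in map_fn(key, value):
        _, store[k2] = reduce_star(k2, store.get(k2, initial), v2)


counts: Dict[str, int] = {}
push(counts, "t1", "Goal! #WorldCup #football", map_hashtags, reduce_star_sum)
push(counts, "t2", "Watching the #worldcup final", map_hashtags, reduce_star_sum)
print(counts)  # {'worldcup': 2, 'football': 1}
```

Unlike a batch Reduce, reduce* never sees the full list of values for a key; it sees only the stored aggregate and the one new value.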
  10. 10. Triggy Programming Model
  11. 12. Triggy Execution of Maps and Reduces <ul><li>We extend each node with a queue and worker threads (which execute Map and Reduce tasks buffered in the queue) </li></ul><ul><li>Map tasks can be executed in parallel at any node in the system and do not require serialization because they do not share any data </li></ul><ul><li>Execution of reduce* tasks has to be serialized for the same key to guarantee correct results: </li></ul><ul><ul><li>We make use of Cassandra's partitioning strategy: equal keys are routed to the same node </li></ul></ul><ul><ul><li>Serialization within a node: locks on keys that are being processed right now </li></ul></ul>
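The per-node mechanism described above (a task queue, worker threads, and locks on keys being processed) might look roughly like the following sketch. It covers a single node only and assumes that routing equal keys to that node has already happened. The `Node` class and its method names are hypothetical, not Triggy's implementation.

```python
import queue
import threading
from collections import defaultdict


class Node:
    """Toy node: buffered reduce* tasks, worker threads, per-key locks."""

    def __init__(self, reduce_star, workers: int = 4):
        self.store = {}                       # key -> aggregate value
        self._reduce_star = reduce_star
        self._tasks = queue.Queue()           # buffered (key, value) tasks
        self._key_locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()  # protects the lock table itself
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, key, value) -> None:
        self._tasks.put((key, value))

    def drain(self) -> None:
        """Block until every buffered task has been executed."""
        self._tasks.join()

    def _lock_for(self, key) -> threading.Lock:
        with self._locks_guard:
            return self._key_locks[key]

    def _work(self) -> None:
        while True:
            key, value = self._tasks.get()
            try:
                # Serialize the read-modify-write for this key only;
                # tasks for different keys still run in parallel.
                with self._lock_for(key):
                    current = self.store.get(key, 0)
                    self.store[key] = self._reduce_star(current, value)
            finally:
                self._tasks.task_done()


node = Node(reduce_star=lambda total, value: total + value)
for i in range(1000):
    node.submit(f"user-{i % 10}", 1)
node.drain()
```

After `drain()` returns, each of the ten keys holds a total of 100. Without the per-key lock, two workers could read the same old aggregate and one update would be lost.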
  12. 13. Triggy Fault Tolerance <ul><li>No fault tolerance guarantees for execution of map/reduce* tasks: intermediate data in the queue can be lost </li></ul><ul><li>This is not critical for analytical applications </li></ul>
  13. 14. Triggy Scalability <ul><li>Computation and data storage are tightly coupled: moving the data also moves the computation, which makes the system easy to scale </li></ul><ul><li>A new node is placed next to the most loaded node, and part of that node's data is transferred to the new node </li></ul>
  14. 15. Experiments <ul><li>Generated workload: tweets with user ids (1 .. 100000) with uniform distribution </li></ul><ul><li>The load generator issues as many requests as the system with N nodes can handle </li></ul><ul><li>Application: count the number of words posted by each user. Map: tweet => ( user_id , number_of_words_in_tweet ). Reduce: ( user_id , number_of_words_total , number_of_words_in_tweet ) => ( user_id , number_of_words_total ) </li></ul>
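The experiment's application can be sketched on a single machine as follows. The Map and Reduce shapes follow the slide. The `generate_tweets` load generator, the tweet dictionary layout, the random tweet length of 1 to 20 words, and the in-memory `totals` dictionary are assumptions for illustration; this does not reproduce the authors' benchmark setup.

```python
import random


def generate_tweets(count: int, user_ids: int = 100_000, seed: int = 0):
    """Hypothetical load generator: user ids drawn uniformly from 1..user_ids."""
    rng = random.Random(seed)
    for _ in range(count):
        n_words = rng.randint(1, 20)
        yield {"user_id": rng.randint(1, user_ids),
               "text": " ".join(["word"] * n_words)}


def map_tweet(tweet):
    """Map: tweet => (user_id, number_of_words_in_tweet)."""
    return tweet["user_id"], len(tweet["text"].split())


def reduce_star(user_id, number_of_words_total, number_of_words_in_tweet):
    """Reduce: fold one tweet's word count into the user's running total."""
    return user_id, number_of_words_total + number_of_words_in_tweet


totals = {}          # stands in for the aggregate values kept in the store
words_generated = 0
for tweet in generate_tweets(10_000):
    user_id, n = map_tweet(tweet)
    words_generated += n
    _, totals[user_id] = reduce_star(user_id, totals.get(user_id, 0), n)
```

As a sanity check, the per-user totals sum to the number of words generated.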
  15. 16. Similar Systems: Yahoo!'s S4 <ul><li>Distributed stream processing engine: </li></ul><ul><ul><li>Programming interface: Processing Elements written in Java </li></ul></ul><ul><ul><li>Data routed between Processing Elements by key </li></ul></ul><ul><ul><li>No database. All processing is done in memory, keeping a window of the input data at each Processing Element. </li></ul></ul><ul><li>Used to estimate Click-Through-Rate using a user's behavior within a time window </li></ul>
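To make the contrast with a database-backed aggregate concrete, here is a minimal sketch of in-memory, windowed click-through-rate estimation for a single key. It is not the S4 API (S4 Processing Elements are written in Java); the `CtrWindow` class, its method names, and the timestamps are hypothetical.

```python
from collections import deque


class CtrWindow:
    """Keeps only the events inside a sliding time window, all in memory."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, clicked), oldest first
        self._clicks = 0

    def observe(self, timestamp: float, clicked: bool) -> None:
        """Record one ad impression and whether it was clicked."""
        self._events.append((timestamp, clicked))
        self._clicks += int(clicked)
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop impressions that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, clicked = self._events.popleft()
            self._clicks -= int(clicked)

    def ctr(self) -> float:
        return self._clicks / len(self._events) if self._events else 0.0


window = CtrWindow(window_seconds=10)
window.observe(0, clicked=True)
window.observe(5, clicked=False)
window.observe(12, clicked=True)   # the impression at t=0 is now evicted
print(window.ctr())  # 0.5
```

Because the window lives only in memory, its contents are lost if the process fails, which matches the "no database" design point above.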
  16. 17. Similar Systems: Google's Percolator <ul><li>Percolator is for incremental data processing: based on BigTable </li></ul><ul><li>BigTable - a distributed key-value store: </li></ul><ul><ul><li>the same data model as in Cassandra </li></ul></ul><ul><ul><li>the same log-structured storage </li></ul></ul><ul><ul><li>BigTable - a distributed system with a master; Cassandra - peer-to-peer </li></ul></ul><ul><li>Used at Google for incremental updates of the Web Search Index (instead of MapReduce) </li></ul>
  17. 18. Percolator's Push-style Processing (Observers) <ul><li>Percolator extends BigTable with </li></ul><ul><ul><li>distributed ACID transactions: </li></ul></ul><ul><ul><ul><li>multi-version mechanism with snapshot isolation semantics </li></ul></ul></ul><ul><ul><ul><li>two-phase commit </li></ul></ul></ul><ul><ul><li>observers (similar to database triggers for push-style processing) </li></ul></ul><ul><li>Advantage: no data loss - for each inserted document, the associated observers will be executed </li></ul><ul><li>Overhead: distributed transactions and durable storage of intermediate data </li></ul>
  18. 19. Applications <ul><li>Tracking millions of parameters individually (browser cookies, URLs ...) </li></ul><ul><li>Incremental computation of analytical values allows real-time reaction to events </li></ul><ul><li>Monitoring with no time window, or with a window of any size, for each parameter </li></ul>
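As one example of an analytical value that can be maintained incrementally with no time window, the sketch below keeps a running mean and variance per parameter using Welford's online algorithm. Welford's method is a choice made for this illustration, not something the slides name; the `RunningStats` class and the sample URL key are hypothetical.

```python
from collections import defaultdict


class RunningStats:
    """Running mean and sample variance, updated one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# One small state object per tracked parameter (cookie, URL, ...).
stats = defaultdict(RunningStats)
for load_time in (2.0, 4.0, 6.0):
    stats["example.com/front-page"].update(load_time)

page = stats["example.com/front-page"]
print(page.mean, page.variance)  # 4.0 4.0
```

Each update touches only the state for one key, so the same per-key serialization used for counters would apply here.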
  19. 20. Real-Time Advertising (Short Overview) <ul><li>Real-Time bidding: </li></ul><ul><ul><li>Sites track your browsing behavior via cookies and sell it to advertising services </li></ul></ul><ul><ul><li>Web publishers offer up display inventory to advertising services </li></ul></ul><ul><ul><li>No fixed CPM; instead, each ad impression is sold to the highest bidder </li></ul></ul><ul><li>Retargeting (remarketing) </li></ul><ul><ul><li>Advertisers can retarget a user after any of the following events: (1) the user visited the advertiser's site and left (assuming the site is within the Google content network); (2) the user visited the site, added products to the shopping cart, and then left; (3) the user started the purchase process but stopped partway through. </li></ul></ul>
  20. 21. Using Social Network Profiles to Enhance Advertising <ul><li>Watching for signs of purchase intent among Twitter users </li></ul>
  21. 22. Application: Social Media Optimization for News Sites <ul><li>A/B testing for headlines of news stories </li></ul><ul><li>Optimization of the front page to attract more clicks </li></ul>
  22. 23. Real-Time News Recommendations <ul><li> </li></ul><ul><ul><li>uses the social graph to recommend news </li></ul></ul><ul><ul><li>now - batch rebuilding every 2 hours </li></ul></ul><ul><ul><li>goal - a newspaper updated in real time </li></ul></ul><ul><li>Google News: </li></ul><ul><ul><li>recommendation via collaborative filtering based on users' clicks </li></ul></ul><ul><ul><li>new stories and clicks arrive constantly </li></ul></ul><ul><ul><li>now - batch processing using MapReduce </li></ul></ul>
  23. 24. Other Applications <ul><li>Recommendations based on location check-ins: Foursquare, Facebook Places... </li></ul><ul><li>Social Games: monitoring events from millions of users and reacting in real time </li></ul>
  24. 25. Questions?