- 1. Analytics for the Real-Time Web <ul><ul><ul><ul><li>Maxim Grinev, Maria Grineva and Martin Hentschel </li></ul></ul></ul></ul>
- 2. Outline <ul><li>Real-time Web and its new requirements </li></ul><ul><li>Our system: Triggy </li></ul><ul><ul><li>Short overview of Cassandra </li></ul></ul><ul><ul><li>Triggy internal mechanisms </li></ul></ul><ul><li>Similar systems: </li></ul><ul><ul><li>Yahoo!’s S4 </li></ul></ul><ul><ul><li>Google’s Percolator </li></ul></ul><ul><li>Applications </li></ul>
- 3. Real-Time Web <ul><li>Web 2.0 + mobile devices = Real-Time Web </li></ul><ul><li>People share what they do now, discuss breaking news on Twitter, share their current locations on Foursquare... </li></ul>
- 4. Analytics for the Real-Time Web: new requirements <ul><li>MapReduce: the state of the art for processing Web 2.0 data </li></ul><ul><li>Batch processing (MapReduce) is now too slow </li></ul><ul><li>New requirements: </li></ul><ul><ul><ul><li>real-time processing: aggregate values are updated incrementally as new data arrives </li></ul></ul></ul><ul><ul><ul><li>database-intensive: aggregate values are stored in a database that is constantly being updated </li></ul></ul></ul>
- 5. Our System: Triggy <ul><li>Based on Cassandra, a distributed key-value store </li></ul><ul><li>Provides a programming model similar to MapReduce, adapted to push-style processing </li></ul><ul><li>Extends Cassandra with </li></ul><ul><ul><li>push-style procedures - to immediately propagate the data to computations; </li></ul></ul><ul><ul><li>serialization - to ensure serialized access to aggregate values (counters) </li></ul></ul><ul><li>Easily scalable </li></ul>
- 6. Cassandra Overview Data Model <ul><li>Data model: key-value </li></ul><ul><li>Extends the basic key-value model with 2 levels of nesting </li></ul><ul><li>Super column - used when the second level of nesting is present </li></ul><ul><li>Column family ~ table; </li></ul><ul><li>key-value pair ~ record </li></ul><ul><li>Keys are stored ordered </li></ul>
- 7. Cassandra Overview Incremental Scalability <ul><li>Incremental scalability requires a mechanism to dynamically partition data over the nodes </li></ul><ul><li>Data is partitioned by key using consistent hashing </li></ul><ul><li>Advantage of consistent hashing: the departure or arrival of a node affects only its immediate neighbors; other nodes remain unaffected </li></ul>
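The neighbor-only effect of consistent hashing can be illustrated with a minimal sketch. This is not Cassandra's implementation: the `HashRing` class, the use of MD5 for ring positions, and the node names are all assumptions made for the example, with one ring position per node.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with one position per node."""

    def __init__(self, nodes=()):
        self._positions = []  # sorted ring positions of all nodes
        self._owners = {}     # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owners[pos] = node

    def node_for(self, key: str) -> str:
        """A key belongs to the first node at or after its position, wrapping around."""
        pos = ring_position(key)
        idx = bisect.bisect_left(self._positions, pos) % len(self._positions)
        return self._owners[self._positions[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

# Every key that changed owner moved to the new node; no other node gained keys.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to node-d")
```

Only the keys between the new node and its predecessor on the ring change owner, and they all come from a single neighbor, which is why adding a node does not reshuffle the rest of the cluster.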
- 8. Cassandra Overview Log-Structured Storage <ul><li>Optimized for write-intensive workloads (log-structured storage) </li></ul>
- 9. Triggy Programming Model <ul><li>Modified MapReduce to support push-style processing </li></ul><ul><li>Only the reduce function is modified: reduce* </li></ul><ul><li>reduce* incrementally applies a new input value to an already existing aggregate value </li></ul><ul><li>Map(k1, v1) -> list(k2, v2) </li></ul><ul><li>Reduce(k2, list(v2)) -> (k2, v3) </li></ul>
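The difference between a batch reduce and an incremental reduce* can be sketched as follows. The slide text does not give the exact reduce* signature, so the three-argument form `reduce_star(key, aggregate, value)` and the use of a sum as the aggregate are assumptions for illustration.

```python
def batch_reduce(key, values):
    """Classic Reduce(k2, list(v2)) -> (k2, v3): needs all values up front."""
    return key, sum(values)


def reduce_star(key, aggregate, value):
    """Assumed incremental form: fold one new value into the stored aggregate."""
    return key, aggregate + value


values = [3, 5, 7]

# Push-style processing: values arrive one at a time and update the aggregate.
aggregate = 0
for v in values:
    _, aggregate = reduce_star("k", aggregate, v)

# Both styles agree on the final result.
assert ("k", aggregate) == batch_reduce("k", values)
```

The incremental form works whenever the aggregate can be updated from its previous value and the new input alone, as with counts and sums.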
- 10. Triggy Programming Model
- 12. Triggy Execution of Maps and Reduces <ul><li>We extend each node with a queue and worker threads (which execute Map and Reduce tasks buffered in the queue) </li></ul><ul><li>Map tasks can be executed in parallel at any node in the system and do not require serialization because they do not share any data </li></ul><ul><li>Execution of reduce* tasks has to be serialized for the same key to guarantee correct results: </li></ul><ul><ul><li>We make use of Cassandra’s partitioning strategy: equal keys are routed to the same node </li></ul></ul><ul><ul><li>Serialization within a node: locks on keys that are being processed right now </li></ul></ul>
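The per-node execution scheme described above (a task queue, worker threads, and locks on keys being processed) might look roughly like this. It is a single-node sketch under stated assumptions: the `Node` class, its method names, and the in-memory `store` dictionary standing in for Cassandra are all hypothetical, and routing of equal keys to the same node is taken as given.

```python
import queue
import threading
from collections import defaultdict


class Node:
    """Hypothetical node: buffers reduce* tasks in a queue and runs them on workers."""

    def __init__(self, num_workers=4):
        self.tasks = queue.Queue()
        self.store = {}  # key -> aggregate value (stands in for the database)
        self._key_locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()  # protects the lock table itself
        for _ in range(num_workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _lock_for(self, key):
        with self._locks_guard:
            return self._key_locks[key]

    def submit(self, key, value):
        self.tasks.put((key, value))

    def _run(self):
        while True:
            key, value = self.tasks.get()
            try:
                # Serialize the read-modify-write for this key only;
                # tasks for different keys still run in parallel.
                with self._lock_for(key):
                    current = self.store.get(key, 0)
                    self.store[key] = current + value
            finally:
                self.tasks.task_done()

    def drain(self):
        """Block until every queued task has been processed."""
        self.tasks.join()


node = Node()
for _ in range(1000):
    node.submit("u1", 1)
for _ in range(500):
    node.submit("u2", 2)
node.drain()
print(node.store)
```

Because the queue lives only in memory, any tasks still buffered when a node fails are lost, which matches the fault-tolerance trade-off the deck describes.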
- 13. Triggy Fault Tolerance <ul><li>No fault tolerance guarantees for execution of map/reduce* tasks: Intermediate data in queue can be lost </li></ul><ul><li>Not critical for analytical applications </li></ul>
- 14. Triggy Scalability <ul><li>Computation and data storage are tightly coupled: moving the data also moves the computation, which makes the system easy to scale </li></ul><ul><li>A new node is placed near the most loaded node, and part of the data is transferred to the new node </li></ul>
- 15. Experiments <ul><li>Generated workload: tweets with user ids (1 .. 100000) with uniform distribution </li></ul><ul><li>The load generator issues as many requests as the system with N nodes can handle </li></ul><ul><li>Application: count the number of words posted by each user </li></ul><ul><ul><li>Map: tweet => (user_id, number_of_words_in_tweet) </li></ul></ul><ul><ul><li>Reduce: (user_id, number_of_words_total, number_of_words_in_tweet) => (user_id, number_of_words_total) </li></ul></ul>
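The benchmark application can be written out as a small runnable sketch. The Map and Reduce signatures follow the slide; the tweet representation (a dict with `user_id` and `text`), the whitespace word split, and the in-memory `totals` dictionary standing in for the stored counters are assumptions.

```python
def map_tweet(tweet):
    """Map: tweet => (user_id, number_of_words_in_tweet)."""
    return tweet["user_id"], len(tweet["text"].split())


def reduce_star(user_id, number_of_words_total, number_of_words_in_tweet):
    """Reduce: fold the word count of one tweet into the user's running total."""
    return user_id, number_of_words_total + number_of_words_in_tweet


totals = {}  # stands in for the counters stored in the database


def on_tweet(tweet):
    """Push-style processing: run Map, then reduce*, as soon as a tweet arrives."""
    user_id, words = map_tweet(tweet)
    _, totals[user_id] = reduce_star(user_id, totals.get(user_id, 0), words)


tweets = [
    {"user_id": 1, "text": "real time web"},
    {"user_id": 2, "text": "hello"},
    {"user_id": 1, "text": "push style processing rocks"},
]
for tweet in tweets:
    on_tweet(tweet)

print(totals)  # {1: 7, 2: 1}
```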
- 16. Similar Systems: Yahoo!’s S4 <ul><li>Distributed stream processing engine: </li></ul><ul><ul><li>Programming interface: Processing Elements written in Java </li></ul></ul><ul><ul><li>Data routed between Processing Elements by key </li></ul></ul><ul><ul><li>No database. All processing is done in memory, keeping a window of the input data at each Processing Element. </li></ul></ul><ul><li>Used to estimate Click-Through-Rate using the user’s behavior within a time window </li></ul>
- 17. Similar Systems: Google’s Percolator <ul><li>Percolator is for incremental data processing: based on BigTable </li></ul><ul><li>BigTable - a distributed key-value store: </li></ul><ul><ul><li>the same data model as in Cassandra </li></ul></ul><ul><ul><li>the same log-structured storage </li></ul></ul><ul><ul><li>BigTable - a distributed system with a master; Cassandra - peer-to-peer </li></ul></ul><ul><li>Used at Google for incremental updates of the Web Search Index (instead of MapReduce) </li></ul>
- 18. Percolator’s Push-style Processing (Observers) <ul><li>Percolator extends BigTable with </li></ul><ul><ul><li>distributed ACID transactions: </li></ul></ul><ul><ul><ul><li>multi-version mechanism with snapshot isolation semantics </li></ul></ul></ul><ul><ul><ul><li>two-phase commit </li></ul></ul></ul><ul><ul><li>observers (similar to database triggers for push-style processing) </li></ul></ul><ul><li>Advantage: no data loss - for each inserted document, the associated observers will be executed </li></ul><ul><li>Overhead: distributed transactions and durable storage of intermediate data </li></ul>
- 19. Applications <ul><li>Tracking millions of parameters individually (browser cookies, URLs ...) </li></ul><ul><li>Incremental computation of analytical values allows real-time reaction to events </li></ul><ul><li>Monitoring without a time window, or with a window of any size, for each parameter </li></ul>
- 20. Real-Time Advertising (Short Overview) <ul><li>Real-Time bidding: </li></ul><ul><ul><li>Sites track your browsing behavior via cookies and sell it to advertising services </li></ul></ul><ul><ul><li>Web publishers offer up display inventory to advertising services </li></ul></ul><ul><ul><li>No fixed CPM, instead: each ad impression is sold to the highest bidder </li></ul></ul><ul><li>Retargeting (remarketing) </li></ul><ul><ul><li>Advertisers can do remarketing after the following events: (1) the user visited your site and left (assume the site is within the Google content network); (2) the user visited your site, added products to their shopping cart, then left; (3) the user went through the purchase process but stopped partway. </li></ul></ul>
- 21. Using Social Network Profiles to Enhance Advertising <ul><li>Watching for signs of purchase intent among Twitter users </li></ul>
- 22. Application: Social Media Optimization for News Sites <ul><li>A/B testing for headlines of news stories </li></ul><ul><li>Optimization of front page to attract more clicks </li></ul>
- 23. Real-Time News Recommendations <ul><li>TwitterTim.es: </li></ul><ul><ul><li>using social graph to recommend news </li></ul></ul><ul><ul><li>now - batch rebuilding every 2 hours </li></ul></ul><ul><ul><li>goal - a newspaper that updates in real time </li></ul></ul><ul><li>Google News: </li></ul><ul><ul><li>recommendation via collaborative filtering based on users' clicks </li></ul></ul><ul><ul><li>new stories and clicks are constantly arriving </li></ul></ul><ul><ul><li>now - batch processing using MapReduce </li></ul></ul>
- 24. Other Applications <ul><li>Recommendations based on location check-ins: Foursquare, Facebook Places... </li></ul><ul><li>Social Games: monitoring events from millions of users and reacting in real time </li></ul>
- 25. Questions?
