Analytics for the Real-Time Web


Published in: Technology, Business
  • The Web 2.0 era is characterized by the emergence of large amounts of user-generated content. People started to generate and contribute data on different Web services: blogs, social networks, Wikipedia.

    Today, with mobile devices constantly connected to the Internet, the nature of user-generated content has changed. People now contribute more often, with smaller posts, and the life span of these posts has become shorter.

    New Web services have appeared that encourage real-time usage:
    1) Twitter. The life span of a tweet is shorter than that of a blog post. The Twitter stream is almost real-time.
    2) Location-based social networks: Foursquare, Facebook Places. People share their current location (a check-in) at real venues. This data is time-sensitive: the user reveals their current location, and recommendations of nearby friends and other interesting places must be made immediately, while the user is still there.
  • So far, Web 2.0 data has been analyzed and put to use with batch-style processing: data produced over a certain period of time is accumulated and then processed. MapReduce has become the state-of-the-art approach for analytical batch processing of user-generated data.

    Today, Web 2.0 data has become more real-time, and this change implies new requirements for analytical systems. Processing data in batches is too slow for time-sensitive data: accumulated data can lose its importance within hours or even minutes. Therefore, analytical systems must aggregate values in real time, incrementally, as new data arrives. It follows that workloads are database-intensive, because aggregate values are not produced all at once, as in batch processing, but are stored in a database that is constantly being updated. For example, Google’s new web indexing system, Percolator, is no longer based on MapReduce. Percolator achieves lower document processing latencies by updating the web index incrementally (database-intensive).
  • We are working on a system that can process analytical tasks in real time for large amounts of data.

    Our system is based on Cassandra, a distributed key-value store. We add two extensions to Cassandra to turn it into a system for real-time analytics: push-style procedures and synchronization.

    Push-style procedures: these procedures act like triggers. You can set them on a table, and they fire when a new key-value record is inserted. They make the computation real-time, because they immediately propagate the inserted data to the analytical computations.

    Synchronization: Cassandra is a simple key-value store. It has no mechanism to update a value based on the existing value. For example, to maintain a counter, we first need to query the existing value and then insert the incremented value. Cassandra has no transactions, which means that another client can update the value between our query and our update. That leads to inconsistent counters. We add local synchronization to Cassandra, which synchronizes access to data within a node.

    Furthermore, our system provides a programming model similar to MapReduce, adapted to push-style processing, and is scalable in terms of both computation and data storage.
  • In a nutshell, the Cassandra data model can be described as follows:

    1) Cassandra is based on a key-value model. A database consists of column families. A column family is a set of key-value pairs. Drawing an analogy with relational databases, you can think of a column family as a table and of a key-value pair as a record in that table.

    2) Cassandra extends the basic key-value model with two levels of nesting. At the first level, the value of a record is in turn a sequence of key-value pairs. These nested key-value pairs are called columns, where the key is the name of the column. In other words, a record in a column family has a key and consists of columns. At the second level, the value of a nested key-value pair can be a sequence of key-value pairs as well. When the second level of nesting is present, the outer key-value pairs are called super columns, with the key being the name of the super column, and the inner key-value pairs are called columns.

    Let’s consider the classic example of a Twitter database to demonstrate these points.

    The column family Tweets contains records representing tweets. The key of a record is of type Time UUID and is generated when the tweet is received (we will use this feature in the User_Timelines column family below). The record consists of columns (no super columns here). Columns simply represent attributes of tweets, so this is very similar to how one would store tweets in a relational database.

    The next example is User_Timelines (i.e. the tweets posted by a user). Records are keyed by user IDs (referenced by the User_ID columns in the Tweets column family). User_Timelines demonstrates how column names can be used to store values, tweet IDs in this case. The type of the column names is defined as Time UUID, which means that tweet IDs are kept ordered by the time of posting. That is very useful, as we usually want to show the last N tweets of a user. The values of all columns are set to an empty byte array (denoted “-”) because they are not used.

    To demonstrate super columns, let us assume that we want to collect statistics about the URLs posted by each user. For that, we need to group all the tweets posted by a user by the URLs contained in the tweets. This can be stored using super columns as follows: in User_URLs, the names of the super columns store the URLs, and the names of the nested columns are the corresponding tweet IDs.
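The three column families described above can be pictured as nested dictionaries. The sketch below is only an illustration in plain Python, not Cassandra's API: the sample tweets, the user IDs, and the helper `last_n_tweets` are made up, and integer timestamps stand in for Time UUIDs.

```python
# Plain-Python picture of the three column families (illustrative only).
# Integer timestamps stand in for Time UUIDs; all sample values are made up.

# Tweets: record key -> columns (attribute name -> value)
tweets = {
    1001: {"User_ID": "alice", "Text": "Reading http://example.com/a"},
    1002: {"User_ID": "bob", "Text": "Hello"},
    1003: {"User_ID": "alice", "Text": "Again http://example.com/a"},
}

# User_Timelines: user ID -> columns whose *names* are tweet IDs.
# Values are empty byte arrays because only the names carry information.
user_timelines = {
    "alice": {1001: b"", 1003: b""},
    "bob": {1002: b""},
}

# User_URLs: user ID -> super columns (URL) -> columns (tweet ID -> empty)
user_urls = {
    "alice": {"http://example.com/a": {1001: b"", 1003: b""}},
}


def last_n_tweets(user_id, n):
    """Return the texts of the user's n most recent tweets, newest first."""
    # Cassandra keeps column names sorted; sorting here imitates that.
    ids = sorted(user_timelines.get(user_id, {}), reverse=True)[:n]
    return [tweets[i]["Text"] for i in ids]
```

Because the tweet IDs are time-ordered column names, "the last N tweets of a user" becomes a read of the last N columns of one record, with no scan over the Tweets column family.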
  • One of the key features of Cassandra is that it must scale incrementally. This requires a mechanism to dynamically partition the data over the set of nodes. Cassandra’s partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts.

    In consistent hashing, the output range of a hash function (normally MD5) is treated as a fixed circular space, or ring. That is, the largest hash value wraps around to the smallest hash value.

    Each node in the system is assigned a random value within this space, which represents its position on the ring. Each data item, identified by a key, is assigned to a node by hashing the item’s key to yield its position on the ring and then walking the ring clockwise to find the first node with a position larger than the item’s position. This node is deemed the coordinator for the key. Thus, each node becomes responsible for the region of the ring between itself and the previous node on the ring.

    The principal advantage of consistent hashing is that the departure or arrival of a node affects only its immediate neighbors; other nodes remain unaffected.

    The drawback of placing nodes at random positions is that it leads to non-uniform load and data distribution. That is why Cassandra analyzes load information on the ring and inserts new nodes next to highly loaded nodes, so that an overloaded node can transfer part of its data to the new node.
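The ring lookup described above can be sketched in a few lines. This is a minimal illustration, not Cassandra's partitioner: the `Ring` class and `ring_position` helper are hypothetical names, and node positions are derived by hashing the node name (an assumption made only to keep the example reproducible, where the notes describe random positions).

```python
import bisect
import hashlib


def ring_position(name):
    """Map a string to a position on the ring using MD5."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes=()):
        self._positions = []  # sorted node positions on the ring
        self._nodes = {}      # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place a node on the ring (position assumed to be hash of its name)."""
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._nodes[pos] = node

    def coordinator(self, key):
        """Walk clockwise from the key's position to the first larger node."""
        pos = ring_position(key)
        i = bisect.bisect_right(self._positions, pos)
        if i == len(self._positions):
            i = 0  # wrap around: largest hash value meets the smallest
        return self._nodes[self._positions[i]]
```

Adding a node to such a ring moves only the keys that fall between the new node and its predecessor; every other key keeps its coordinator, which is the property the notes single out as the main advantage.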
  • Cassandra is optimized for write-intensive workloads. That is a useful feature for us, because computing aggregate values for analytical tasks implies heavy updates to the system.

    Cassandra uses so-called log-structured storage, which was successfully used in BigTable. The idea is that write operations go to a buffer in main memory (the memtable). When the buffer is full, it is written to disk as an sstable, so the buffer is periodically flushed to disk. A separate thread merges the different versions of the sstable. This process is called compaction.

    A read operation looks up the value first in the memtable and then, if it is not found there, in the different versions of the sstable, starting from the most recent one.

    Such storage is highly optimized for writes at the cost of slower queries, a common trade-off in databases.
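The write, read, and compaction paths described above can be sketched as a toy store. This is an assumption-laden illustration, not Cassandra's storage engine: the class name `LogStructuredStore` is hypothetical, in-memory dictionaries stand in for on-disk sstables, and compaction is called explicitly instead of running in a background thread.

```python
class LogStructuredStore:
    """Toy log-structured store: a memtable plus a list of flushed tables."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # in-memory write buffer
        self.sstables = []   # flushed tables, oldest first (stand-in for disk)

    def write(self, key, value):
        # Writes only touch the in-memory buffer, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A full buffer is written out as a new immutable table.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # Check the memtable first, then tables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

    def compact(self):
        # Merge all tables; newer values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [merged] if merged else []
```

The sketch makes the trade-off visible: `write` never reads existing data, while `read` may have to inspect every flushed table until compaction merges them.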
  • MapReduce is a well-established programming model for expressing analytical applications. To support real-time analytical applications, we modify this programming model to support push-style data processing. In particular, we modify the reduce function. Originally, reduce combines a list of input values into a single aggregate value. Our modified function, reduce∗, incrementally applies a new input value to an already existing aggregate value. This modification makes it possible to apply a new input value to the aggregate value as soon as the new input value is produced. In other words, we are able to push new values to the reduce function.

    Figure 1 depicts our modified programming model. reduce∗ takes as parameters a key, a new value, and the existing aggregate value. It outputs a key-value pair with the same key and the new aggregate value. We did not modify the map function, as it already allows push-style processing. The difference between map and reduce∗ is that multiple maps can be executed in parallel for the same key, while the execution of reduce∗ has to be synchronized for the same key to guarantee correct results.

    Note that reduce∗ has some limitations in comparison to the original reduce. Not every reduce function can be converted to an incremental counterpart. For example, the previous median and a new value are not enough to compute the new median of a set of values. The complete set of values needs to be stored to compute the new median.
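The difference between reduce and reduce∗, and the median limitation, can be illustrated with a short sketch. The function names below are hypothetical and the code is plain Python, not the system's actual interface.

```python
import statistics


def reduce_sum(key, values):
    """Classic reduce: combine the complete list of values at once."""
    return key, sum(values)


def reduce_star_sum(key, new_value, aggregate):
    """reduce*: fold one new value into the existing aggregate."""
    if aggregate is None:  # no aggregate stored yet for this key
        aggregate = 0
    return key, aggregate + new_value


def run_incrementally(key, values):
    """Feed values one by one, as they would arrive in push-style processing."""
    aggregate = None
    for value in values:
        key, aggregate = reduce_star_sum(key, value, aggregate)
    return key, aggregate


# Why the median has no reduce* form: two data sets share the same median,
# receive the same new value, and still end up with different medians.
set_a = [1, 2, 9]
set_b = [1, 2, 3]
same_median_before = statistics.median(set_a) == statistics.median(set_b)
median_a_after = statistics.median(set_a + [10])
median_b_after = statistics.median(set_b + [10])
```

A sum works incrementally because the aggregate carries all the information needed for the next step; for the median, the aggregate alone does not, so the full set of values would have to be kept.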
  • To set up a map/reduce∗ job, the developer has to provide implementations of both functions and define the input table, from which the data is fed into map, and the output table, to which the output of reduce∗ is written.
  • Example: implementation of WordCountMapReducer
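The code shown on this slide is not part of the notes. As a stand-in, here is a minimal word-count sketch of what such a job could look like; the `Job` class, its `on_insert` hook, and the two function names are hypothetical and are not the system's real interface. A dictionary plays the role of the output table, and everything runs in a single thread.

```python
class Job:
    """Hypothetical job definition: a map function, a reduce* function,
    and an output table (a dict here)."""

    def __init__(self, map_fn, reduce_star_fn):
        self.map_fn = map_fn
        self.reduce_star_fn = reduce_star_fn
        self.output_table = {}

    def on_insert(self, key, value):
        """Called for every record written to the input table (push-style)."""
        for k2, v2 in self.map_fn(key, value):
            current = self.output_table.get(k2)  # existing aggregate, if any
            _, updated = self.reduce_star_fn(k2, v2, current)
            self.output_table[k2] = updated


def word_count_map(tweet_id, text):
    # Emit (word, 1) for each word in the tweet.
    for word in text.lower().split():
        yield word, 1


def word_count_reduce_star(word, new_value, aggregate):
    # Add the new count to the existing aggregate (0 if none exists yet).
    return word, (aggregate or 0) + new_value
```

Usage: create `Job(word_count_map, word_count_reduce_star)` and call `on_insert(tweet_id, text)` for each incoming tweet; the counts in `output_table` are up to date after every insert, with no batch step.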
  • The difference between map and reduce∗ is that multiple maps can be executed in parallel for the same key, while the execution of reduce∗ has to be synchronized for the same key to guarantee correct results.

    For that, we extended the nodes of the key-value store with queues and worker threads. Figure 2 shows our extensions. Each node maintains a queue that buffers map and reduce∗ tasks. Worker threads drain the queue and execute the buffered tasks. Buffering map and reduce∗ tasks makes it possible to handle bursts of input data. Furthermore, the size of the queue gives a rough estimate of the load of a node.

    How to execute map. As described, for each map the developer has to define an input table. Whenever a new key-value pair is written to this table, the node handling the write schedules a new map task by putting it into its local queue. Eventually, a worker thread executes the map task at this node. Map tasks can be executed in parallel at any node in the system and do not require synchronization, because they do not share any data.

    How to execute reduce∗. In contrast to map, the execution of reduce∗ needs to be synchronized, because several reduce∗ tasks can potentially update the same aggregate value in parallel, leading to inconsistent data. Cassandra does not provide any synchronization mechanisms. In our system, synchronization is realized in two steps: (1) routing all key-value pairs output by map with the same key to a single node, and (2) synchronizing the execution of reduce∗ within a node using locks. Routing is implemented by reusing Cassandra’s partitioning strategy (consistent hashing). That is, each key-value pair output by map is routed to the node that is primarily responsible for the respective key. At the receiving node, a new reduce∗ task is submitted to the queue. Multiple worker threads execute these reduce∗ tasks by reading and incrementing the latest aggregate value. Worker threads are synchronized such that only one worker at a time executes a reduce∗ task for a given key. For that, we use a lock table that contains the keys being processed by each worker. The output of the reduce∗ task is written to the table specified in the reduce definition. The table may be replicated to achieve reliability. By writing the result, the node might fire a subsequent map/reduce∗ task. The result of reduce∗ can be queried using the key-value store’s standard query interface.

    The figure shows the execution of map and reduce∗ inside our system. Two key-value pairs (k1, v1) and (k1, v2) are written to nodes N1 and N5 of the key-value store. These writes fire map tasks defined on the updated table. Therefore, receiver node N1 puts a map task for pair (k1, v1) into its queue (denoted by m in Figure 2). Similarly, node N5 puts a map task for pair (k1, v2) into its queue. The execution of the map tasks results in three intermediate key-value pairs. As determined by Cassandra’s partitioning strategy, the intermediate pair with key k2 is routed to node N2, while the pairs with key k3 are routed to node N3. Nodes N2 and N3 put reduce∗ tasks into their respective queues (denoted by r∗). As described, reduce∗ tasks are executed locally using locks. New aggregate values are computed and stored in the result table.
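Step (2), synchronizing reduce∗ within a node, can be sketched with a task queue, worker threads, and one lock per key. This is a simplified illustration under stated assumptions, not the system's code: the `Node` class and its methods are hypothetical, a dictionary of per-key locks stands in for the lock table, routing (step 1) is omitted, and all tasks are assumed to have already arrived at this node.

```python
import queue
import threading
from collections import defaultdict


class Node:
    """One node: a queue of reduce* tasks drained by worker threads."""

    def __init__(self, reduce_star, workers=4):
        self.reduce_star = reduce_star
        self.tasks = queue.Queue()           # buffers incoming reduce* tasks
        self.results = {}                    # stand-in for the output table
        self._key_locks = defaultdict(threading.Lock)
        self._table_guard = threading.Lock() # protects the lock table itself
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, key, value):
        self.tasks.put((key, value))

    def _lock_for(self, key):
        with self._table_guard:
            return self._key_locks[key]

    def _work(self):
        while True:
            key, value = self.tasks.get()
            try:
                # Only one worker at a time may run reduce* for a given key.
                with self._lock_for(key):
                    current = self.results.get(key)
                    _, updated = self.reduce_star(key, value, current)
                    self.results[key] = updated
            finally:
                self.tasks.task_done()

    def wait_idle(self):
        """Block until every submitted task has been executed."""
        self.tasks.join()
```

Holding the key's lock across the read, the reduce∗ call, and the write is what prevents two workers from reading the same old aggregate and overwriting each other's update. Tasks for different keys still run in parallel.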
  • Our implementation does not provide fault tolerance guarantees for the execution of map/reduce∗ tasks. If the node responsible for executing a map task fails while the task is still in the queue, the map task will never be executed. Also, our synchronization mechanism requires intermediate key-value pairs to be routed to a single node. These intermediate pairs might be lost in case of failures. Nevertheless, once a map/reduce∗ task has been executed successfully, the results are stored reliably at a number of replica nodes. Thus, only intermediate data can be lost.

    There are a number of reasons for this design decision. First, for many analytical applications losing intermediate data is not critical. For such applications it is more important to see a general trend than exact numbers. Second, only those map/reduce∗ tasks that are waiting in the queue at the moment a node fails can be lost. If there is no burst of input data, queues are usually empty. Therefore, intermediate data is lost only rarely. Third, the execution of map and reduce∗ tasks is distributed across all nodes of the system. Only a portion of the intermediate data will be lost if a single node fails.

    To provide stronger consistency guarantees in case of node failures, we would have to provide exactly-once semantics. Relatively lightweight methods that provide at-least-once semantics are not suitable, because repeated executions invalidate aggregate values. Providing exactly-once semantics requires additional storage and computation overhead and is argued to be too expensive and not easy to scale.

    Scalability. In our system, the execution of map and reduce∗ is distributed across the nodes according to the data partitioning strategy of the key-value store. This makes the system easy to scale, as execution and data storage are tightly coupled. By default, Cassandra provides a mechanism for scaling the data storage: any new node is placed next to the most loaded node of the system. Part of the data from the loaded node is transferred to the new node, thus spreading the load between the two nodes. We extended Cassandra’s load measurement formula to include execution load as well. As in the SEDA architecture, we use the length of the queue to measure execution load. It is a good criterion because it reflects any bottleneck at a node, such as CPU overload or network saturation.
  • Yahoo! recently open-sourced S4, a system that is close to ours.

    What are the differences?

    1) Triggy has a MapReduce programming model that many developers are familiar with. The programming model of S4 is more general.

    2) Our system is tightly coupled with the database, while S4 processes tasks in memory. Why we think a database-intensive solution is important:

    a) With Triggy, you don’t have to worry about the window. You can compute analytics on historical data within a window as well as without a window, or the window can have different sizes for different parameters. For example, when monitoring users’ browsing behavior via cookies for advertising, some users show enough interest in a certain ad within a short time period, while other users have to be monitored for much longer.

    b) Triggy is easily scalable. You don’t have to scale the computation separately from the database. A tightly coupled solution allows scaling the system with a single knob.
  • News sites use real-time analytics to optimize their pages and attract more readers.

    1) A/B testing for headlines of news stories. When a news story is first published on the site, it has two different headlines. For the first 5 minutes, one part of the readers gets one headline, while the other part gets the other headline. Then the headline that attracted more clicks during the first 5 minutes is chosen.

    2) Optimizing the news layout. The system analyzes clicks, likes and retweets to understand which news stories give rise to discussions in social media. It then puts the most discussed stories on the front page to attract even more readers.
  • The Twitter - a personalized news service: The Twitter uses your friend relationships on Twitter to recommend news to you.

    Currently, The Twitter newspapers are rebuilt every 2 hours (batch processing). It would be nice to have push-style processing, where a new news story arrives in the newspaper as soon as it is published on Twitter.
  • What is real-time bidding?

    Here's the basic gist:

    1) Sites across the web track your browsing behavior via cookies and sell basic data about you to ad service companies. For example, the Google Content Network covers 80% of internet users.

    2) Web publishers offer up display inventory to the RTB market through ad services. Rather than signing up for a fixed CPM, they sell each individual ad impression to the highest bidder, based on whom that individual ad is being served to. For example, a retailer agrees to run a display ad campaign for a shoe sale at $5 per 1,000 impressions. That retailer, however, can specify that it will pay $10 per 1,000 impressions for ads that include running shoes if it knows that the browser has previously visited the athletics section of its Web site.

    The real-time bidding auction takes place within milliseconds, while the page is loading. Advertisers have to run their algorithms to decide which ad to show, and at what price, during this time.

    Google retargeting (or remarketing):

    What is remarketing? A travel company has a site where it features holiday vacations. Users may come to this website, browse the offers and think about booking a trip, but decide that the deal is still not cheap enough. Then they continue to browse the web. If the travel company later decides to offer discounted deals to the Caribbean, it can target the users who already visited its site (interested users) via display ads that these users will see later on other sites.

    Advertisers can do remarketing after the following events: 1) the user visited your site and left (assuming the site is within the Google content network); 2) the user visited your site, added products to their shopping cart and then left; 3) the user went through the purchase process but stopped somewhere; etc.

    These events can be extended with information from social networks, for example. Suppose the system can track what the user is posting on Twitter and estimate their interest in different products that can be advertised later.

    You can then pay per click for these people as they search and browse the web (ads will be shown in search or in the content network). For retargeting, you need to aggregate information about a user in a database. A window approach is not applicable here, because there is no single time frame.
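The retailer example above can be turned into a small worked sketch: per-user information is aggregated in a store as events arrive, and the bid for a single impression is looked up from it at auction time. The names `record_visit`, `bid_for_impression`, and the in-memory `visited_sections` dictionary are hypothetical; only the $5 and $10 CPM figures and the athletics section come from the example.

```python
BASE_CPM = 5.0       # dollars per 1,000 impressions (from the example above)
TARGETED_CPM = 10.0  # dollars per 1,000 impressions for interested users

# Stand-in for the per-user aggregate kept in the database:
# cookie ID -> set of site sections the user has visited.
visited_sections = {}


def record_visit(cookie_id, section):
    """Incrementally update the user's profile when a page view arrives."""
    visited_sections.setdefault(cookie_id, set()).add(section)


def bid_for_impression(cookie_id):
    """Return the bid in dollars for one ad impression shown to this user."""
    sections = visited_sections.get(cookie_id, set())
    cpm = TARGETED_CPM if "athletics" in sections else BASE_CPM
    return cpm / 1000.0
```

Because the profile is updated on every event and kept without an expiry window, the lookup at auction time is a single read, which fits the millisecond budget of the auction.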
  • Analytics for the Real-Time Web

    1. 1. Analytics for the Real-Time Web Maria Grineva Systems @ ETH Zurich
    2. 2. Real-Time Web • Web 2.0 + mobile devices = Real-Time Web • People share what they do now, discuss breaking news on Twitter, share their current locations on Foursquare...
    3. 3. Analytics for the Real-Time Web: new requirements • Batch processing (MapReduce) is too slow • New requirements: • real-time processing: aggregate values incrementally, as new data arrives • data-base intensive: aggregate values are stored in a database constantly being updated
    4. 4. Our System: Triggy • Based on Cassandra, a distributed key-value store • Provides programming model similar to MapReduce, adapted to push-style processing • Extends Cassandra with • push-style procedures - to immediately propagate the data to computations; • synchronization - to ensure consistency of aggregate results (counters) • Easily scalable
    5. 5. Cassandra Overview Data Model • Data Model: key-value • Extends basic key-value with 2 levels of nesting • Super column - if the second level is present • Column family ~ table; key-value pair ~ record • Keys are stored ordered
    6. 6. Cassandra Overview Incremental Scalability • Incremental scalability requires mechanism to dynamically partition data over the nodes • Data partitioned by key using consistent hashing • Advantage of consistent hashing: departure or arrival of a node affects only its immediate neighbors, other nodes remain unaffected
    7. 7. Cassandra Overview Log-Structured Storage • Optimized for write-intensive workloads (log-structured storage)
    8. 8. Triggy Programming Model • Modified MapReduce to support push-style processing • Modified only reduce function: reduce* • reduce* incrementally applies a new input value to an already existing aggregate value Map(k1,v1) -> list(k2,v2) Reduce(k2, list (v2)) -> (k2, v3)
    9. 9. Triggy Programming Model
    10. 10. Triggy Synchronization • reduce* functions have to be synchronized for the same key to guarantee correct results • we make use of Cassandra’s partitioning strategy: all pairs with the same key are routed to the same node • synchronization within a node: locks on keys that are being processed right now
    11. 11. Triggy Fault Tolerance and Scalability • No fault tolerance guarantees • Intermediate data and data in queue can be lost • Triggy is easily scalable because the execution and data storage are tightly coupled • A new node is placed near the most loaded node, part of the data is transferred
    12. 12. Experiments • Generated workload: tweets with user ids (1 .. 100000) in uniform distribution • The load generator issues as many requests as the system with N nodes can handle • Application: count the number of words posted by each user Map: tweet => (user_id, number_of_words_in_tweet) Reduce: (user_id, number_of_words_total, number_of_words_in_tweet) => (user_id, number_of_words_total)
    13. 13. Similar Systems: Yahoo!’s S4 • Distributed stream processing engine: • Programming interface: Processing Elements written in Java • Data routed between Processing Elements by key • No database. All processing in memory • Used to estimate Click-Through-Rate using user’s behavior within a time window
    14. 14. Similar Systems: Google’s Percolator • Percolator is database-intensive: based on BigTable • BigTable: • the same data model as in Cassandra • the same log-structured storage • BigTable - a distributed system with a master; Cassandra - peer2peer • Percolator extends BigTable with • observers (similar to database triggers for push-style processing) • ACID transactions • Triggy vs. Percolator: • MapReduce programming model • No ACID transactions (intermediate data can be lost) - less overhead. (What is the real overhead of full transaction support?)
    15. 15. Application Social Media Optimization for news sites • A/B testing for headlines of news stories • Optimization of front page to attract more clicks
    16. 16. Application Real-Time News Recommendations • - new recommendations via Twitter’s friends graph • Now - rebuilt every 2 hours; goal - real-time updating newspaper
    17. 17. Application Real-Time Advertising • Real-Time bidding: • Sites track your browsing behavior via cookies and sell it to advertising services • Web publishers offer up display inventory to advertising services • No fixed CPM, instead: each ad impression is sold to the highest bidder • Retargeting (remarketing) • Advertisers can do remarketing after the following events: (1) the user visited your site and left (assume the site is within the Google content network); (2) the user visited your site and added products to their shopping cart then left; (3) the user went through the purchase process but stopped somewhere. • Potentially interesting to use information from social networks
    18. 18. Other Applications • Recommendations on location checkins: Foursquare, Facebook places... • Social Games: monitoring events from millions of users in real-time, react in real-time
    19. 19. What other applications?