The Web 2.0 era is characterized by the emergence of large amounts of user-generated content. People started to generate and contribute data on different Web services: blogs, social networks, Wikipedia. Today, with mobile devices constantly connected to the Internet, the nature of user-generated content has changed. People now contribute more often and with smaller posts, and the lifespan of these posts has become shorter. 1) For example, on Twitter people discuss and share breaking news. The lifespan of a tweet is shorter than that of a blog post. On Facebook, many conversations happen in real time. 2) On location-based social networks such as Foursquare and Facebook Places, people share their current location (or check in) at real venues. This data is time-sensitive: the user reveals their current location, and recommendations of nearby friends and other interesting places must be made immediately, while the user is still there.
So far, analyzing and making use of user-generated data has been accomplished using batch-style processing. Data produced over a certain period of time is accumulated and then processed. MapReduce has become the state-of-the-art approach for analytical batch processing of user-generated data. Today, Web 2.0 data has become more real-time, and this change implies new requirements for analytical systems. Processing data in batches is too slow for time-sensitive data: accumulated data can lose its importance within several hours or even minutes. Therefore, analytical systems must aggregate values in real time, incrementally, as new data arrives. It follows that workloads become database-intensive, because aggregate values are not produced all at once, as in batch processing, but are stored in a database that is constantly being updated. For example, Google's new web indexing system, Percolator, is no longer based on MapReduce. Instead, Google uses a database-intensive system: Percolator updates the web index incrementally as new documents are crawled.
We are working on a system that can process analytical tasks in real time for large amounts of data. Our system is based on Cassandra, a distributed key-value store. We add two extensions to Cassandra in order to turn it into a system for real-time analytics: push-style procedures and serialization. Push-style procedures: these procedures act like triggers. You set them on a table, and they fire when a new key-value record is inserted. They make the computation real-time, as they immediately propagate the inserted data to the analytical computations. Serialization: Cassandra is a simple key-value store and does not provide any support for transactions. An incremental update, however, means computing a new value from the existing value. So we need to extend Cassandra with a mechanism that provides serialized access to aggregate values, so that updating the same value from several threads works in a serialized, consistent way. We add local synchronization to Cassandra, which synchronizes access to a key-value record within a node. Our system provides a programming model similar to MapReduce, adapted to push-style processing, and is scalable in terms of both computation and data storage.
The Cassandra data model can be described as follows. 1) Cassandra is based on a key-value model. A database consists of column families. A column family is a set of key-value pairs. Compared to relational databases, you can think of a column family as a table and of a key-value pair as a record in a table. 2) Cassandra extends the basic key-value model with two levels of nesting. At the first level, the value of a record is in turn a sequence of key-value pairs. These nested key-value pairs are called columns, where the key is the name of the column. In other words, a record in a column family has a key and consists of columns. At the second level, the value of a nested key-value pair can be a sequence of key-value pairs as well. When the second level of nesting is present, the outer key-value pairs are called super columns, with the key being the name of the super column, and the inner key-value pairs are called columns. Let's consider the classical example of a Twitter database to demonstrate these points. The column family Tweets contains records representing tweets. The key of a record is of type Time UUID and is generated when the tweet is received (we will use this feature in the User_Timelines column family below). The record consists of columns (no super columns here). The columns simply represent attributes of tweets, so it is very similar to how one would store tweets in a relational database. The next example is User_Timelines (i.e. the tweets posted by a user). Records are keyed by user IDs (referenced by the User_ID columns in the Tweets column family). User_Timelines demonstrates how column names can be used to store values, tweet IDs in this case. The type of the column names is defined as Time UUID, which means that tweet IDs are kept ordered by time of posting. That is very useful, as we usually want to show the last N tweets of a user. The values of all columns are set to an empty byte array (denoted "-"), as they are not used.
To demonstrate super columns, let us assume that we want to collect statistics about the URLs posted by each user. For that, we need to group all the tweets posted by a user by the URLs contained in the tweets. This can be stored using super columns as follows: in User_URLs, the names of the super columns store the URLs, and the names of the nested columns are the corresponding tweet IDs.
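The three column families described above can be mimicked with nested dictionaries. This is only a rough in-memory sketch, not Cassandra's API: the function names post_tweet and last_tweets and the Text column are assumptions made for illustration, and Python's time-based uuid1 stands in for the Time UUID type.

```python
import uuid

# Column family Tweets: row key (time-based UUID) -> columns (name -> value).
tweets = {}
# Column family User_Timelines: user id -> columns whose *names* are tweet ids.
user_timelines = {}
# Super column family User_URLs: user id -> super column (URL) -> columns (tweet id -> empty).
user_urls = {}

EMPTY = b""  # unused column value, shown as "-" in the talk


def post_tweet(user_id, text, urls=()):
    """Store one tweet in all three (hypothetical, in-memory) column families."""
    tweet_id = uuid.uuid1()  # time-based UUID, stands in for Time UUID
    tweets[tweet_id] = {"User_ID": user_id, "Text": text}
    user_timelines.setdefault(user_id, {})[tweet_id] = EMPTY
    for url in urls:
        user_urls.setdefault(user_id, {}).setdefault(url, {})[tweet_id] = EMPTY
    return tweet_id


def last_tweets(user_id, n):
    """Return the ids of the last n tweets of a user, oldest first.

    Cassandra keeps columns sorted by name; this sketch sorts explicitly
    by the timestamp embedded in each UUID.
    """
    ids = sorted(user_timelines.get(user_id, {}), key=lambda u: u.time)
    return ids[-n:]
```

Storing the tweet ID in the column name rather than in the value is what keeps a timeline ordered without a separate index.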
One of the key features of Cassandra is that it can scale incrementally. This requires a mechanism to dynamically partition the data over the set of nodes. Cassandra's partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing, the output range of a hash function (normally MD5) is treated as a fixed circular space, or ring. By this I mean that the largest hash value wraps around to the smallest hash value. Each node in the system is assigned a random value within this space, which represents its position on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position. This node is the coordinator for this key. Thus, each node becomes responsible for the region of the ring between it and its predecessor on the ring. The principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbors, while other nodes remain unaffected. There is a problem with assigning node positions this way: the random position of each node on the ring leads to non-uniform load and data distribution. That's why Cassandra analyzes load information on the ring and inserts new nodes near highly loaded nodes, so that an overloaded node can transfer part of its data onto the new node.
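The ring lookup described above can be sketched in a few lines. This is a simplified illustration, not Cassandra's partitioner: the Ring class and its method names are assumptions, and node positions are derived from the node name unless an explicit position is passed in.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Interpret the MD5 digest of a key as an integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hashing ring, for illustration only."""

    def __init__(self):
        self._positions = []  # sorted node positions
        self._nodes = {}      # position -> node name

    def add_node(self, name, position=None):
        pos = ring_position(name) if position is None else position
        bisect.insort(self._positions, pos)
        self._nodes[pos] = name

    def coordinator(self, key):
        """Walk clockwise to the first node with a position larger than the key's."""
        idx = bisect.bisect_right(self._positions, ring_position(key))
        # Past the largest position, wrap around to the smallest one.
        return self._nodes[self._positions[idx % len(self._positions)]]
```

When a node is added, the only keys that change coordinator are those that now map to the new node; all other assignments stay as they were.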
Cassandra is optimized for write-intensive workloads. That is a useful feature for us, as computing aggregate values for analytical tasks implies heavy updates to the system. Cassandra uses so-called log-structured storage, which it inherits from BigTable. The idea is that write operations write to a buffer in main memory (the memtable). When the buffer is full, it is written to disk. As a result, the buffer is periodically flushed to disk, producing a sequence of sstables. A separate thread periodically merges the different versions of the sstable; this process is called compaction. A read operation looks up the value first in the memtable and then, if it was not found, in the different versions of the sstable, moving from the most recent version to the oldest. Such storage is heavily optimized for writes.
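The write and read paths just described can be illustrated with a toy in-memory store. This is a sketch under simplifying assumptions (sstables are plain dictionaries, nothing is written to disk, compaction is called explicitly rather than by a background thread); the class name LogStructuredStore and the memtable_limit parameter are made up for the example.

```python
class LogStructuredStore:
    """Toy log-structured store: one memtable plus a list of flushed sstables."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []  # oldest first; each entry stands in for an on-disk sstable

    def write(self, key, value):
        # Writes only touch the in-memory buffer.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A full buffer becomes a new sstable and a fresh buffer is started.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # Look in the memtable first, then in sstables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

    def compact(self):
        # Merge all sstables into one; newer values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [merged] if merged else []
```

Writes never read or modify existing sstables, which is why this layout favors write-heavy workloads; the cost is paid on reads, which may have to check several sstables until compaction merges them.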
MapReduce is a well-established programming model for expressing analytical applications. To support real-time analytical applications, we modify this programming model to support push-style data processing. In particular, we modify the reduce function. Originally, reduce combines a list of input values into a single aggregate value. Our modified function, reduce∗, incrementally applies a new input value to an already existing aggregate value. This modification allows us to apply a new input value to the aggregate value as soon as the new input value is produced. This means that we are able to push new values to the reduce function. Figure 1 depicts our modified programming model. reduce∗ takes as parameters a key, a new value, and the existing aggregate value. It outputs a key-value pair with the same key and the new aggregate value. We did not modify the map function, as it already allows push-style processing. Note that reduce∗ has some limitations in comparison to the original reduce: not every reduce function can be converted to its incremental counterpart. For example, the previous median and the new value are not enough to compute the new median of a set of values; the complete set of values needs to be stored.
Let's look at an example of how a developer would use our system. In order to set up a map/reduce∗ job, the developer has to provide implementations of both functions and define the input table, from which the data is fed into map, and the output table, to which the output of reduce∗ is written.
Example: implementation of WordCountMapReducer. Reduce takes (1) a word, (2) the value one, and (3) the current counter for the word.
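A word count in this programming model might look roughly as follows. The sketch is written in Python rather than against the system's actual interface, so the function names, the use of None for a missing aggregate, and the on_insert helper that plays the role of the push-style trigger are all assumptions.

```python
def word_count_map(key, text):
    """map: emit (word, 1) for every word of an inserted text record."""
    for word in text.lower().split():
        yield word, 1


def word_count_reduce(word, new_value, current_count):
    """reduce*: fold one new value into the existing aggregate for this word."""
    if current_count is None:  # no aggregate stored yet for this word
        current_count = 0
    return word, current_count + new_value


def on_insert(output_table, key, text):
    """Stand-in for the trigger: run map on the inserted record, then reduce* per pair.

    output_table is a plain dict here; in the system it would be the output table.
    """
    for word, one in word_count_map(key, text):
        out_key, new_count = word_count_reduce(word, one, output_table.get(word))
        output_table[out_key] = new_count
```

For example, after on_insert(counts, "t1", "to be or not to be") on an empty dict, counts holds 2 for "to" and "be" and 1 for "or" and "not".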
Let's discuss how Map and Reduce tasks are executed inside our system. We extended each Cassandra node by adding a queue and worker threads that execute the tasks buffered in the queue. Buffering tasks in a queue gives us two main benefits. First, it allows us to handle bursts of input data (which often happen in social networks). Second, we use the size of the queue as a rough estimate of the load of a node. Now let's consider how Map and Reduce tasks are executed across the nodes. Let's start with Map execution: whenever a new key-value pair is inserted into a table, the node handling the insert checks for a Map task specified for this table and puts it into its local queue. A worker thread will eventually execute the map task. Map tasks can be executed in parallel at any node in the system and do not require any serialization, because they do not share any data. For example, on this picture there are two inserts that arrive at nodes N1 and N5. The nodes insert the data, put the corresponding Map tasks into their queues and then execute them in parallel. The map task executed on N1 produces two key-value pairs, (k2,v3) and (k3,v4), and the map task executed on N5 produces one key-value pair, (k3,v5). Now let's talk about Reduce execution: in contrast to map, the execution of reduce needs to be serialized, because several reduce tasks can update the same aggregate value in parallel, which can lead to inconsistent counters. Cassandra does not provide any serialization mechanism, so we implemented one. It works in two steps: (1) We route all key-value pairs output by map with the same key to a single node. Routing is implemented by reusing Cassandra's partitioning strategy (i.e. consistent hashing). (2) Within the node, we serialize the execution of reduce for the same key using locks. For example, on this picture all pairs with key k3 are routed to node N3. For each pair that arrives at the node, a Reduce task is put into the local queue.
Worker threads are synchronized so that only one worker thread at a time executes reduce tasks for a given key. For that, we use a lock table that contains the keys currently being processed by each worker. By writing the new aggregate value into the database, a reduce task can fire a subsequent map/reduce task.
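The second step, serializing reduce per key inside one node, can be sketched with ordinary thread locks. This is an illustration, not the code that runs inside Cassandra: it keeps one lock per key instead of a table of keys in progress, and the class name KeySerializedReducer and the count_reduce function are assumptions.

```python
import threading
from collections import defaultdict


def count_reduce(key, new_value, current):
    """reduce*: add the new value to the existing counter (None means no value yet)."""
    return key, (current or 0) + new_value


class KeySerializedReducer:
    """Runs reduce* so that updates to the same key never overlap within this node."""

    def __init__(self, reduce_fn):
        self._reduce_fn = reduce_fn
        self._table = {}                      # aggregate values held by this node
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks[key]

    def submit(self, key, value):
        # Read, reduce and write back under the key's lock: a serialized read-modify-write.
        with self._lock_for(key):
            current = self._table.get(key)
            _, self._table[key] = self._reduce_fn(key, value, current)

    def get(self, key):
        return self._table.get(key)
```

Because all pairs with the same key are routed to one node first, a node-local lock is enough and no distributed transaction is needed. Tasks for different keys still run in parallel.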
Our implementation does not provide fault tolerance guarantees for the execution of map/reduce∗ tasks. If a node fails, we lose the map and reduce tasks stored in its local queue. Nevertheless, once a map/reduce∗ task has been executed successfully, the results are stored reliably in the database (at a number of replica nodes). Thus, only intermediate data can be lost. There is a reason for this design decision: for analytical applications, losing intermediate data is not critical. For such applications it is more important to see a general trend than exact numbers.
Scalability. In our system, computation and data storage are distributed across the nodes according to the Cassandra partitioning strategy. This means that by moving the data, you move the computation, which makes it easy to scale the system. By default, Cassandra provides a mechanism for scaling the data storage: any new node is placed near the most loaded node of the system. Cassandra estimates the load of a node by the size of the database stored on that node. We extended Cassandra's load measurement formula to include the computation load as well. We use the length of the queue to measure the computation load. It is a good criterion because it reflects any bottleneck at a node, such as CPU overload or network saturation.
This slide contains preliminary experimental results. In this experiment we generated a workload of tweets with user IDs drawn uniformly from a wide range. The load generator issues as many requests as the system can handle. We measure the throughput for various numbers of nodes in the system. Our application counts the number of words posted by each user. You can see the input and output of the map and reduce tasks. On the left picture you can see the throughput for just inserting tweets into Cassandra. The right picture shows the throughput with map and reduce tasks executed. So we can conclude that our system scales as well as Cassandra can scale (why Cassandra does not scale well is another story). The difference in throughput arises because our application performs twice as many writes.
Yahoo! recently open-sourced S4, a system that is close to ours. In S4, data is handled by Processing Elements implemented in Java. Data is routed between Processing Elements by key: each Processing Element gets all key-value pairs with some key and can output a number of key-value pairs, which are routed to the corresponding Processing Elements. There is no database: all processing is done in main memory by maintaining a window over the input data at each Processing Element. What are the differences? 1) Triggy has the MapReduce programming model that many developers are familiar with. The programming model of S4 is more general. 2) Our system is tightly coupled with the database, while S4 processes tasks in memory. Why do we think a database-intensive solution is important? a) With Triggy, you don't have to worry about the window. You can compute analytics over historical data within a window or without a window, and the window can have different sizes for different parameters. For example, when monitoring users' browsing behavior via cookies for advertising, some users show enough interest in a certain ad within a short time period, while other users you may want to monitor and wait for much longer. b) Triggy is easily scalable. You don't have to scale the computation separately from the database. A tightly coupled solution allows scaling the system with a single knob.
Percolator is designed for incremental data processing and is based on BigTable. Here is a quick introduction to BigTable. BigTable is a distributed key-value store. It has the same data model as Cassandra, and Cassandra also borrowed its log-structured storage from BigTable. The only difference is that BigTable is a system with a master, while Cassandra is a peer-to-peer system. Percolator's application at Google: within Google, Percolator is now used instead of MapReduce for incremental web index construction. Percolator processes web pages as they are crawled and updates the live Web search index.
Percolator extends BigTable with distributed ACID transactions and observers. 1) Distributed ACID transactions are implemented using a multi-version mechanism with snapshot isolation semantics. A two-phase commit protocol is used to build them on top of the distributed BigTable database. 2) Observers are similar to database triggers: an observer is a procedure that fires on a write to the database. 3) The picture shows how observers and transactions are used to implement push-style execution. Suppose there is a Table A and an observer O set on this table. For each observer there is a corresponding notification table, Notify O. For each write to Table A, a notification to execute the observer O is inserted into the notification table Notify O. The write and the notification are executed as a single transaction T1. Workers periodically scan the notification table and handle each notification as follows: the worker executes the corresponding observer and removes the notification. Observer execution and deletion of the notification are executed as a single transaction T2. Advantages: using distributed transactions and storing intermediate data in the database guarantees that no data will be lost. This means that for each inserted document the associated observers will be executed. In contrast, in our system tasks stored in a queue can be lost in case of failures. But compared with our system there is an overhead: Percolator requires expensive distributed transactions and durable storage of intermediate data.
We are trying to understand and classify the applications that can be built upon our system. To do so, we identify the essential features of our system and, at the same time, try to find the applications that would really benefit from these features. Our system is a good fit when you need to track millions of parameters individually. For example, Web applications need to process millions of users, browser cookies or URLs. Our system computes aggregate values incrementally, so the application can choose to get the latest updates from the system in push style, as soon as a value is updated, or it can query the value when it needs to. Our system is based on a database, so you can store as long a history of events for each parameter as you want. You don't need to define a time window for monitoring, or you can have windows of different sizes for each parameter. Now let me describe the applications that we already have in mind.
What is real-time bidding? Here's the basic gist: 1) Sites across the web track your browsing behavior via cookies and sell basic data about you to ad service companies. For example, the Google Content Network covers 80% of internet users. Every time a user comes to one of the sites of the Google Content Network, Google pushes a notification about this event to all of its ad partner companies. 2) The ad service companies have to decide which ad to show to this user and at what price. They bid for each impression. The real-time bidding auction takes place within milliseconds, while the page is loading. The decision can be based on cookie history. For example, a retailer agrees to run a display ad campaign for a shoe sale at $5 per 1,000 impressions. However, the retailer can specify that it will pay $10 per 1,000 impressions for ads that include running shoes if it knows that the browser has previously visited a sports Web site. 3) As a result, Web publishers sell each individual ad impression to the highest bidder and get more money and better ad quality. Advertisers get the chance to pay more to show an ad to an important customer. And users get better ad quality. Google retargeting (or remarketing). What is remarketing? A travel company has a site where it features holiday vacations. Users may come to this website, browse the offers and think about booking a trip, but decide that the deal is still not cheap enough. Then they continue to browse the web. If the travel company later decides to offer discounted deals for this trip, it can target the users who already visited its site (interested users) with display ads that these users will see later on other sites. Advertisers can do remarketing after the following events: 1) the user visited your site and left (assume the site is within the Google Content Network); 2) the user visited your site, added products to their shopping cart and then left; 3) the user went through the purchase process but stopped somewhere; etc.
These events can be extended with information from social networks, for example. Suppose the system can track what the user is posting on Twitter and estimate their interest in different products that can be advertised later. You can then pay per click for these people as they search and browse the web (ads will be shown in search or on the content network). For retargeting you need to aggregate information about a user in a database. The window approach is not applicable here, because there is no single time frame.
Today most advertising services use data about clicks and Web browsing. Users' social network profiles could be a good additional source of information to improve the matching of ads. Together with master's student Michael Haspra, we are working on a project where we try to provide real-time ad matching for Twitter users. The system is built upon Triggy. We use the Twitter API to receive new tweets from Twitter users as soon as they post something. As advertisements, we used product descriptions from Amazon.com. We extracted keywords from the description of each product. With our system, we monitor users' Twitter profiles and watch for mentions of product-related keywords. For each Twitter user, we monitor their tweets as well as the tweets of their friends. Several studies have shown that mentions from friends matter a lot when making a decision to buy a product. So we wait until the user has mentioned some of the product-related keywords and, possibly, their friends have mentioned them too. With the parameter theta, we try to estimate the readiness of the user to make a purchase. As soon as the total number of mentions by the user and their friends exceeds theta, we can approach the user with an advertisement.
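The threshold rule can be sketched as a small incremental counter. This is not the project's implementation: the class name PurchaseIntentMonitor, the single product with one keyword set, the whitespace tokenization and the friends mapping are all assumptions made to keep the example short.

```python
from collections import defaultdict


class PurchaseIntentMonitor:
    """Counts keyword mentions per user, including mentions by the user's friends."""

    def __init__(self, theta, keywords, friends_of):
        self.theta = theta
        self.keywords = {k.lower() for k in keywords}
        self.friends_of = friends_of     # user -> set of that user's friends
        self.mentions = defaultdict(int)
        self.approached = set()          # users who have already been shown the ad

    def on_tweet(self, author, text):
        """Process one incoming tweet; return users whose count just exceeded theta."""
        hits = sum(1 for word in text.lower().split() if word in self.keywords)
        if hits == 0:
            return []
        # The tweet counts for its author and for every user who has the author as a friend.
        affected = [author] + [u for u, fs in self.friends_of.items() if author in fs]
        ready = []
        for user in affected:
            self.mentions[user] += hits
            if self.mentions[user] > self.theta and user not in self.approached:
                self.approached.add(user)
                ready.append(user)
        return ready
```

Each tweet updates only the counters it affects, so the decision to approach a user can be made at the moment the count crosses theta rather than at the end of a batch.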
News sites can use real-time analytics to optimize their pages and attract more readers. 1) A/B testing for choosing the headline of a news story. When the story is first published on the site, there are two different headlines for it. For the first 5 minutes, one part of the readers gets one headline, while another part gets the other headline. Then the headline that attracted more clicks during the first 5 minutes is chosen. 2) Optimizing the news layout. The system analyzes clicks, likes and retweets to understand which news stories raise discussions in social media. Then the site puts the most discussed stories on the front page to attract even more readers.
Another example is news recommendation services, such as The Twitter Tim.es, a personalized news service: http://twittertim.es. The Twitter Tim.es uses your friend relationships on Twitter to recommend news to you. Currently, The Twitter Tim.es newspapers are rebuilt every 2 hours (batch processing). It would be nice to have push-style processing, where a news story appears in the newspaper as soon as it is published on Twitter. Another example is Google News. They gather information about which news stories the user clicked and then build personalization based on these clicks using a collaborative filtering approach. News stories change every few minutes, and new information about user clicks is constantly coming in. Currently, they process this data in batches every 2 or 3 hours using MapReduce.
Generated workload: tweets with user ids (1 .. 100000) with uniform distribution
The load generator issues as many requests as the system with N nodes can handle
Application: count the number of words posted by each user
Map: tweet => ( user_id , number_of_words_in_tweet )
Reduce: ( user_id , number_of_words_total , number_of_words_in_tweet ) => ( user_id , number_of_words_total )
Sites track your browsing behavior via cookies and sell it to advertising services
Web publishers offer up display inventory to advertising services
No fixed CPM, instead: each ad impression is sold to the highest bidder
Advertisers can do remarketing after the following events: (1) the user visited your site and left (assume the site is within the Google Content Network); (2) the user visited your site, added products to their shopping cart and then left; (3) the user went through the purchase process but stopped somewhere.
Using Social Network Profiles to Enhance Advertising
Watching for purchase readiness among Twitter users
Application: Social Media Optimization for news sites