Recommender System at Scale Using HBase and Hadoop



Recommender Systems play a crucial role in a variety of businesses in today's world. From e-commerce web sites to news portals, companies are leveraging data about their users to create a personalized user experience, gain competitive advantage and eventually drive revenue. Dealing with the sheer quantity of data readily available can be a daunting task by itself, and applying machine learning algorithms on top of it makes the problem considerably more complex. Fortunately, tools like Hadoop and HBase make this task a little more manageable by taking out some of the complexities of dealing with a large amount of data. In this talk, we will share our success story of building a recommender system for our website by leveraging the Hadoop ecosystem. We will describe the high-level architecture of the system and discuss the pros and cons of our design choices. The site operates at a scale of 100s of millions of users. Building a recommendation engine for it entails applying Machine Learning algorithms to terabytes of data while still being able to serve sub-second responses. We will discuss techniques for efficiently and reliably collecting data in near real time, the notion of offline vs. online processing and, most importantly, how HBase fits the bill by serving both as a real-time database and as input/output for MapReduce jobs.

Published in: Technology
1 Comment
  • After a first read-through: it focuses on what was done. It doesn't say where the difficulties were?
  • Hi everyone. I am Dhaval Shah. I work in the R&D Department at Bloomberg and spend the majority of my time building Recommender and Analytics systems for Bloomberg's website. For those who are not aware, Bloomberg, as a company, provides best-in-class financial data, analytics and news services. The Bloomberg Terminal is a paid product for financial professionals, providing them with world-class tools that help them stay ahead of the competition. The website provides some of that information free of cost to the common user on the internet. It is one of the top 10 most visited news and data websites in the world and one of the most influential. For the next 30 minutes or so, I will be sharing my experience of building Recommender Systems for the site using the Hadoop ecosystem, mainly Hadoop, HBase and Flume.
  • Just a peek into today’s agenda. We will start off with a short introduction on Recommender Systems, discuss the different types of Recommender Systems and then dive right into the different pieces that together make a Recommender System functional. During this technical deep dive, we will see how the different parts of the Hadoop ecosystem simplify building a Recommender System. And finally, I hope to have lots of questions from you guys.
  • So what is a Recommender System? Anyone from the audience want to help me out on answering this one? Right. So here is a rather complex definition from Wikipedia. Here is how I like to put it. A user interacting with a web site is indirectly telling us something about his or her interests by virtue of what he reads, clicks or shares. We can use this data to understand the user’s interests and better serve the user. A Recommender System is just a fancy name for a system which does this.
  • So where are Recommender Systems used? Can someone from the audience give me some examples? Yup. The short answer is almost everywhere, from E-Commerce to Online Radio to Media.
  • For example, this is how Amazon uses it to entice users.
  • Here is how IMDB uses Recommenders.
  • Pandora’s business is mainly driven by intelligent use of such systems.
  • And this is a rather incomplete collection of websites which effectively use Recommender Systems to derive business value.
  • On our site, you can see these modules called “Recommended” pretty much everywhere: on news pages, video pages, the homepage, etc.
  • Some of you might be wondering: why do we need a Recommender System at all? From my own experience and from talking to practitioners in the field, the single most important reason is that there is too much useful data out there for any human to make reasonable use of without an automated system. Let's look at some stats to answer this question in the context of our site. The site publishes 500-1000 stories and 100-200 videos per day on average. The numbers vary slightly based on the news cycle. It's obvious that no user has the time to read everything published on our site and on the millions of other news sites on the internet. In fact, the average consumption per user is in single digits, which is far less than the number of articles published. There are two main factors that influence whether a user will read an article: content quality and user preference. The editorial team at Bloomberg does an excellent job of producing high-quality content. However, when you have 100s of millions of unique visitors a year on your site, no matter how good the editorial staff is, it's not humanly possible to manually curate the website and still be relevant to the entire user base. Humans are lazy by nature and would prefer doing as little work as possible in searching or browsing to find relevant content. That's where modeling user preference and serving relevant content becomes extremely important to the business. The effect is not immediate, but it helps in slowly gaining customer loyalty since users do not need to spend their valuable time searching for relevant content. And most importantly, based on A/B tests, we have seen 20-30% increases in click-through rates on certain modules when we put in recommendations.
  • Recommender systems can be broadly classified into two types: content based and collaborative filter based. Content based systems typically try to model a user's preferences in terms of features or characteristics of the content, solely based on that user's activities and independent of the activities of other users. A simplistic representation of my preferences, for example, would be that I like technology stories. These kinds of systems try to recommend articles similar to the ones the user has viewed in the past. They rely on the assumption that we can extract features which faithfully represent each article. For stories, there are many NLP based algorithms which help us achieve this easily. However, for videos, this is still relatively difficult. So these kinds of systems tend to work well for stories, but the performance for videos is still an open question and requires substantial effort. On the other hand, collaborative filter based approaches rely on finding users with similar interests, and thus user models are dependent on the activities of other users. These kinds of systems try recommending articles which users similar to the user in question have viewed. They do not rely on features of the article itself and thus can be easily applied to non-text media types like videos and company quotes. However, since every user's model is dependent on every other user's activities, these kinds of algorithms are much more computationally expensive and require massive processing power. Collaborative filters can be further classified into user based collaborative filters and item (or resource) based collaborative filters. For user based collaborative filters, you find a set of similar users based on history and then use the activities of those similar users to serve recommendations. Item based collaborative filters can also be described as “People who viewed X also viewed Y”, as you can prominently see on some sites. Then, of course, you can mix and match various flavors of content based and collaborative filter based algorithms and create hybrid recommendation systems. We have various flavors of all of these kinds of Recommender Systems running live in production at the moment.
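To make “People who viewed X also viewed Y” concrete, here is a minimal sketch of an item based collaborative filter built on co-view counts. It is only an illustration: the user ids, item ids and function names are hypothetical, and the talk does not say which similarity measure the production system uses.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_view_counts(histories):
    """Count, for every pair of items, how many users viewed both."""
    counts = defaultdict(Counter)
    for items in histories.values():
        # De-duplicate so a user who viewed an item twice counts once.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_viewed(counts, item, k=3):
    """Return up to k items most often co-viewed with `item`."""
    return [other for other, _ in counts[item].most_common(k)]


# Hypothetical view histories: user id -> items viewed.
histories = {
    "u1": ["story-a", "story-b", "video-c"],
    "u2": ["story-a", "story-b"],
    "u3": ["story-b", "video-c"],
    "u4": ["story-a", "story-b", "story-d"],
}
counts = co_view_counts(histories)
recommended = also_viewed(counts, "story-a", k=2)
```

Note that nothing here looks at the content of an item, which is why the same approach carries over to videos and other non-text media.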
  • Now that we have discussed the what, where and why of Recommender Systems, let's get to the fun part: how to build a Recommender System? At a high level, you can break it up into 6 steps. First, collect and/or generate metadata about stories and videos; this step is entirely independent of user interaction data. Second, identify and track users. Third, once we can identify our users, track their activity on the site. Fourth, organize and store this activity. Fifth, use some Machine Learning to generate user preference models. And finally, use the models we just created to serve recommendations. For the rest of the presentation, we will be digging deeper into each of these pieces.
  • Our recommender system lives on infrastructure separate from the main website infrastructure. So the first step is collecting relevant metadata about our articles from the main system. This includes details like URLs, headlines and so on. We use a combination of Sqoop and some custom scripts running as cron jobs to gather this data. For those who haven't heard of Sqoop, it is a tool that lets you transfer data between an RDBMS on one side and Hadoop or HBase on the other. For content based recommendation models, we need to extract features from our stories. We use the LDA implementation from Mahout to help us achieve this. For those who haven't heard of Mahout, it's a Machine Learning library, and many of its algorithms run on top of Hadoop. However, Mahout's LDA is designed to run on the entire batch of stories and takes a really long time to complete, whereas on a news website like ours, new stories are published every few minutes and the lifetime of a story is short. Hence, the batch LDA alone isn't going to serve the purpose. So we built extensions on top of Mahout's LDA which allow us to evaluate new documents without going through the entire training process. For new documents, this process now completes in a couple of minutes instead of the hours or days required for full-fledged training.
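The idea of evaluating a new document without retraining can be sketched as follows. This is not Mahout's API or the extension described in the talk. It is a simplified EM "fold-in" that assumes the per-topic word probabilities from an earlier batch run are held fixed and estimates only the new document's topic mixture. The `topic_word` table, the smoothing constant and all names are hypothetical.

```python
def infer_topics(doc_words, topic_word, iterations=50, unseen=1e-9):
    """Estimate a document's topic mixture with the topic-word table held fixed."""
    n_topics = len(topic_word)
    theta = [1.0 / n_topics] * n_topics  # start from a uniform mixture
    # Ignore words the trained model has never seen.
    words = [w for w in doc_words if any(w in t for t in topic_word)]
    if not words:
        return theta
    for _ in range(iterations):
        totals = [0.0] * n_topics
        for w in words:
            # E-step: how responsible is each topic for this word?
            weights = [theta[k] * topic_word[k].get(w, unseen) for k in range(n_topics)]
            norm = sum(weights)
            for k in range(n_topics):
                totals[k] += weights[k] / norm
        # M-step: the new mixture is the average responsibility per topic.
        theta = [t / len(words) for t in totals]
    return theta


# Hypothetical two-topic model produced by an earlier batch training run.
topic_word = [
    {"stocks": 0.5, "bonds": 0.5},      # topic 0: markets
    {"iphone": 0.5, "software": 0.5},   # topic 1: technology
]
theta = infer_topics(["iphone", "software", "iphone", "stocks"], topic_word)
```

Because only the document-level mixture is re-estimated, the cost grows with the length of the new document rather than with the size of the whole corpus.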
  • On the user side, the fundamental requirement is to identify and track users. We have two types of users: registered and anonymous. For registered users who are logged in, this task is easy and accurate. However, these are a very small percentage of the actual audience. How small? For every registered user on the site, we have more than 1,000 anonymous users. This underscores the necessity of tracking anonymous users as well. To track anonymous users, we can use cookie based or IP based tracking. I will not go into details at this point since this is a fairly standard problem across the industry and the trade-offs between the various solutions are well known.
  • Next up, we need to collect data about their actions. User interactions can be categorized as explicit or implicit. Explicit interactions include actions like Facebook Likes, LinkedIn shares, tweets on Twitter and so on. In general this is high-quality data, but it is difficult to collect because of the extra step required from the user. On the other hand, just viewing a story is an implicit interaction. The quality of this data can be slightly lower, but it is much easier to collect and you can get a lot more of it. From a Machine Learning standpoint, getting as much data as possible is crucial, and thus implicit data plays a very important role. For our site, we use a combination of both implicit and explicit data, but it is the implicit data that gives us enough information for building such a system to make sense at all.
  • Due to the use of CDNs and caching, tracking of user activity cannot be done at the application servers directly. We use JavaScript to get this data from the client browser. The browser's tracking request hits an HTTP server which logs the data to a file and returns a dummy response. This ensures fast responses and does not hold up a client connection. More importantly, it allows us to handle heavy load and peak traffic periods gracefully, independent of the state of the backend. We use a multi-tier Flume architecture to transfer this data to its final resting place, which is HBase. There is a Flume process monitoring the file which the HTTP server writes to, and as soon as new data is written, it cleans the data up and transfers it to HBase. Though this process happens asynchronously with respect to the client, the data reaches HBase in a matter of milliseconds. For DR and failover purposes, we have multiple HBase clusters in multiple data centers. We use Flume to write this user activity data to all of our clusters at the same time, which, by the way, is really easy to set up with Flume. Flume provides a certain level of reliability guarantee, which helps us avoid data loss. Flume also has plugins for Hadoop and HBase. However, the HBase plugin for Flume lacks certain features and does not handle failures gracefully, so we had to write our own, which isn't a terribly difficult thing to do. We have written some custom decorators to parse the HTTP server's log and store it in a usable format in HBase. We have also built some bot filtering mechanisms into our decorators.
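The parsing and bot filtering step can be sketched as below. This is plain Python rather than Flume's decorator API, and it only illustrates the kind of work described above. The tab-separated log layout, the `uid` and `item` query parameters and the user-agent markers are all assumptions made for the example.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical heuristic: user-agent substrings treated as bots.
BOT_MARKERS = ("bot", "crawler", "spider")


def parse_tracking_line(line):
    """Turn one tab-separated log line into an event dict, or None if unusable."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None  # malformed line
    timestamp, request_url, user_agent = parts
    if not timestamp.isdigit():
        return None
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return None  # drop traffic that looks automated
    params = parse_qs(urlsplit(request_url).query)
    if "uid" not in params or "item" not in params:
        return None
    return {"ts": int(timestamp), "user": params["uid"][0], "item": params["item"][0]}


human = parse_tracking_line("1370000000\t/t.gif?uid=u42&item=story-9\tMozilla/5.0")
bot = parse_tracking_line("1370000001\t/t.gif?uid=u7&item=story-9\tGooglebot/2.1")
```

Returning `None` for anything unusable keeps the cleaning step cheap and lets the downstream writer simply skip those events.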
  • Here are some key features of our tracking infrastructure. The site gets 10s of 1000s of page views per minute. Though we don't track visits to all pages yet, the amount of data we track is still substantial. We already discussed that the client gets back an instantaneous response since the HTTP server logs to a file and returns a response. Flume's reliability guarantees enable us to ensure that we don't lose data even if the backend goes down or is unresponsive for a short period of time. We have multiple tracking servers writing to multiple HBase clusters, all spread across multiple data centers. This sounds like a complex setup, but Flume's capabilities and proper modularization make it really simple to set up and maintain. And most importantly, all of this happens in a matter of milliseconds, which makes the system look as live as a synchronous mechanism would.
  • As I mentioned earlier, we use HBase as our backend database to store all our data for this system, including user activity data, user models and article metadata. HBase provides us with the right mix of features, scalability and reliability to suit our needs for this system. Here are some important reasons why we decided to go with HBase. HBase is horizontally scalable, which allows us to store and process terabytes of data at a reasonable cost. HBase is designed for fault tolerance and automatic recovery from failures. I think this is really important when you scale horizontally because with more machines, the probability of a server going down increases. HBase manages all the headaches of sharding data and automatically managing the shards as data keeps growing. HBase is schema-less and sparse. It doesn't require you to define the entire schema beforehand and allows you to add columns on the fly, with different rows having different columns altogether. This feature greatly simplifies schema design. For example, we can now have each user represented by a row and add a column every time a user views a story or a video, which provides a nice natural grouping of data. This is particularly suitable for running MR jobs efficiently on this data. HBase allows you to perform real-time queries on this vast amount of data and still provide millisecond responses. And it has a unique feature: it allows MR jobs to run efficiently on the same data used for serving real-time responses. This is probably the most important feature for a Recommender System. It takes away all the complexities of managing separate data stores for batch processing and online queries, greatly simplifying the entire app. Now, by doing this, you do run the risk that your resource-intensive batch processes might affect your real-time responses, which is why many people in the community would recommend not going down this route. However, if properly configured, this does not turn out to be a problem in real life. You just need the right level of resource isolation and the right config parameters set. The fact that our site can serve recommendations within 50ms even when multiple MR jobs are running in the background is testimony that HBase does support these kinds of architectures.
  • Here are some stats on our current usage of HBase. We currently have data for 100s of millions of users in our HBase tables. Each of those 100s of millions of users could have interacted with any of the millions of stories or videos published on the site. This is where the sparse nature of HBase comes in handy. This sums up to terabytes of data across multiple HBase tables. We use wide tables with the notion of 1 row per user, which greatly simplifies our app. Especially with the way MR jobs scan HBase tables, 1 row per user fits naturally into the paradigm, with each call to map getting all details about a single user, which wouldn't be possible if we went with a tall, narrow table. Our Recommender System serves a high amount of traffic and is capable of handling a lot more. We do many HBase queries and a good deal of processing per request and still manage to serve recommendations within 50ms on average. And all of this happens while multiple MR jobs are running on the same HBase tables in the background, reading and writing massive amounts of data. We will get into the details of this in the coming slides.
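The wide-table layout can be pictured with a small in-memory model. This sketch uses plain Python dictionaries, not the HBase client API, and the row keys, column qualifiers and timestamps are hypothetical. It shows why one row per user hands a map call everything known about that user, and why sparseness matters: a row only carries columns for items the user actually touched.

```python
from collections import defaultdict

# row key (user id) -> {column qualifier (item id): cell value (last view timestamp)}
activity = defaultdict(dict)


def record_view(table, user_id, item_id, ts):
    """Add (or overwrite) one column in the user's row; no schema change needed."""
    table[user_id][item_id] = ts


def scan(table):
    """Yield whole rows in row-key order, the way a table scan feeds a map call."""
    for row_key in sorted(table):
        yield row_key, table[row_key]


record_view(activity, "u2", "story-7", 1370000300)
record_view(activity, "u1", "story-7", 1370000100)
record_view(activity, "u1", "video-3", 1370000200)

rows = list(scan(activity))
```

In a tall, narrow layout each (user, item) pair would be its own row, so a single map call would see one interaction instead of the user's full history.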
  • So, now we have all our raw user interaction data and article metadata in our HBase tables. The next step is to train user models using this as training data and store the results back into HBase. We are talking about running Machine Learning algorithms on terabytes of data for 100s of millions of users. This involves a massive amount of IO and processing power. This is where technologies like Hadoop and HBase shine. I wouldn't even try doing this on any traditional RDBMS based system. For a news website like ours, timeliness is really important. A news article which is very important at this moment may not hold any importance at all a few hours later. For this reason, we have to train our models multiple times every hour, which entails an even greater need for IO and processing power. However, the criticality of this requirement depends on the algorithm, which we will discuss in a minute. A few slides back, we categorized recommendation algorithms into content based and collaborative filter based. Let's talk about user model training for these separately since they have slightly different requirements.
  • As discussed previously, content based systems typically try to model a user's preferences in terms of features of the content. Moreover, these are generally based solely on that user's activities and independent of the activities of other users. This means that each user's model can be trained independently of the interactions of other users and only changes when that user has a new interaction, assuming the article features remain constant. This problem is very easy and natural to parallelize. We just split the users into some number of buckets and assign each bucket to a mapper. We can write the trained models back to HBase from the mapper directly, completely eliminating the need for a reducer and the sort and shuffle phase involved. Moreover, since this training happens incrementally when a user reads a new article, we can run it every few minutes: the total amount of effort will be almost the same, and running it more frequently gives us fresher models. For returning users with a substantial history, the models remain fairly constant and frequent retraining might not gain us much. For relatively new users, however, this is a huge deal since we now have a model for the user rather than none. On our site, we run this training every 5 minutes and train about 5000 user models on every run, which comes out to about 1000 user models a minute on average.
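The map-only, incremental pattern can be sketched as follows. This is plain Python standing in for a mapper, not Hadoop code. The exponential-moving-average update rule is an assumption chosen for brevity, since the talk does not specify the model. The bucket count, learning rate and all names are hypothetical.

```python
import zlib


def bucket_of(user_id, n_buckets):
    """Stable assignment of a user to a bucket (one bucket per mapper)."""
    return zlib.crc32(user_id.encode("utf-8")) % n_buckets


def update_model(model, article_features, rate=0.2):
    """Blend one article's features into the user's preference vector."""
    topics = set(model) | set(article_features)
    return {
        t: (1 - rate) * model.get(t, 0.0) + rate * article_features.get(t, 0.0)
        for t in topics
    }


def map_bucket(bucket, models, features):
    """Map-only step: every user is updated independently, so no reducer is needed."""
    updated = {}
    for user_id, new_items in bucket:
        model = models.get(user_id, {})
        for item in new_items:
            if item in features:
                model = update_model(model, features[item])
        updated[user_id] = model  # in the real system this write would go to the store
    return updated


features = {"s1": {"tech": 1.0}, "s2": {"markets": 1.0}}
models = {"u2": {"tech": 0.5}}
updated = map_bucket([("u1", ["s1"]), ("u2", ["s2"])], models, features)
```

Only users with new interactions appear in a bucket, which is what keeps a run every few minutes cheap.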
  • In contrast to content based approaches, collaborative filter based approaches rely on finding users with similar interests, and thus a user's model is dependent on the activities of other users. This means that every time any user views an article, all user models are potentially outdated. This necessitates training user models for potentially all users every few minutes, which, as you can imagine, is computationally very expensive. At a high level, this algorithm requires you to compare each user with every other user to find similar users, which would probably take forever to complete. Thankfully, for a news website like ours, even though old history is useful for building user models, only the latest data is necessary to serve recommendations, which simplifies the problem a little bit. We use a map side join mechanism to realize this self join. Again, there is no reducer, which saves time on the shuffle and sort phase. The training has to happen in a batch for all users, and we train 10s of millions of user models multiple times an hour.
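A map side self join can be pictured like this: each mapper keeps the small table of recent activity in memory and joins every user row it receives against it, so no shuffle is required. The sketch below is plain Python, not Hadoop code. Jaccard similarity, the neighbour count `k` and the data are assumptions made for the example, not details from the talk.

```python
def jaccard(a, b):
    """Overlap between two sets of viewed items, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_user(user_id, viewed, recent, k=2):
    """Join one user row against the in-memory `recent` table and rank unseen items."""
    neighbours = sorted(
        ((jaccard(viewed, other), uid) for uid, other in recent.items() if uid != user_id),
        reverse=True,
    )[:k]
    scores = {}
    for sim, uid in neighbours:
        if sim > 0:
            # Items a similar user viewed that this user has not seen yet.
            for item in recent[uid] - viewed:
                scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical recent activity: user id -> set of recently viewed items.
recent = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"x", "y"},
}
suggestions = map_user("u1", recent["u1"], recent)
```

Restricting the in-memory side to the latest data is what keeps it small enough to hold in every mapper.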
  • At this point, all the data required for recommendations is available in HBase. The final piece of the puzzle is the real-time piece, when a client requests a recommendation and the response needs to be served within milliseconds. When the application server receives the request for recommendations, it runs multiple queries against HBase to get all the data it needs, runs some Machine Learning evaluation steps on the models created in the background and serves the top-ranking articles based on the evaluation. Compared to the training, this is computationally very cheap and hence can be done in real time. However, there is still a decent amount of processing needed to complete this evaluation and rank the articles. We leverage in-memory caching on the application servers for speed. Our current production system is able to serve at least 10s of 1000s of requests per minute without the in-memory cache layer and potentially a lot more with it. Our average response times are less than 50ms for all of our recommendation algorithms.
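The serving path can be sketched as "fetch, score, rank", with a small in-memory cache in front of the store. This is a simplified illustration, not the production code. The dot-product scoring, the cache class and the dictionary standing in for HBase are all assumptions.

```python
import heapq


class FeatureCache:
    """Tiny in-memory cache in front of a slower lookup (here, a stand-in for the store)."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._data = {}
        self.misses = 0

    def get(self, item_id):
        if item_id not in self._data:
            self.misses += 1
            self._data[item_id] = self._fetch(item_id)
        return self._data[item_id]


def score(user_model, article_features):
    """Dot product of the user's preference weights and the article's features."""
    return sum(w * article_features.get(t, 0.0) for t, w in user_model.items())


def recommend(user_model, candidates, k=3):
    """Return the ids of the k highest-scoring candidate articles."""
    ranked = heapq.nlargest(k, candidates.items(), key=lambda kv: score(user_model, kv[1]))
    return [item_id for item_id, _ in ranked]


# Hypothetical article features and user model.
store = {
    "s1": {"tech": 1.0},
    "s2": {"markets": 1.0},
    "s3": {"tech": 0.5, "markets": 0.5},
}
cache = FeatureCache(store.__getitem__)
user_model = {"tech": 0.9, "markets": 0.1}

candidates = {item_id: cache.get(item_id) for item_id in store}
top = recommend(user_model, candidates, k=2)
cache.get("s1")  # served from memory, no extra store lookup
```

Scoring is a few multiplications per candidate, which is why it can run on every request while the expensive training stays in the background.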
  • To summarize the talk, Recommender Systems are really important in today's world, where the user base for most online businesses is huge and at the same time there is a need to serve each user personally. Recommender systems come in 2 flavors, content based and collaborative filter based, and the collaborative filter based algorithms can be further classified into user based and item based. Building a recommender system requires cross-domain expertise, especially in the fields of Machine Learning and Big Data. Hadoop's MapReduce framework provides a solid foundation for parallel processing and helps simplify building the offline components of a Recommender System. And most importantly, HBase can be used as a massively scalable distributed hybrid data store. I call it hybrid because it can be used for online queries as well as a source and sink for offline MapReduce jobs.
  • If you found any of this interesting and would like to get your hands dirty on some of these systems, please get in touch with me after the presentation or email me.
  • Questions?

    1. Recommender Systems at Scale Using HBase and Hadoop. Dhaval Shah, R&D Software Engineer, Bloomberg L.P.
    2. Agenda: Introduction to Recommender Systems; Types of Recommender Systems; Building a Recommender System; Summary; (Hopefully) lots of Q&A
    3. Introduction to Recommender Systems: What is a Recommender System? Wikipedia: "Recommender systems are a subclass of information filtering system that seek to predict the 'rating' or 'preference' that user would give to an item or social element they had not yet considered, using a model built from the characteristics of an item (content-based approaches) or the user's social environment (collaborative filtering approaches)."
    4. Introduction to Recommender Systems: Where are Recommender Systems used? Everywhere! (Well, almost!) E-Commerce; Web Portals; Online Radio; Streaming Movies; Media/News
    8. Introduction to Recommender Systems
    11. Introduction to Recommender Systems: Why do you need a Recommender System? Too much useful information. Site statistics: 500-1000 stories and 100-200 videos published per day; average user consumption << articles published; satisfied user = content quality + user preference; double-digit increases in CTR
    12. Types of Recommender Systems: Content-based; Collaborative filter based (User-based, Item-based); Hybrid
    13. Building a Recommender System: Collect/generate metadata about stories/videos; Identify and track users; Track user activity; Store user activity; Generate user models; Serve recommendations
    14. Building a Recommender System: Collect metadata about stories/videos (URLs, headlines, etc.; Sqoop, custom scripts); Generate features for stories (LDA from Mahout; custom extensions)
    15. Building a Recommender System: Identify and track users: Registered; Anonymous (cookie based tracking, IP based tracking)
    16. Building a Recommender System: Types of user activity: Explicit interactions; Implicit interactions
    17. Building a Recommender System: Tracking user activity: Browser (JavaScript) → HTTP Server → Flume → HBase
    18. Building a Recommender System: Tracking, key features: 1000s of page views per minute; Asynchronous, with instantaneous responses to the client; Reliability; Multiple HTTP servers → multiple clusters; Client to HBase in milliseconds
    19. Building a Recommender System: Why HBase? Scalable; Fault-tolerant; Auto-sharding; Schema-less and sparse; Real-time queries; MR integration
    20. Building a Recommender System: Store user activity: 100s of millions of users; Millions of stories/videos; TBs of data; Wide tables, 1 row per user; High load; Sub-second response times; Multiple MR jobs every few minutes
    21. Building a Recommender System: Generate user models using ML: 100s of millions of users; High IO/processing power; Train multiple times an hour
    22. Building a Recommender System: Content-based Recommender Models: User model independent of other users; Train only when user has a new interaction; Easily parallelizable; No reducer; Incremental training; Train 1000 user models a minute
    23. Building a Recommender System: Collaborative filter based Recommender Models: User model dependent on other users; Train all models frequently; Map side self join; No reducer; Batch training; Train 10s of millions of user models on each batch
    24. Building a Recommender System: Serve recommendations: Query HBase; Evaluate articles against user models; In-memory cache; 1000s of requests per minute; 50ms responses
    25. Summary: Recommender Systems are important; Content based and collaborative filter based; Cross-domain expertise (Big Data, Machine Learning); Hadoop/MapReduce for offline components; HBase as a hybrid data store
    26. Hiring. Email:
    27. Questions?