Introduction. Recommender systems try to deal with the information overload problem by filtering relevant items out of huge datasets. Often these datasets are so large that they no longer fit in the RAM of a single computing node, and when that happens we need some kind of data strategy to cope with it. This is usually where databases come in. But are they always the best option? Or, allow me to rephrase: are they the only option? A little anecdote to explain my point. In 2009, while working on my master's thesis, I developed a recommendation system for events and used a simple MySQL database for storage. At the end of the implementation I ran some performance tests, and let me show you one of the graphs that came out of them. This pie chart represents the total execution time of the recommendation algorithm. The green area is the database connection time, and the blue area, which accounts for only 0.5% of the total time, is the actual CPU time. Note that this is after optimizing the algorithm with prefetching strategies in separate threads and so on. This graph illustrates the point I am trying to make: if you are not careful, an ill-designed database, or for that matter any poorly thought-out data strategy, may impose a serious bottleneck on the performance of the system. With this in mind, I tried to avoid the use of a database for a project I was working on, and I will now show you how I did it.
For this project I used a high-performance computer that is available to researchers at my university. The conceptual layout of this cluster is as follows. In total there are 194 computing nodes, each of which contains 8 cores clocked at 2.5 GHz. Each node has 16 GB of internal memory and a 146 GB hard disk. The nodes are all interconnected with an InfiniBand interface, which is a lot faster than standard Ethernet. Each node is also connected to a shared storage medium that stores all data in a RAID5 configuration. As you can see, deploying a database in this scenario is not a trivial task, so we experimented with a file-based approach that allows optimal scalability across all the available nodes.
To transform the recommendation process into a scalable one, we must first analyze its common workflow. Taking the union over the most common recommendation algorithms, three phases can be distinguished: the item similarity calculation, the user similarity calculation, and the final recommendation calculation. For a typical recommendation process, item metadata and consumptions may be available. By consumptions we mean all sorts of feedback, both implicitly and explicitly gathered. From the item metadata, item similarities can be calculated, which in turn serve as input, together with the consumptions, for the user similarity calculation. Once we have the item similarities, the user similarities and the consumptions, the final recommendation calculation can take place. In the next couple of slides I will demonstrate how each of these phases can be processed in a way that allows transparent scaling across computing nodes and uses nothing more than files for storage.
So let’s first zoom in on the item similarity. If we want the item similarities of every item in the system, we have to compare each item with every other item. Since item similarity is typically a symmetric relationship, we only need to compute the upper triangular half of the comparison matrix. In this example of 5 items, that comes down to 10 comparisons. Now how can we distribute these calculations across different nodes? We first project the job matrix, snake-like, into a one-dimensional job array.
Distributing the 10 similarity calculation jobs is now a matter of slicing the job array into equal parts, depending on how many computing nodes are available. Inside the nodes, jobs can be delegated further among the available cores: 5 nodes with two cores each in this example. So we managed to distribute the item similarity calculation among different nodes. The item metadata is usually small enough to fit in 16 GB of RAM, so it can simply be loaded into every node. The output is less trivial. We experimented with a dataset of 53,000 items, which means over a billion comparisons, so the total output will of course be huge.
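The projection and slicing above can be sketched in a few lines. This is a minimal sketch; the exact traversal order (plain row-major here, "snake-like" in the talk) only affects which jobs land on which node, not the load balance:

```python
from itertools import combinations

def triangular_jobs(n_items):
    """Enumerate the comparison jobs (i, j) with i < j as one flat job array."""
    return list(combinations(range(n_items), 2))

def slice_jobs(jobs, n_nodes):
    """Slice the flat job array into near-equal contiguous parts, one per node."""
    chunk = -(-len(jobs) // n_nodes)  # ceiling division
    return [jobs[k * chunk:(k + 1) * chunk] for k in range(n_nodes)]

jobs = triangular_jobs(5)       # the 5-item example: 10 comparison jobs
per_node = slice_jobs(jobs, 5)  # 5 nodes -> 2 jobs per node
```

Inside each node the same slicing can be applied once more to spread a node's share over its cores.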
Two obvious approaches for storing this intermediate data are the following: we could write the result of every comparison job to a separate file, or we could group all the results in one single giant file. It is clear that neither of these approaches is scalable in the long run, so we adopted a meet-in-the-middle approach using the concept of file buckets. We define a file bucket as a container of individual files. So instead of a single file per comparison, we spread the output over a number of files we can freely choose. To spread the output over the different buckets we use the modulo function, which ensures an even load over the available buckets. A quick example to illustrate. Suppose we have 3 buckets, so we want all the results to be spread out over these three files. Take comparison job 5, which compared the items i2 and i4. Applying the modulo function to the index number of the first item gives the index number of the bucket: 2 modulo 3 equals 2, which is the index of the file bucket the output value should be written to. Note that with this system, all the similarities of the same item end up in the same file bucket.
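The bucket assignment itself is tiny; only the modulo rule comes from the talk, while the on-disk file naming below is a hypothetical example:

```python
def bucket_index(first_item_index, n_buckets):
    """The modulo rule: the first item's index modulo the bucket count decides
    which file bucket receives the similarity value, so all similarities of
    one item end up in the same bucket."""
    return first_item_index % n_buckets

def bucket_filename(first_item_index, n_buckets):
    # Hypothetical naming scheme for the bucket files on disk.
    return "itemsim_bucket_%d" % bucket_index(first_item_index, n_buckets)

# The talk's example: job 5 compares items i2 and i4, with 3 buckets.
# 2 modulo 3 == 2, so the result is written to bucket 2.
```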
Now about writing the similarities. If the 8 cores in the same node all started writing their output to the same file buckets, that would lock up your filesystem, or at least slow it down terribly. We use the following strategy to circumvent this issue. Let’s say we have two computing nodes with three cores each. First we force every core to write to its own dedicated set of file buckets. Then the file buckets are merged per computing node, so that each node has but one set of file buckets. Finally, we merge the file buckets of all the computing nodes into one set of file buckets on the shared storage. So that’s how item similarities are calculated, written and stored.
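The gradual merge can be sketched as one concatenation routine applied twice: first per node over the core-level bucket sets, then once more over the node-level sets onto shared storage. The directory layout here is an assumption, not taken from the talk:

```python
import os
import shutil

def merge_bucket_sets(src_dirs, dest_dir, n_buckets):
    """Concatenate bucket k of every source directory into bucket k of the
    destination, leaving one merged set of file buckets in dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for k in range(n_buckets):
        with open(os.path.join(dest_dir, "bucket_%d" % k), "ab") as out:
            for src in src_dirs:
                path = os.path.join(src, "bucket_%d" % k)
                if os.path.exists(path):
                    with open(path, "rb") as bucket:
                        shutil.copyfileobj(bucket, out)
```

Because only whole buckets are appended, no two writers ever touch the same file at the same time.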
Now we move on to the user similarity. Again, a matrix representation of the problem. Note that user similarity is not always a symmetric relationship, so we need to compute the entire matrix. Remember that we have item similarities and user feedback as input to our calculation. For each calculation job the relevant item similarities must be loaded together with the consumptions. Consumption files are really small, so we can simply load all the consumptions. The MovieLens 10M dataset, for example, contains 10 million consumptions from 72,000 users and is only roughly 260 MB in size.
For the distribution of the available jobs across the different computing nodes we use a slightly different approach than for the item similarities. To calculate the similarities from user 1 to users 2, 3 and 4, some subset of item similarities must be loaded based on the consumptions of user 1. This input data is the same for jobs 0, 1 and 2 in our example, so it makes sense to load it once and process the three jobs on the same node. We therefore split up the jobs on a user level, like this. The different tasks can then again be spread out over the available cores in each node. We can reuse the gradually merging file bucket strategy to write and store the user similarities, just as we did for the item similarities.
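The per-user grouping can be sketched as follows. User similarity is not symmetric, so each user is compared against every other user, and each user's whole group of jobs stays on one node so its input data is loaded only once:

```python
def user_job_groups(n_users):
    """Group the user-similarity jobs per source user, since all jobs of one
    user share the same input (that user's consumptions and the item
    similarities they point to)."""
    return [[(u, v) for v in range(n_users) if v != u] for u in range(n_users)]

def assign_groups(groups, n_nodes):
    """Keep each user's group together and deal whole groups out round-robin."""
    nodes = [[] for _ in range(n_nodes)]
    for idx, group in enumerate(groups):
        nodes[idx % n_nodes].append(group)
    return nodes
```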
Moving on to the recommendation calculation phase. When we have the item similarities, user similarities and consumptions available, we can feed them to the final recommendation calculation, which is of course the main goal of the entire system. We can define a recommendation task as a match between a user and an item, so the matrix representation of this task looks like this. If we want to know the recommendation value of user 1 for item 1, we need the user similarities of user 1 and the item similarities of item 1, together with the consumptions. And this for every user and item in the system.
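The talk does not spell out the scoring function itself. As one common possibility, an item-based weighted average over the user's consumptions could look like this; the function name and the choice of formula are assumptions, not the authors' method:

```python
def recommendation_score(user_ratings, target_item_sims):
    """Hypothetical item-based score for one (user, item) pair:
    user_ratings maps a consumed item -> rating for the target user, and
    target_item_sims maps item -> similarity to the target item. A
    user-based term could be blended in the same way using the user
    similarities."""
    weighted = sum(target_item_sims.get(j, 0.0) * r
                   for j, r in user_ratings.items())
    norm = sum(abs(target_item_sims.get(j, 0.0)) for j in user_ratings)
    return weighted / norm if norm else 0.0
```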
Actually, what we need to do is match all the item similarity information we have stored in our file buckets with the user similarity information that we also conveniently stored in file buckets. With our approach, this comes down to matching all the file buckets of the item similarities with those of the user similarities. This is easily distributed across the available computing nodes, because each node can independently process some subset of item and user buckets. This finally results in the recommendations we were looking for.
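Because every (item bucket, user bucket) pair can be processed on its own, the distribution boils down to enumerating the pairs and giving each node a share. A sketch:

```python
def bucket_pairs(n_item_buckets, n_user_buckets):
    """Every pairing of an item-similarity bucket with a user-similarity
    bucket is one independent recommendation job."""
    return [(i, u)
            for i in range(n_item_buckets)
            for u in range(n_user_buckets)]

def pairs_for_node(pairs, node_id, n_nodes):
    """Each node takes every n_nodes-th pair; no coordination is needed."""
    return pairs[node_id::n_nodes]
```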
We tested our approach on a real dataset that we gathered in a project with a cultural events website. From that website we collected 5 months of data, which included information about 53,000 items and 1,700 users. From these users we collected 14,000 feedback indicators in total, including clicks and browsing behavior. These indicators were ultimately combined into 6,800 consumptions, being ratings from 1 to 5. We generated recommendations for this dataset using different numbers of computing nodes, going from 10 up to 160. The specific numbers are in the paper, but what we ultimately saw was that the time the process took to calculate the recommendations scaled inversely with the number of nodes and cores used.
So I can conclude my presentation with the observation that we managed to successfully build a file-based approach for a recommender system on the HPC infrastructure provided by our university. We did this by splitting the general workflow of a recommendation system into three phases, each of which we were able to divide into independent subjobs that allow easy and scalable distribution over the available computing nodes. The end result is an approach that is almost embarrassingly parallel, a term used to indicate that each job is perfectly independent of the others and calculation time can decrease proportionally with the hardware applied. Our approach is also memory efficient, in the sense that the number of buckets used to store the item and user similarities can be chosen such that it is optimally adapted to the RAM of the computing nodes.
And hereby I conclude my presentation. I hope you found it interesting, and if you have any questions, feel free to contact me.
A File-based Approach for Recommender Systems in High-Performance Computing Environments Simon Dooms @sidooms
Slide: Introduction — Is a database always the best option? (Pie chart: 99.5% database connection time vs. 0.5% CPU time.) 09/02/2011, Simon Dooms, Ghent University, RSmeetDB '11
Slide: Recommendation workflow — Phase 1: Item Metadata → Item Similarity Calculation → Item Similarities. Phase 2: Item Similarities + Consumptions → User Similarity Calculation → User Similarities. Phase 3: Item Similarities + User Similarities + Consumptions → Recommendation Calculation.
Slide: Item similarity — Item Metadata → Item Similarity Calculation → Item Similarities.
Slide: Item similarity — distribution diagram (5 nodes, 2 cores each).
Slide: File buckets — modulo assignment, example for 3 buckets.
Slide: Writing item similarities — per-core buckets on local storage, merged per node and then onto shared storage.
Slide: User similarity — Item Similarities + Consumptions → User Similarity Calculation → User Similarities.
Slide: User similarity — distribution diagram (nodes with cores).
Slide: Recommendation calculation — Item Similarities + User Similarities + Consumptions → Recommendation Calculation.
Slide: Recommendation calculation — diagram matching item similarity buckets with user similarity buckets.
Slide: Results
- Proof of concept implementation
- Cultural events dataset: 5 months of data, 53,000 items, 1,700 users, 14,000 feedback indicators => 6,800 consumptions
- Number of nodes used: 10, 20, 40, 80, 160
- Execution time scales inversely with the number of nodes
Slide: Conclusion
- A file-based approach for HPC
- Workflow as independent subjobs
- Workflow ≈ embarrassingly parallel
- Approach both scalable and memory efficient
Slide: Closing — A File-based Approach for Recommender Systems in High-Performance Computing Environments. With the support of IWT Vlaanderen, the Stevin Supercomputer Infrastructure at Ghent University, the Hercules Foundation and EWI. Simon Dooms, @sidooms