I am Simon Dooms and I work at Ghent University in Belgium together with Toon De Pessemier and our supervisor Luc Martens. Our presentation is about a user-centric evaluation experiment where we wanted to find an optimal recommender system for a big events website in Belgium.
So we had this Belgian website that contained over 30,000 cultural events including movie releases, theater shows, exhibitions, fairs and so on. How do we help people who are browsing the website find the things they are really interested in without searching for them themselves? For that we use a recommender system. But as Guy Shani put it in the tutorial at RecSys last year (2010), there is no silver bullet. There is no universal recommender system that we can put into this website that is guaranteed to deliver the best result. So to find out what system and, more specifically, what algorithm would be best for this situation, we need to try out some and evaluate them in some way. We opted for a user-centric evaluation with 5 different recommender algorithms, and this presentation is about how we did this and what results we came up with.
Let me first show you the timeline of the experiment. Suppose that this arrow reflects the duration of the experiment from beginning to end. The first thing we did was recruit users for our experiment. We did this by sending out an invitation mail to all subscribers of the newsletter, and we put banners on the website to attract people’s attention. If someone wanted to take part in the experiment, they had to create an account on the website and tick a checkbox that indicated we could use their browsing data as a basis for recommendations. We explained to them that we would track their behavior on the website for at least 30 days and would then generate some recommendations based on that data. They would then be asked to fill out a questionnaire regarding the quality of the recommendations. After we started recruiting users for the experiment, we started tracking them on the website and logged any data we found relevant. 28 days after the first invitation we sent out a reminder to the users who had registered for the experiment but had not been active on the website so far. 41 days after the first mail we wrapped up our data and … used this as input to 5 different recommendation algorithms. Some days later we alerted the users to the availability of their recommendations and asked them to complete a questionnaire about their quality. Again we reminded the non-responsive users by mail. 56 days after the start of the experiment, we closed the online questionnaire and started analyzing the results. But before I get into that, first some more details about the setup of the experiment.
Feedback. So we tracked our users over a period of 41 days. In that period we wanted to learn about their preferences and behavioral patterns, so let me show you exactly what we logged. This is what an event detail page looks like. Every one of the 30,000 events has a page like this. It contains some detailed information about the event itself, like the title, short description, date, location, prices and so on. Now, some activities you can do on this page actually indicate a user preference for this event. For example: clicking the like button, using the share on Facebook and Twitter buttons, mailing this event to a friend, printing this event, looking at the itinerary, asking for more dates and locations of this event, and finally clicking this link, which will show even more detailed info about the event. Every one of these activities indicates some user preference towards this event. If you click on like we are absolutely certain that a user likes this, but what if he mails or prints this event? We have tried to put a value on every possible feedback indicator, ranging from 1 if we are absolutely sure that a user likes the event, down to 0.3 if we are far less sure. We assigned this 0.3 value to the activity of browsing to this event page. To aggregate multiple feedback values that a user may have expressed on the same event, we used the max function. This means that if a user first browses to this event page, the system will log an interest of 0.3 for this event; if he also prints this event, the system will log a 0.6 feedback value. So the maximum value is always the final value. So that’s what we did the first 41 days. We logged all this data.
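To make the max aggregation concrete, here is a minimal sketch in Python. Only the three weights mentioned in the talk (like = 1, print = 0.6, browse = 0.3) are included; the function name, the log format and the user and event identifiers are assumptions made for illustration, not the code used in the experiment.

```python
# Feedback weights stated in the talk. The other actions (share, mail,
# itinerary, ...) had their own values, which are not given here.
FEEDBACK_WEIGHTS = {
    "like": 1.0,
    "print": 0.6,
    "browse": 0.3,
}


def aggregate_feedback(log):
    """Collapse (user, event, action) triples into one value per (user, event).

    When a user expressed several kinds of feedback on the same event,
    the maximum weight is kept as the final value.
    """
    scores = {}
    for user, event, action in log:
        key = (user, event)
        scores[key] = max(scores.get(key, 0.0), FEEDBACK_WEIGHTS[action])
    return scores


# Hypothetical log entries, for illustration only.
log = [
    ("u1", "e1", "browse"),
    ("u1", "e1", "print"),   # same user and event: 0.6 replaces 0.3
    ("u1", "e2", "browse"),
    ("u2", "e1", "like"),
]
scores = aggregate_feedback(log)
```

In this sketch the order of the log entries does not matter, because the maximum is kept regardless of which action came first.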
Then we wanted to use this data as input for some recommendation algorithms. Remember that we wanted to try out multiple algorithms and compare them in a user-centric way. So we took 5 very common recommendation algorithms and provided each of these with the collected input. Because the setting of this online experiment allowed the gathering of both user feedback and item description data, we were able to implement both content-based and collaborative filtering algorithms and that also gave us the opportunity to implement a simple hybrid recommender which combines the best recommendations of both. Every algorithm was asked to generate 8 recommendations for every user in the experiment. When this was finished we matched every user randomly to any of the algorithms that was able to generate a list of at least 8 recommendations. We had to be careful, because not every algorithm was able to generate such recommendations for every user. If for example a user did not show a sufficient amount of overlap with other users, then user-based collaborative filtering will have a hard time recommending something to this user. By involving the random recommender in the experiment, there was always at least 1 algorithm that was able to provide recommendations for every user, and it of course also provided a nice baseline for the comparison with the other algorithms.
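The matching step described above could look roughly like the following sketch. The data layout (a dictionary of recommendation lists per algorithm and user), the function name and the algorithm labels are assumptions; only the requirement of 8 recommendations and the random matching come from the talk.

```python
import random

REQUIRED = 8  # every algorithm was asked for 8 recommendations per user


def assign_algorithms(recs, rng):
    """Match each user to a random algorithm that produced a full list.

    recs maps algorithm name -> {user: list of recommended items}.
    Returns {user: algorithm name}.
    """
    users = set()
    for per_user in recs.values():
        users.update(per_user)

    assignment = {}
    for user in sorted(users):
        eligible = sorted(
            algo
            for algo, per_user in recs.items()
            if len(per_user.get(user, [])) >= REQUIRED
        )
        if eligible:
            assignment[user] = rng.choice(eligible)
    return assignment


# Hypothetical data: the user-based CF entry for u2 is too short,
# so only the random recommender is eligible for that user.
items = list(range(8))
recs = {
    "random": {"u1": items, "u2": items},
    "ubcf": {"u1": items, "u2": items[:3]},
}
assignment = assign_algorithms(recs, random.Random(0))
```

Because the random recommender can always fill a list of 8, every user ends up with at least one eligible algorithm.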
When the recommendations were in place, the participants of the experiment were asked to fill out a questionnaire about the quality of the recommendations they got. To come up with good questions about the quality of the recommendation system, we looked into the ResQue framework that Pearl Pu presented last year (2010) in this very workshop. We selected a total of 14 questions regarding various topics that we found to be of interest. Specifically for this research there were 8 relevant ones. You can find them in the paper. Here we show the qualitative attributes that the questions are addressing. Our main goal was to find out which recommender would have the highest user satisfaction and, if possible, why. This is something that we cannot learn from offline analysis using metrics like precision and recall.
First about the number of users. We sent out invitations for this experiment to almost 60 thousand registered users of the website. We had an initial response of about 1%. So about 600 users (612, to be exact) indicated that they wanted to take part in the experiment. We logged all of these users for 41 days and generated recommendations for them. Unfortunately, of these 612 users, only 232 actually filled out the questionnaire. To prevent any sort of bias we only wanted to consider users that every algorithm was able to generate recommendations for. In that case every user has the same chance of being matched with any of the 5 available algorithms. The downside of this was that we had to eliminate another 39 users. The remaining 193 users had each been randomly matched with a recommendation algorithm, and each of them filled out the questionnaire.
A very easy and visual way to see these responses is to show the averages of every question and for every algorithm. So on the x-axis we find the questions, on the y-axis the average score that was given. Note that the questions about Diversity and Transparency were on a reverse scale, so lower means more. By just looking at these average scores we can already distinguish a clear overall winner. The green color (the hybrid recommender) reaches the highest values for almost every question except for familiarity and diversity. The winner, so to say, for diversity clearly was the random algorithm, which is of course anything but a shocking fact. If we were to appoint a runner-up, the second best algorithm, it would probably be the purple/pink color, which stands for the user-based collaborative filtering. This makes sense of course, since the hybrid is a merge of UBCF and CB. Another observation we can make is that SVD is actually a clear loser according to this chart. If you pay attention to the turquoise color you can see that it almost always ranks lowest together with the random recommender. More surprisingly, in the case of the accuracy question it ranks EVEN LOWER than the random recommender. So people were under the impression that the random recommender was better able to provide accurate suggestions than the SVD algorithm was. I have to add, however, that this result was, as you can see from the confidence intervals, not found to be statistically significant. Still funny though. We looked closer into this observation and found out that the answers to the questions about the SVD algorithm were widely separated between very good and very bad. So on average this shows pretty mediocre (bad) results. The low scores that SVD obtained may be coming from users with limited profiles and so on, but we have not yet fully explored this idea.
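As a rough illustration of how such per-algorithm averages and confidence intervals can be computed, and of how polarized answers produce a middling average with a wide interval, consider the sketch below. The 5-point scale, the answer values and the normal-approximation interval are all assumptions; the talk does not say how the intervals in the chart were calculated.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(scores, z=1.96):
    """Return the mean and the half-width of an approximate 95% interval.

    Uses a normal approximation (z = 1.96), which is an assumption here.
    """
    half_width = z * stdev(scores) / sqrt(len(scores))
    return mean(scores), half_width


# Hypothetical answers to one question, NOT the data of the experiment.
answers = {
    "hybrid": [4, 5, 4, 4, 5, 3],
    "svd": [1, 5, 1, 5, 2, 5],  # split between very bad and very good
}
summary = {algo: mean_with_ci(vals) for algo, vals in answers.items()}

for algo, (avg, half) in summary.items():
    print(f"{algo}: {avg:.2f} +/- {half:.2f}")
```

With answers split between the extremes, the average lands in the middle of the scale and the interval becomes much wider than for an algorithm that is rated consistently.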
The last thing I want to show you with this graph is that there is some hint of correlation between the questions. If you look at the Accuracy question and the Satisfaction question for example, these are probably highly correlated because they look almost identical.
Now we thought it would be interesting to find out which questions were correlated and which weren’t, and so we computed the complete correlation matrix. This is what it looks like. As a correlation metric we used the two-tailed Pearson correlation, so values are between 1 and minus 1. A zero value indicates no correlation; 1 and -1 indicate perfect positive and negative correlations respectively. Our suspicion that accuracy and satisfaction were highly correlated now turns out to be true. If we look more closely at the correlation values we can in fact see that question Q8 (satisfaction) correlates with most of the other questions except for Q5, which dealt with diversity. A similar trend can be noticed for Q10 and Q13. When we look at the correlation values of the diversity question Q5, we see a different situation. It turns out that the answers to the diversity question were completely unrelated to any other question in the experiment. We found this to be a rather surprising observation. We must be careful not to confuse correlation with causality, but still, the data suggest a strong relation between user satisfaction, accuracy, and trust. To get one step closer to causality we performed a simple regression analysis.
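A correlation matrix of this kind can be sketched as follows. The code computes only the Pearson coefficient itself, not the two-tailed significance test, and the answer values and question labels are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of questions."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical answers of six users, NOT the data of the experiment.
answers = {
    "accuracy": [1, 2, 3, 4, 5, 4],
    "satisfaction": [1, 2, 3, 5, 5, 4],
    "diversity": [3, 1, 4, 2, 3, 5],
}
matrix = correlation_matrix(answers)
```

Each entry lies between -1 and 1, the diagonal is 1, and the matrix is symmetric, so only one triangle needs to be inspected.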
In such an analysis we try to predict an attribute by using all the other ones as input to the regression function. You can find more details in the paper about the specific method we used; for now I will just show you the results. At the left side of the arrow is the attribute we are trying to predict, at the right side the attributes that the regression method came up with. Each of these attributes had its own coefficient of course, but I left them out to simplify. The R squared between brackets is called the coefficient of determination, and it indicates how well the proposed model fits, with values between 0 and 1, 1 being a perfect fit. Let’s zoom in on the most relevant line, which is the line for question Q8 where the user satisfaction is predicted. This shows that we can in fact predict the user satisfaction based on accuracy and transparency. We consider the Q10 and Q13 questions about trust and usefulness more as a result of Q8 rather than influencers. This would in fact explain the remarkably low scores of the SVD algorithm in our experiment. Because the inner workings of the SVD algorithm are the most obscure (the most black box), this algorithm will have a low transparency and therefore a low user satisfaction. If we look at how the diversity (Q5) was predicted, we notice the same trend as we did on the last slide. It seems that diversity cannot be correlated with any other attribute.
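To show what predicting one attribute from others and reporting R squared amounts to, here is a small sketch using plain ordinary least squares. This is an assumption: the specific regression method of the paper is not described in the talk and may differ. The data are synthetic, generated from an arbitrary linear rule purely to exercise the code.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares fit of y on the predictors in rows, with an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(x[i] * x[j] for x in design) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    pred = [sum(c * v for c, v in zip(coef, x)) for x in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return coef, 1.0 - ss_res / ss_tot


# Synthetic (accuracy, transparency) answers and a satisfaction score
# built from an arbitrary rule; these are NOT results of the experiment.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5), (2, 4)]
satisfaction = [0.5 + 0.6 * acc + 0.3 * transp for acc, transp in rows]
coef, r_squared = fit_ols(rows, satisfaction)
```

Because the synthetic target follows the rule exactly, the fit recovers the coefficients and R squared comes out at 1; real questionnaire data would give a lower value.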
Time to conclude what we have learned. We started off describing our online user-centric evaluation experiment where we implemented 5 popular recommendation algorithms and had users evaluate them by means of a questionnaire based on the framework of Pearl Pu. The hybrid recommender, which combined content-based and collaborative filtering recommendations, turned out to be the overall best algorithm, while SVD surprisingly came in last, sometimes even after the random recommender. We came up with two possible explanations for this observation. One is that opinions about the algorithm were divided between very good and very bad, leaving only an average end result. The very bad opinions may then have been caused by insufficient user profiles. For the second reason, I need my next bullet point, which is that a combination of accuracy and transparency seemed to be the defining influencers of the user satisfaction in the end. If we keep in mind that SVD is a very black box type of algorithm, then it is clear that its transparency will be very low and therefore possibly the user satisfaction linked to that. And finally, to conclude our conclusion, it seems that the users of the experiment did not value the diversity of a recommendation list as much as the other aspects of the recommendation system. We are planning to explore this further by means of some focus groups, allowing us to focus more on the reasoning behind some of the results we have presented today.
And with that I conclude my presentation. I hope you found it interesting, and if you have any questions, feel free to contact me.
A User-centric Evaluation of Recommender Algorithms for an Event Recommendation System
Simon Dooms, Toon De Pessemier, Luc Martens @sidooms
Slide "The Experiment" (10/23/2011, Simon Dooms - Ghent University - UCERSTI 2). Timeline: DAY 1 invitation mail; DAY 28 reminder mail; DAY 41 end tracking; DAY 45 send out recs; DAY 50 reminder mail; DAY 56 closed questionnaire.