A Bayesian Approach toward Active Learning for Collaborative Filtering

Rong Jin
Department of Computer Science
Michigan State University

Luo Si
Language Technology Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15232

Abstract

Collaborative filtering is a useful technique for exploiting the preference patterns of a group of users to predict the utility of items for the active user. In general, the performance of collaborative filtering depends on the number of rated examples given by the active user: the more rated examples the active user provides, the more accurate the predicted ratings will be. Active learning provides an effective way to acquire the most informative rated examples from active users. Previous work on active learning for collaborative filtering considers only the expected loss function based on the estimated model, which can be misleading when the estimated model is inaccurate. This paper takes one step further by taking into account the posterior distribution of the estimated model, which results in a more robust active learning algorithm. Empirical studies with datasets of movie ratings show that when the number of ratings from the active user is restricted to be small, active learning methods based only on the estimated model do not perform well, while the active learning method using the model distribution achieves substantially better performance.

1. Introduction

The rapid growth of information on the Internet demands intelligent information agents that can sift through all the available information and find out what is most valuable to us. Collaborative filtering exploits the preference patterns of a group of users to predict the utility of items for an active user. Compared to content-based filtering approaches, collaborative filtering systems have advantages in environments where the contents of items are not available, due either to a privacy issue or to the fact that contents are difficult for a computer to analyze (e.g., music and videos). One of the key issues in collaborative filtering is to identify the group of users who share similar interests with the active user. Usually, the similarity between users is measured based on their ratings over the same set of items. Therefore, to accurately identify users that share similar interests with the active user, a reasonably large number of ratings from the active user is usually required. However, few users are willing to provide ratings for a large number of items.

Active learning methods provide a solution to this problem by acquiring from the active user the ratings that are most useful in determining his/her interests. Instead of randomly selecting an item for soliciting the rating from the active user, most active learning methods select items that maximize the expected reduction in a predefined loss function. The commonly used loss functions include the entropy of the model distribution and the prediction error. In the paper by Yu et al. (2003), the expected reduction in the entropy of the model distribution is used to select the most informative item for the active user. Boutilier et al. (2003) applies the metric of expected value of utility to find the most informative item for soliciting the rating, which is essentially to find the item that leads to the most significant change in the highest expected ratings.

One problem with the previous work on active learning for collaborative filtering is that the computation of the expected loss is based only on the estimated model. This can be dangerous when the number of rated examples given by the active user is small, since the estimated model is then usually far from accurate. A better strategy for active learning is to take the model uncertainty into account by averaging the expected loss function over the posterior distribution of models. With the full Bayesian treatment, we will be able to avoid the problem caused by the large variance in the model distribution. Many studies have been done on active learning that take the model uncertainty into account.
The method of query by committee (Seung et al., 1992; Freund et al., 1997) simulates the posterior distribution of models by constructing an ensemble of models, and the example with the largest uncertainty in prediction is selected for the user's feedback. In the work by Tong and Koller (2000), a full Bayesian analysis of active learning for parameter estimation in Bayesian networks is used, which takes the model uncertainty into account when computing the loss function. In this paper, we will apply the full Bayesian analysis to active learning for collaborative filtering. In particular, in order to simplify the computation, we approximate the posterior distribution of the model with a simple Dirichlet distribution, which leads to an analytic expression for the expected loss function.

The rest of the paper is arranged as follows: Section 2 describes the related work in both collaborative filtering and active learning. Section 3 discusses the proposed active learning algorithm for collaborative filtering. The experiments are explained and discussed in Section 4. Section 5 concludes this work and discusses future work.

2. Related Work

In this section, we will first briefly discuss the previous work on collaborative filtering, followed by the previous work on active learning. The previous work on active learning for collaborative filtering will be discussed at the end of this section.

2.1 Previous Work on Collaborative Filtering

Most collaborative filtering methods fall into two categories: memory-based algorithms and model-based algorithms (Breese et al., 1998). Memory-based algorithms store the rating examples of users in a training database. In the prediction phase, they predict the ratings of an active user based on the corresponding ratings of those users in the training database that are similar to the active user. In contrast, model-based algorithms construct models that well explain the rating examples in the training database and apply the estimated model to predict the ratings for active users. Both types of approaches have been shown to be effective for collaborative filtering. In this subsection, we focus on the model-based collaborative filtering approaches, including the Aspect Model (AM), Personality Diagnosis (PD) and the Flexible Mixture Model (FMM).

For the convenience of discussion, we first introduce the notation. Let the items be denoted by X = {x_1, x_2, ..., x_M}, the users by Y = {y_1, y_2, ..., y_N}, and the range of ratings by {1, ..., R}. A tuple (x, y, r) means that rating r is assigned to item x by user y. Let X(y) denote the set of items rated by user y, and R_y(x) the rating of item x by user y.

The aspect model is a probabilistic latent space model, which models individual preferences as a convex combination of preference factors (Hofmann & Puzicha, 1999; Hofmann, 2003). A latent class variable z ∈ Z = {z_1, z_2, ..., z_K} is associated with each pair of a user and an item. The aspect model assumes that users and items are independent of each other given the latent class variable. Thus, the probability for each observation tuple (x, y, r) is calculated as follows:

    p(r | x, y) = Σ_{z∈Z} p(r | z, x) p(z | y)    (1)

where p(z|y) stands for the likelihood for user y to be in class z, and p(r|z,x) stands for the likelihood of item x being assigned rating r by users in class z. In order to achieve better performance, the ratings of each user are normalized to a normal distribution with zero mean and unit variance (Hofmann, 2003). The parameter p(r|z,x) is approximated as a Gaussian distribution N(μ_z, σ_z) and p(z|y) as a multinomial distribution.

The personality diagnosis approach treats each user in the training database as an individual model. To predict the rating of the active user on certain items, we first compute the likelihood for the active user to be in the 'model' of each training user. The ratings from the training user on the same items are then weighted by the computed likelihood, and the weighted average is used as the estimate of the ratings for the active user. By assuming that the observed rating of the active user y' on an item x is drawn from an independent normal distribution with mean equal to the true rating R^true_{y'}(x), we have

    p(R_{y'}(x) | R^true_{y'}(x)) ∝ exp( -(R_{y'}(x) - R^true_{y'}(x))^2 / (2σ^2) )    (2)

where the standard deviation σ is set to the constant 1 in our experiments. The probability for the active user y' to be in the model of any user y in the training database can then be written as:

    p(y' | y) ∝ Π_{x∈X(y)} exp( -(R_y(x) - R_{y'}(x))^2 / (2σ^2) )    (3)

Finally, the likelihood for the active user y' to rate an unseen item x as r is computed as:

    p(R_{y'}(x) = r) ∝ Σ_y p(y' | y) exp( -(R_y(x) - r)^2 / (2σ^2) )    (4)

Previous empirical studies have shown that the personality diagnosis method is able to outperform several other approaches for collaborative filtering (Pennock et al., 2000).
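The personality diagnosis computation in Equations (2)-(4) can be sketched in a few lines. The toy training database and the helper names (`like_mindedness`, `predict_rating`) are illustrative, not from the paper; the product in Equation (3) is restricted here to co-rated items:

```python
import math

SIGMA = 1.0  # standard deviation, set to the constant 1 as in the experiments

def like_mindedness(active_ratings, train_ratings):
    """Unnormalized p(y'|y) of Eq. (3): Gaussian match over co-rated items."""
    common = set(active_ratings) & set(train_ratings)
    return math.exp(-sum((train_ratings[x] - active_ratings[x]) ** 2
                         for x in common) / (2 * SIGMA ** 2))

def predict_rating(active_ratings, train_db, item, rating_scale):
    """p(R_y'(item) = r) for each r, via the weighted sum of Eq. (4)."""
    scores = {r: 0.0 for r in rating_scale}
    for train_ratings in train_db:
        if item not in train_ratings:
            continue
        w = like_mindedness(active_ratings, train_ratings)
        for r in rating_scale:
            scores[r] += w * math.exp(-(train_ratings[item] - r) ** 2
                                      / (2 * SIGMA ** 2))
    total = sum(scores.values()) or 1.0
    return {r: s / total for r, s in scores.items()}

# Toy example: the active user agrees exactly with the first training user,
# so that user's rating of the unseen item 'c' dominates the prediction.
train_db = [{'a': 5, 'b': 1, 'c': 5}, {'a': 1, 'b': 5, 'c': 1}]
active = {'a': 5, 'b': 1}
probs = predict_rating(active, train_db, 'c', range(1, 6))
best = max(probs, key=probs.get)
```

Because the like-mindedness weights decay exponentially with squared rating differences, dissimilar training users contribute almost nothing to the prediction.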
The Flexible Mixture Model introduces two sets of hidden variables {z_x, z_y}, with z_x for the class of items and z_y for the class of users (Si and Jin, 2003; Jin et al., 2003). Similar to the aspect model, the probability for each observed tuple (x, y, r) is factorized into a sum over the different classes of items and users, i.e.,

    p(x, y, r) = Σ_{z_x, z_y} p(z_x) p(z_y) p(x | z_x) p(y | z_y) p(r | z_x, z_y)    (5)

where the parameter p(y|z_y) stands for the likelihood for user y to be in class z_y, p(x|z_x) stands for the likelihood for item x to be in class z_x, and p(r|z_x, z_y) stands for the likelihood for users in class z_y to rate items in class z_x as r. All the parameters are estimated using the Expectation Maximization (EM) algorithm (Dempster et al., 1977). The multiple-cause vector quantization (MCVQ) model (Boutilier and Zemel, 2003) uses a similar idea for collaborative filtering.

2.2 Previous Work on Active Learning

The goal of active learning is to learn the correct model using only a small number of labeled examples. The general approach is to find the example from the pool of unlabeled data that gives the largest reduction in the expected loss function. The loss functions used by most active learning methods can be categorized into two groups: loss functions based on model uncertainty and loss functions based on prediction errors. For the first type, the goal is to achieve the largest reduction in the space of hypotheses. One commonly used loss function is the entropy of the model distribution. Methods within this category usually select the example for which the model has the largest uncertainty in predicting the label (Seung et al., 1992; Freund et al., 1997; Abe and Mamitsuka, 1998; Campbell et al., 2000; Tong and Koller, 2002). The second type of loss function involves the prediction errors. The largest reduction in the volume of the hypothesis space is not necessarily effective in cutting the prediction error: Freund et al. (1997) showed an example in which the largest reduction in the space of hypotheses does not bring the optimal improvement in the reduction of prediction errors. Empirical studies in Bayesian networks (Tong and Koller, 2000) and text categorization (Roy and McCallum, 2001) have shown that a loss function that directly targets the prediction error is able to achieve better performance than a loss function based only on the model uncertainty. The commonly used loss function within this category is the entropy of the distribution of predicted labels.

In addition to the choice of loss function, how to estimate the expected loss is another important issue for active learning. Many active learning methods compute the expected loss based only on the currently estimated model, without taking the model uncertainty into account. Even though this simple strategy works fine for many applications, it can be very misleading, particularly when the estimated model is far from the true model. As an example, consider learning a classification model for the data distribution in Figure 1, where spheres represent data points of one class and stars represent data points of the other class. The four labeled examples are highlighted by the line-shaded ellipses. Based on these four training examples, the most likely decision boundary is the horizontal line (i.e., the dashed line), while the true decision boundary is a vertical line (i.e., the dotted line). If we rely only on the estimated model for computing the expected loss, the examples that will be selected for the user's feedback are most likely from the dot-shaded areas, which will not be very effective in adjusting the estimated decision boundary (i.e., the horizontal line) toward the correct decision boundary (i.e., the vertical line). On the other hand, if we use the model distribution for estimating the expected loss, we will be able to adjust the decision boundary more effectively, since decision boundaries other than the estimated one (i.e., the horizontal line) have been considered in the computation of the expected loss.

Figure 1: A learning scenario in which the estimated model distribution is incorrect. The vertical line corresponds to the correct decision boundary and the horizontal line corresponds to the estimated decision boundary. (Axes: f1, f2.)

There have been several studies on active learning that use the posterior distribution of models for estimating the expected loss. The query by committee approach simulates the model distribution by sampling a set of models from the posterior distribution. In the work on active learning for parameter estimation in Bayesian networks, a Dirichlet distribution over the parameters is used to estimate the change in the entropy function. However, the general difficulty with the full Bayesian analysis for active learning is the computational complexity. For complicated models, it is usually rather difficult to obtain the model posterior distribution, which leads to sampling approaches such as Markov Chain Monte Carlo (MCMC) and Gibbs sampling. In this paper, we follow an idea similar to the variational approach (Jordan et al., 1999). Instead of computing the posteriors accurately, we approximate the posterior distribution with an analytic solution and apply it to estimate the expected loss. Compared to the sampling approaches, this substantially simplifies the computation by avoiding generating a large number of models and calculating the loss function values of all the generated models.
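For concreteness, the sampling alternative that the analytic approximation avoids can be sketched as a committee-style Monte-Carlo estimate. The Dirichlet posterior and the entropy loss below are placeholders for whatever model and loss are in use; this is an illustration of the general idea, not the paper's algorithm:

```python
import math
import random

random.seed(0)

def sample_dirichlet(alphas):
    """Draw one model theta ~ Dirichlet(alphas) via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def entropy(theta):
    """Entropy of a discrete model -- the loss averaged below."""
    return -sum(p * math.log(p) for p in theta if p > 0)

def sampled_expected_loss(alphas, loss=entropy, n_samples=1000):
    """Monte-Carlo estimate of E_{theta ~ posterior}[loss(theta)]:
    a committee of sampled models stands in for the intractable integral."""
    total = sum(loss(sample_dirichlet(alphas)) for _ in range(n_samples))
    return total / n_samples

# A peaked posterior (much evidence for one class) yields a lower
# expected entropy than a flat, uncertain one.
flat_loss = sampled_expected_loss([1.0, 1.0, 1.0])
peaked_loss = sampled_expected_loss([20.0, 1.0, 1.0])
```

Sampling a committee of models is general but expensive, since every candidate example requires many sampled models; the Dirichlet approximation developed in Section 3 replaces this loop with a closed form.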
2.3 Previous Work on Active Learning for Collaborative Filtering

There have been only a few studies on active learning for collaborative filtering. In the paper by Yu et al. (2003), a method similar to Personality Diagnosis (PD) is used for collaborative filtering. Each example is selected for the user's feedback in order to reduce the entropy of the like-mindedness distribution p(y | y') in Equation (3). In the paper by Boutilier et al. (2003), a method similar to the Flexible Mixture Model (FMM), named 'multiple-cause vector quantization', is used for collaborative filtering. Unlike most other collaborative filtering research, which tries to minimize the prediction error, that paper is concerned only with the items that are strongly recommended by the system. The loss function is based on the expected value of information (EVOI), which is computed based on the currently estimated model. One problem with these two approaches is that the expected loss is computed based on only a single model, namely the currently estimated model. This is especially problematic when the number of rated examples given by the active user is small while the number of parameters to be estimated is large. In the experiments, we will show that selection based on an expected loss function that uses only the currently estimated model can be even worse than simple random selection.

3. A Bayesian Approach toward Active Learning for Collaborative Filtering

In this section, we will first discuss the approximate analytic solution for the posterior distribution. Then, the approximate posterior distribution will be used to estimate the expected loss.

3.1 Approximating the Model Posterior Distribution

For the sake of simplicity, we will focus on active learning of the aspect model for collaborative filtering. However, the discussion in this section can easily be extended to other models for collaborative filtering.

As described in Equation (1), each conditional likelihood p(r | x, y) is decomposed as the sum Σ_z p(r | z, x) p(z | y). For the active user y', the most important task is to determine his/her user type, i.e., p(z | y'). Let θ stand for the parameters θ = {θ_z = p(z | y')}_z. Usually, the parameters θ are estimated through maximum likelihood estimation, i.e.

    θ* = arg max_{θ∈Θ} Π_{x∈X(y')} p(r | x, y'; θ)
       = arg max_{θ∈Θ} Π_{x∈X(y')} Σ_{z∈Z} p(r | z, x) θ_z    (6)

Then the posterior distribution of the model, p(θ | D(y')), where D(y') includes all the rated items given by the active user y', can be written as:

    p(θ | D(y')) ∝ p(D(y') | θ) p(θ) ≈ Π_{x∈X(y')} Σ_{z∈Z} p(r | z, x) θ_z    (7)

In the above, a uniform prior distribution is assumed and simply ignored. The posterior in (7) is rather difficult to use for estimating the expected loss, due to the product of multiple sums and the lack of normalization.

To approximate the posterior distribution, we consider the expansion of the posterior around its maximum point. Let θ* stand for the maximal parameters obtained from the EM algorithm by maximizing Equation (6). Consider the ratio of log p(θ | D(y')) to log p(θ* | D(y')), which can be bounded as:

    log [ p(θ | D(y')) / p(θ* | D(y')) ]
      = Σ_i log [ Σ_z p(r_i | x_i, z) θ_z / Σ_z p(r_i | x_i, z) θ*_z ]
      ≥ Σ_i Σ_z [ p(r_i | x_i, z) θ*_z / Σ_{z'} p(r_i | x_i, z') θ*_{z'} ] log (θ_z / θ*_z)    (8)
      = Σ_z (α_z - 1) log (θ_z / θ*_z)

where the inequality follows from Jensen's inequality and the variable α_z is defined as

    α_z = Σ_i [ p(r_i | x_i, z) θ*_z / Σ_{z'} p(r_i | x_i, z') θ*_{z'} ] + 1    (9)

Therefore, according to Equation (8), the approximate posterior distribution p(θ | D(y')) is a Dirichlet with hyperparameters α_z defined in Equation (9), or

    p(θ | D(y')) ≈ [ Γ(α*) / Π_z Γ(α_z) ] Π_z θ_z^(α_z - 1)    (10)

where α* = Σ_z α_z. Furthermore, it is easy to see that

    θ*_z / θ*_{z'} = (α_z - 1) / (α_{z'} - 1)    (11)
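Equation (9) translates directly into code: each hyperparameter accumulates, over the rated items, the class responsibility under the EM solution θ*, plus one. The toy likelihood table below is an illustrative assumption, not data from the paper:

```python
def dirichlet_hyperparameters(theta_star, observations, p_r_zx):
    """Eq. (9): alpha_z = sum_i resp_i(z) + 1, where resp_i(z) is the class-z
    responsibility of the i-th rated item (x_i, r_i) under theta*."""
    K = len(theta_star)
    alphas = [1.0] * K
    for x, r in observations:
        weights = [p_r_zx(r, x, z) * theta_star[z] for z in range(K)]
        total = sum(weights)
        for z in range(K):
            alphas[z] += weights[z] / total  # normalized responsibility
    return alphas

# Toy check: two user classes; class 0 explains every observed rating better,
# so its hyperparameter grows faster.
p_r_zx = lambda r, x, z: 0.8 if z == 0 else 0.2
alphas = dirichlet_hyperparameters([0.5, 0.5], [('x1', 4), ('x2', 5)], p_r_zx)
# alphas always sums to K + number of observations (here 2 + 2 = 4)
```

Since the responsibilities for each item sum to one, α* = Σ_z α_z equals the number of classes plus the number of rated items, so the posterior sharpens as the active user rates more items.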
The above relation holds because the optimal solution obtained by the EM algorithm is actually a fixed point of the following equation:

    θ*_z = (1 / |X(y')|) Σ_i [ p(r_i | x_i, z) θ*_z / Σ_{z'} p(r_i | x_i, z') θ*_{z'} ]    (12)

where |X(y')| is the number of items rated by the active user. Based on the property in Equation (11), the optimal point of the approximate Dirichlet distribution is exactly θ*, i.e., the optimal solution found by the EM algorithm.

3.2 Estimating the Expected Loss

Similar to many other active learning works (Tong and Koller, 2000; Seung et al., 1992; Yu et al., 2003), we use the entropy of the model distribution as the loss function. Many previous studies of active learning try to find the example that directly minimizes the entropy of the model parameters under the currently estimated model. In the case of the aspect model, this can be formulated as the following optimization problem:

    x* = arg min_{x∈X} - < Σ_z θ_{z|x,r} log θ_{z|x,r} >_{p(r|x,θ)}    (13)

where θ denotes the currently estimated parameters and θ_{|x,r} denotes the optimal parameters estimated using one additional rated example, i.e., with example x rated as r. The basic idea of Equation (13) is to find the example x such that the expected entropy of the distribution θ is minimized. There are two problems with this simple strategy. The first problem comes from the fact that the expected entropy is computed based only on the currently estimated model: as indicated in Equation (13), the expectation is evaluated over the distribution p(r | x, θ). As already discussed in Section 2, this simple strategy can be misleading, particularly when the currently estimated model is far from the true model. The second problem comes from the characteristics of collaborative filtering. Recall that the parameter θ_z stands for the likelihood of the active user being in user class z. What the optimization in Equation (13) tries to achieve is to find the most efficient way to reduce the uncertainty in the user class of the active user. However, according to previous studies in collaborative filtering (Si and Jin, 2003), most users are of multiple types rather than a single type. Therefore, we would expect the distribution θ to have multiple modes, which is inconsistent with the optimization goal in Equation (13). Based on the above analysis, the optimization problem in Equation (13) is not appropriate for active learning of collaborative filtering.

In the ideal case, if the true model θ_true were given, the optimal strategy for selecting an example would be:

    x* = arg min_{x∈X} - < Σ_z θ^true_z log θ_{z|x,r} >_{p(r|x,θ_true)}    (14)

The goal of the optimization in Equation (14) is to find an example such that, by adding it to the pool of rated examples, the updated model θ_{|x,r} is adjusted toward the true model θ_true. Unfortunately, the true model is unknown. One way to approximate the optimization goal in Equation (14) is to replace θ_true with the expectation over the posterior distribution p(θ | D(y')). Equation (14) is thereby transformed into the following optimization problem:

    x* = arg min_{x∈X} - << Σ_z θ'_z log θ_{z|x,r} >_{p(r|x,θ')} >_{p(θ'|D(y'))}    (15)

Now the key issue is how to compute the integration efficiently. In effect, in our case the integration can be computed analytically as:

    << Σ_z θ'_z log θ_{z|x,r} >_{p(r|x,θ')} >_{p(θ'|D(y'))}
      = Σ_r ∫ p(θ' | D(y')) p(r | x, θ') Σ_j θ'_j log θ_{j|x,r} dθ'
      = Σ_r Σ_i Σ_j p(r | x, i) < θ'_i θ'_j >_{p(θ'|D(y'))} log θ_{j|x,r}
      = Σ_r [ Σ_j α_j log θ_{j|x,r} Σ_i α_i p(r | x, i) + Σ_j α_j p(r | x, j) log θ_{j|x,r} ] / ( α* (α* + 1) )    (16)

where p(r | x, i) is shorthand for p(r | x, z_i), and the last step uses the second-order moments of the Dirichlet distribution, < θ'_i θ'_j > = α_i (α_j + δ_ij) / (α* (α* + 1)). The details of the derivation can be found in the appendix. With the analytic solution for the integration in (16), we can simply go through each item in the pool and find the one that minimizes the loss function specified in Equation (15).

Finally, the computation involves the estimation of the updated model θ_{|x,r}. The standard way to obtain the updated model is simply to rerun the full EM algorithm with one additional rated example (i.e., with example x rated as category r). This computation can be extremely expensive, since we need an updated model for every item and every possible rating category.
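The resulting selection rule can be sketched as follows: each candidate item is scored with the analytic expected loss of Equation (16), using the Dirichlet second moments E[θ'_i θ'_j] = α_i(α_j + δ_ij)/(α*(α*+1)), and the minimizer is chosen as in Equation (15). The toy likelihood table and the one-step `toy_update` (a stand-in for the approximate EM update of Equation (18)) are illustrative assumptions, not the paper's experimental setup:

```python
import math

def expected_loss(x, alphas, ratings, p_r_zx, updated_model):
    """Analytic expected loss of Eq. (16) under the Dirichlet posterior,
    using E[th_i th_j] = a_i (a_j + [i == j]) / (a* (a* + 1))."""
    a_star = sum(alphas)
    norm = a_star * (a_star + 1.0)
    loss = 0.0
    for r in ratings:
        log_t = [math.log(t) for t in updated_model(x, r)]  # log theta_{|x,r}
        s1 = sum(a * p_r_zx(r, x, i) for i, a in enumerate(alphas))
        s2 = sum(a * lt for a, lt in zip(alphas, log_t))
        s3 = sum(a * p_r_zx(r, x, j) * lt
                 for j, (a, lt) in enumerate(zip(alphas, log_t)))
        loss -= (s1 * s2 + s3) / norm
    return loss

def select_item(pool, alphas, ratings, p_r_zx, updated_model):
    """Eq. (15): scan the item pool for the minimizer of the expected loss."""
    return min(pool, key=lambda x: expected_loss(x, alphas, ratings,
                                                 p_r_zx, updated_model))

# Toy setting: two user classes, binary ratings. Item 'A' is informative
# (its rating tracks the user class); item 'B' carries no information.
ALPHAS = [2.0, 2.0]

def p_r_zx(r, x, z):
    return 0.5 if x == 'B' else (0.9 if r == z else 0.1)

def toy_update(x, r):
    """One-step stand-in for Eq. (18): add the class responsibility of the
    hypothesized rating (x, r) to the Dirichlet counts, then renormalize."""
    theta = [a / sum(ALPHAS) for a in ALPHAS]
    resp = [p_r_zx(r, x, z) * theta[z] for z in range(len(ALPHAS))]
    total = sum(resp)
    new = [a + w / total for a, w in zip(ALPHAS, resp)]
    return [a / sum(new) for a in new]

chosen = select_item(['A', 'B'], ALPHAS, [0, 1], p_r_zx, toy_update)
```

In this toy run the informative item 'A' receives the lower expected loss, because the second-moment terms reward updated models that agree with the probable ratings, so it is the item whose rating would be solicited.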
A more efficient way to compute the updated model is to avoid the direct optimization of Equation (6). Instead, we use the following approximation:

    θ_{|x,r} = arg max_{θ∈Θ} p(r | x, θ) Π_{x'∈X(y')} p(r | x', y'; θ)
             = arg max_{θ∈Θ} p(r | x, θ) p(θ | D(y'))
             ≈ arg max_{θ∈Θ} p(r | x, θ) Dirichlet({α_z})
             ≈ arg max_{θ∈Θ} [ Σ_{z'} p(r | x, z') θ_{z'} ] Π_z θ_z^(α_z - 1)    (17)

Equation (17) can again be optimized using the EM algorithm, with the updating equation:

    θ_z = [ p(r | x, z) θ_z + α_z - 1 ] / [ α* + Σ_{z'} ( p(r | x, z') θ_{z'} - 1 ) ]    (18)

The advantage of the updating equation in (18) over the more general EM update in (12) is that Equation (18) relies only on the rating r for item x, while Equation (12) has to go through all the items that have been rated by the active user. Therefore, the computational cost of (18) is significantly less than that of Equation (12).

In summary, in this section we propose a Bayesian approach to active learning for collaborative filtering, which uses the model posterior distribution to estimate the loss function. To simplify the computation, we approximate the model distribution with a Dirichlet distribution; as a result, the expected loss can be calculated analytically. Furthermore, we propose an efficient way to compute the updated model by approximating the prior with a Dirichlet distribution. For easy reference, we call this method the 'Bayesian method'.

4. Experiments

In this section, we present experimental results in order to address the following two questions:

1) Is the proposed active learning algorithm effective for collaborative filtering? In the experiment, we compare the proposed algorithm to the method of randomly acquiring examples from the active user.

2) How important is the full Bayesian treatment? In this paper, we emphasize the importance of taking the model uncertainty into account through the posterior distribution. To illustrate this point, we compare the proposed algorithm to two commonly used active learning methods that are based only on the estimated model, without utilizing the model distribution, and see which is more effective for collaborative filtering. The details of these two active learning methods will be discussed later.

4.1 Experiment Design

Two datasets of movie ratings are used in our experiments, 'MovieRating' and 'EachMovie'. For 'EachMovie', we extracted a subset of 2,000 users with more than 40 ratings each. The details of the two datasets are listed in Table 1.

Table 1: Characteristics of MovieRating and EachMovie.

                                MovieRating    EachMovie
    Number of Users             500            2000
    Number of Items             1000           1682
    Avg. # of rated Items/User  87.7           129.6
    Number of Ratings           5              6

For MovieRating, we took the first 200 users for training and all the others for testing. For EachMovie, the first 400 users were used for training. For each test user, we randomly selected three items with their ratings as the starting seed for the active learning algorithm to build the initial model. Furthermore, twenty rated items were reserved for each test user for evaluating the performance. Finally, the active learning algorithm is given the chance to ask for the ratings of five different items, and the performance of the active learning algorithms is evaluated after each feedback. For simplicity, in this paper we assume that the active user will always be able to rate the items presented by the active learning algorithm. Of course, as pointed out by Yu et al. (2003), this is not a realistic assumption, because there are items that the active user may never have seen before and therefore cannot rate. However, in this paper we mainly focus on the behavior of active learning algorithms for collaborative filtering and leave this issue for future work.

The aspect model is used for collaborative filtering as described in Section 2. The number of user classes used in the experiments is 5 for the MovieRating dataset and 10 for the EachMovie dataset.

The evaluation metric used in our experiments is the mean absolute error (MAE), which is the average absolute deviation of the predicted ratings from the actual ratings on the items the test user has actually voted on:

    MAE = (1 / L_Test) Σ_l | r̂(l) - R_{y(l)}(x(l)) |    (19)

where L_Test is the number of test ratings.
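Equation (19) amounts to a one-liner; the ratings below are made up for illustration:

```python
def mean_absolute_error(predicted, actual):
    """Eq. (19): average absolute deviation of predicted from actual ratings."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Two of the four predictions are off by one rating point each.
mae = mean_absolute_error([4, 3, 5, 2], [5, 3, 4, 2])  # (1 + 0 + 1 + 0) / 4 = 0.5
```

A smaller MAE indicates better performance, which is how the figures below should be read.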
In this experiment, we consider three different baseline methods for active learning of collaborative filtering:

1) Random Selection: This method simply randomly selects one item out of the pool of items for the user's feedback. We refer to this simple method as the 'random method'.

2) Model Entropy based Sample Selection: This is the method described at the beginning of Section 3.2. The idea is to find the item that best reduces the entropy of the distribution of the active user over the different user classes (Equation (13)). As mentioned above, the problems with this simple selection strategy are twofold: the expected loss is computed based only on the currently estimated model, and the strategy conflicts with the intuition that each user can belong to multiple user classes. This method shares many similarities with the proposed method, except that it does not consider the model distribution. By comparing it to the proposed active learning method for collaborative filtering, we will be able to see whether the idea of using the model distribution to compute the expected loss is worthwhile for collaborative filtering. For easy reference, we call this method the 'model entropy method'.

3) Prediction Entropy based Sample Selection: Unlike the previous method, which concerns the uncertainty in the distribution of user classes, this method focuses on the uncertainty in the prediction of ratings for items. It selects the item that most effectively reduces the entropy of the prediction. Formally, the selection criterion can be formulated as the following optimization problem:

    x* = arg min_{x∈X} - < Σ_{x'} Σ_{r'} p(r' | x', θ_{|x,r}) log p(r' | x', θ_{|x,r}) >_{p(r|x,θ)}    (20)

where θ_{|x,r} stands for the model updated with the extra rated example (x, r). Similar to the previous method, this approach uses the estimated model to compute the expected loss. However, unlike the previous approach, which tries to make the distribution of user types for the active user skewed, this approach targets the prediction distribution. Therefore, this method does not go against the intuition that each user can be of multiple types. We refer to this method as the 'prediction entropy method'.

4.2 Results and Discussion

The results for the proposed active learning algorithm, together with the three baseline methods, on the databases 'MovieRating' and 'EachMovie' are presented in Figures 2 and 3, respectively.

Figure 2: MAE results for four different active learning algorithms for collaborative filtering, namely the random method, the Bayesian method, the model entropy method, and the prediction entropy method, on the 'MovieRating' database with 200 training users. The smaller the MAE, the better the performance.

The first thing to notice from these two diagrams is that both the 'model entropy method' and the 'prediction entropy method' perform consistently worse than the simple 'random method'. This phenomenon can be explained by the fact that the initial number of rated items given by the active user is only three, which is relatively small given that the number of parameters to determine is as large as 10. Therefore, using the estimated model to compute the expected loss is inaccurate for finding the most informative item to rate. Furthermore, comparing the results of the 'model entropy method' to those of the 'prediction entropy method', we see that the model entropy method usually performs even worse. This can be explained by the fact that, by minimizing the entropy of the distribution of user classes for the active user, the model entropy method tries to narrow the active user down to a single user type. However, most users are of multiple types. Therefore, it is inappropriate to apply the model entropy method to active learning for collaborative filtering. On the other hand, the prediction entropy method does not have this defect, because it focuses on minimizing the uncertainty in the prediction instead of the uncertainty in the distribution of user classes.

The second observation from Figures 2 and 3 is that the proposed active learning method performs better than any of the three baseline methods on both 'MovieRating' and 'EachMovie'. The most important difference between the proposed method and the other methods for active learning is that the proposed method takes the model distribution into account, which makes it robust even when only three rated items are given by the active user.

5. Conclusion
[Figure 2 plot: MAE on the 'EachMovie' database (y-axis from 1.00 to 1.18) versus the number of feedbacks (x-axis from 1 to 6).]

Figure 1: MAE results for four different active learning algorithms for collaborative filtering, namely the random method, the Bayesian method, the model entropy method, and the prediction entropy method. The database is 'MovieRating' and the number of training users is 200. The smaller the MAE, the better the performance.

In this paper, we proposed a full Bayesian treatment of active learning for collaborative filtering. Different from previous studies of active learning for collaborative filtering, this method takes into account the model distribution when computing the expected loss. In order to alleviate the computational complexity, we approximate the model posterior with a simple Dirichlet distribution. As a result, the estimated loss can be computed analytically. Furthermore, we propose using the approximated prior distribution to simplify the computation of the updated model, which is required for every item and every possible rating category. As the current Bayesian version of active learning for collaborative filtering focuses only on the model quality, we will extend this work to the prediction error, which usually is more informative according to previous studies of active learning.

References

Abe, N. and Mamitsuka, H. (1998). Query learning strategies using boosting and bagging. In Proceedings of the Fifteenth International Conference on Machine Learning.

Boutilier, C., Zemel, R. S., and Marlin, B. (2003). Active collaborative filtering. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence.

Breese, J. S., Heckerman, D., and Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence.

Campbell, C., Cristianini, N., and Smola, A. (2000). Query learning with large margin classifiers. In Proceedings of the 17th International Conference on Machine Learning.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39:1-38.

Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning.

Hofmann, T. and Puzicha, J. (1999). Latent class models for collaborative filtering. In Proceedings of the International Joint Conference on Artificial Intelligence.

Hofmann, T. (2003). Gaussian latent semantic models for collaborative filtering. In Proceedings of the 26th Annual Int'l ACM SIGIR Conference on Research and Development in Information Retrieval.

Jin, R., Si, L., Zhai, C. X., and Callan, J. (2003). Collaborative filtering with decoupled models for preferences and ratings. In Proceedings of the 12th International Conference on Information and Knowledge Management.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. In M. I. Jordan (Ed.), Learning in Graphical Models. Cambridge: MIT Press.

Pennock, D. M., Horvitz, E., Lawrence, S., and Giles, C. L. (2000). Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence.

Roy, N. and McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the Eighteenth International Conference on Machine Learning.

Seung, H. S., Opper, M., and Sompolinsky, H. (1992). Query by committee. In Computational Learning Theory, pages 287-294.

Si, L. and Jin, R. (2003). Flexible mixture model for collaborative filtering. In Proceedings of the Twentieth International Conference on Machine Learning.

Tong, S. and Koller, D. (2000). Active learning for parameter estimation in Bayesian networks. In Advances in Neural Information Processing Systems.

Yu, K., Schwaighofer, A., Tresp, V., Ma, W.-Y., and Zhang, H. J. (2003). Collaborative ensemble learning: combining collaborative and content-based information filtering via hierarchical Bayes. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence.
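The conclusion's central idea, evaluating the expected loss under a posterior distribution over models (here approximated by a Dirichlet) rather than at a single point estimate, can be sketched in a few lines. The sketch below is illustrative only and is not the paper's algorithm: it uses the entropy of each item's predictive rating distribution as a stand-in loss, and the item names and Dirichlet pseudo-counts are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (in nats)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def expected_posterior_entropy(alpha):
    """Expected entropy of the predictive rating distribution after
    observing one more rating, averaged over the Dirichlet predictive
    probabilities.  The Dirichlet is conjugate to the multinomial rating
    likelihood, so this expectation is a finite analytic sum."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    pred = alpha / a0                 # predictive prob. of each rating
    exp_h = 0.0
    for r, p_r in enumerate(pred):
        updated = alpha.copy()
        updated[r] += 1.0             # conjugate Dirichlet update
        exp_h += p_r * entropy(updated / (a0 + 1.0))
    return exp_h

# Hypothetical Dirichlet pseudo-counts over 5 rating categories for
# three candidate items an active learner might query.
items = {
    "item_a": [0.5, 0.5, 0.5, 0.5, 0.5],   # almost no evidence yet
    "item_b": [10.0, 1.0, 1.0, 1.0, 1.0],  # strong evidence for one rating
    "item_c": [3.0, 3.0, 1.0, 1.0, 1.0],   # moderate evidence
}

# Query the item whose rating reduces the expected entropy the most.
gains = {}
for name, alpha in items.items():
    alpha = np.asarray(alpha, dtype=float)
    current = entropy(alpha / alpha.sum())
    gains[name] = current - expected_posterior_entropy(alpha)

best = max(gains, key=gains.get)
print(best, gains)   # item_a has the least evidence, so it is chosen
```

Because every candidate rating leads to another Dirichlet, the expectation is a short analytic sum over rating categories, mirroring the paper's observation that under this approximation the estimated loss can be computed analytically instead of by sampling models from the posterior.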