Transfer learning for CLP

Cao et al. ICML 2010
Presented by Danushka Bollegala.

 Predict links (relations) between entities
 Recommend items for users (MovieLens, Amazon)
 Recommend users for users (social recommendation)
 Similarity search (suggest similar web pages)
 Query suggestion (suggest related queries by other users)
 Collective Link Prediction (CLP)
 Perform multiple prediction tasks for the same set of users
simultaneously
▪ Predict/recommend multiple item types (books and movies)
 Pros
 Prediction tasks might not be independent, one can
benefit from another (books vs. movies vs. food)
 Less affected by data sparseness (cold start problem)

Transfer Learning + Collective Link Prediction (this paper)
 Gaussian Process for Regression (GPR) (PRML Sec. 6.4)
 Link prediction = matrix factorization
 Probabilistic Principal Component Analysis (PPCA) (Tipping & Bishop, 1999), PRML Chapter 12
 Probabilistic non-linear matrix factorization (Lawrence & Urtasun, ICML 2009)
 Task similarity matrix, T

 Link matrix X (xi,j is the rating given by user i to item j)
 xi,j is modeled by f(ui, vj, ε)
 f: link function
 ui: latent representation of a user i
 vj: latent representation of an item j
 ε: noise term
 Generalized matrix approximation
 Assumption: ε is Gaussian noise N(0, σ²I)
 Use Y = f⁻¹(X)
 Then, Y follows a multivariate Gaussian distribution.
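
The generalized matrix approximation idea can be sketched in a few lines. This is only an illustration under stated assumptions: the scaled logistic link below, its `low`/`high` bounds, and the toy sizes are hypothetical and are not the link function used in the paper. The point is that an invertible link lets us work with Y = f⁻¹(X), which is a low-rank product plus Gaussian noise.

```python
import numpy as np

def link(y, low=1.0, high=5.0):
    """Hypothetical link f: squash real values into the rating range (low, high)."""
    return low + (high - low) / (1.0 + np.exp(-y))

def inverse_link(x, low=1.0, high=5.0):
    """Inverse of the hypothetical link: map ratings back to the real line."""
    p = (x - low) / (high - low)
    return np.log(p) - np.log1p(-p)

rng = np.random.default_rng(0)
U = rng.standard_normal((4, 2))        # latent user representations u_i
V = rng.standard_normal((3, 2))        # latent item representations v_j
E = 0.1 * rng.standard_normal((4, 3))  # Gaussian noise term

Y = U @ V.T + E             # latent, Gaussian-distributed matrix
X = link(Y)                 # observed ratings on the 1-5 scale
Y_recovered = inverse_link(X)
```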

 We can view a function as an infinite dimensional
vector
 f(x): (f(x1), f(x2),...)T
 Each point in the domain is mapped by f to a dimension in
the vector
 In machine learning we must find functions (e.g. linear
predictors) that map input values to their
corresponding output values
 We must also avoid over-fitting
 This can be visualized as sampling from a distribution
over functions with certain properties
 Preference bias (cf. restriction bias)

 Linear regression model
 We get different output functions y for
different weight vectors w.
 Let us impose a Gaussian prior over w
 Train dataset: {(x1,y1),...,(xN,yN)}
 Targets: y=(y1,...,yN)T
 Design matrix

 When we impose a Gaussian prior over the
weight vector, then the target y is also
Gaussian.
 K: Kernel matrix (Gram matrix)
 k: kernel function
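
The equations for these two slides did not survive extraction. Assuming the slides follow PRML Sec. 6.4.1, the argument can be written out as below. The symbols φ (basis functions), Φ (design matrix), and α (prior precision) are taken from PRML rather than from the slides, and this α is unrelated to the link-function parameter α that appears later.

```latex
\begin{align}
y(\mathbf{x}) &= \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}), &
p(\mathbf{w}) &= \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) \\
\mathbf{y} &= \boldsymbol{\Phi}\mathbf{w}, &
\Phi_{nk} &= \phi_k(\mathbf{x}_n) \\
\mathbb{E}[\mathbf{y}] &= \mathbf{0}, &
\operatorname{cov}[\mathbf{y}] &= \alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top} = \mathbf{K} \\
K_{nm} &= k(\mathbf{x}_n, \mathbf{x}_m)
        = \alpha^{-1}\boldsymbol{\phi}(\mathbf{x}_n)^{\top}\boldsymbol{\phi}(\mathbf{x}_m)
\end{align}
```

Because y is a linear function of the Gaussian vector w, it is itself Gaussian, with zero mean and covariance given by the Gram matrix K.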

 Gaussian process is defined as a probability
distribution over functions y(x) such that the set
of values y(x) evaluated at an arbitrary set of
points x1,...,xN jointly have a Gaussian
distribution.
 p(y(x1),...,y(xN)) is Gaussian.
 Often the mean is set to zero
 Non-informative prior
 Then the kernel function fully defines the GP.
 Gaussian kernel:
 Exponential Kernel:
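
The kernel formulas are missing from the extracted slide. A minimal sketch, assuming the two kernels are the ones plotted in PRML Sec. 6.4 (Gaussian: exp(-||x - x'||² / 2σ²); exponential: exp(-θ||x - x'||)). The parameter names `sigma`, `theta`, and `jitter` are illustrative choices, not taken from the slides. With a zero mean, drawing from the GP prior reduces to sampling a multivariate Gaussian whose covariance is the kernel matrix.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for every pair of rows."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def exponential_kernel(a, b, theta=1.0):
    """k(a, b) = exp(-theta * ||a - b||) for every pair of rows."""
    dist = np.sqrt(np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1))
    return np.exp(-theta * dist)

def sample_gp_prior(x, kernel, n_samples=3, jitter=1e-8, seed=0):
    """Draw functions from a zero-mean GP prior, evaluated at the rows of x."""
    rng = np.random.default_rng(seed)
    K = kernel(x, x) + jitter * np.eye(len(x))  # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(x)), K, size=n_samples)

x = np.linspace(-1.0, 1.0, 20)[:, None]
smooth_samples = sample_gp_prior(x, gaussian_kernel)
rough_samples = sample_gp_prior(x, exponential_kernel)
```

Samples under the Gaussian kernel are smooth, while the exponential kernel gives rougher sample paths.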

 Predict outputs with noise
(Figure: graphical model relating input x, function value y, noise e, and observed target t)
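
A minimal sketch of GP regression with noisy targets, assuming the standard predictive equations from PRML Sec. 6.4.2: with C = K + σ²I, the predictive mean is kᵀC⁻¹t and the predictive variance is c - kᵀC⁻¹k. The function names and the `noise_var` parameter are illustrative, not taken from the slides.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel between every pair of rows of a and b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def gp_predict(x_train, t_train, x_test, kernel, noise_var=0.1):
    """Return predictive mean and variance of the noisy target at each test point."""
    C = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_star = kernel(x_train, x_test)                   # shape (N, M)
    mean = k_star.T @ np.linalg.solve(C, t_train)      # k^T C^{-1} t
    c = np.diag(kernel(x_test, x_test)) + noise_var    # prior variance plus noise
    var = c - np.sum(k_star * np.linalg.solve(C, k_star), axis=0)
    return mean, var

x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
t_train = np.sin(x_train).ravel()
x_test = np.array([[1.5], [100.0]])
mean, var = gp_predict(x_train, t_train, x_test, gaussian_kernel, noise_var=0.01)
```

Far from the training data (the second test point), the mean falls back to the zero prior mean and the variance returns to the prior variance plus noise.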

 PMF can be seen as a Gaussian Process with latent variables
(GP-LVM) [Lawrence & Urtasun ICML 2009]
Generalized matrix approximation model
Y = f⁻¹(X) follows a multivariate Gaussian distribution
A Gaussian prior is set on U
Probabilistic PCA model by
Tipping & Bishop (1999)
Non-linear version
Mapping
back to X

 GP model for each task
 A single model for all tasks

 Known as the Kronecker product of two
matrices (e.g., numpy.kron(a, b))
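
A small illustration of the Kronecker product with `numpy.kron`. The matrices here are made-up toy values, and reading the result as a joint covariance over (task, user) pairs is an assumption about how the single model for all tasks combines the task similarity matrix T with a per-task kernel matrix K.

```python
import numpy as np

# Toy task similarity matrix for 2 tasks (hypothetical values).
T = np.array([[1.0, 0.8],
              [0.8, 1.0]])

# Toy kernel matrix over 3 users (hypothetical values).
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

# Block (s, t) of the result equals T[s, t] * K.
joint = np.kron(T, K)
cross_task_block = joint[0:3, 3:6]  # covariance between task 0 and task 1
```

If T were the identity matrix, the off-diagonal blocks would vanish and the tasks would decouple into independent GP models.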

 Each task might have a different rating
distribution.
 c, α, b are parameters that must be estimated
from the data.
 We can relax the constraint α > 0 if we have
no prior knowledge regarding the negativity
of the skewness of the rating distribution.

 Similar to GPR prediction
 Predicting y= g(x)
 Predicting x

 Compute the likelihood of the dataset
 Use Stochastic Gradient Descent for
optimization
 Non-convex optimization
 Sensitive to initial conditions
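
To make the sensitivity to initial conditions concrete, here is a sketch of SGD on a much simpler stand-in objective: plain low-rank matrix factorization with squared loss. This is not the paper's GP likelihood. The toy ratings, learning rate, and all other hyperparameters are hypothetical.

```python
import numpy as np

def sgd_matrix_factorization(X, mask, k=2, lr=0.02, reg=0.01, epochs=200, seed=0):
    """Fit X ~ U @ V.T on the observed entries (mask == True) by SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_items = X.shape
    U = 0.1 * rng.standard_normal((n_users, k))  # random initialization
    V = 0.1 * rng.standard_normal((n_items, k))
    observed = np.argwhere(mask)
    for _ in range(epochs):
        rng.shuffle(observed)                    # visit entries in random order
        for i, j in observed:
            err = X[i, j] - U[i] @ V[j]
            u_old = U[i].copy()
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
    return U, V

def observed_mse(X, mask, U, V):
    """Mean squared error on the observed entries only."""
    return float(np.mean((X - U @ V.T)[mask] ** 2))

X = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 2.0, 1.0],
              [1.0, 2.0, 5.0, 4.0],
              [1.0, 1.0, 4.0, 5.0]])
mask = np.array([[1, 1, 0, 1],
                 [1, 0, 1, 1],
                 [0, 1, 1, 1],
                 [1, 1, 1, 0]], dtype=bool)

U_a, V_a = sgd_matrix_factorization(X, mask, seed=0)
U_b, V_b = sgd_matrix_factorization(X, mask, seed=1)
```

Different seeds generally end at different factors `U` and `V`, since the objective is non-convex and is unchanged under invertible transformations of the latent space.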

 Setting
 Use each dataset and predict multiple items
 Datasets
 MovieLens
▪ 100000 ratings, 1-5 scale ratings, 943 users, 1682 movies, 5
popular genres
 Book-Crossing
▪ 56148 ratings, 1-10 scale, 28503 users, 9909 books, 4 most
general Amazon book categories
 Douban
▪ A social network-based recommendation service
▪ 10000 users, 200000 items
▪ Movies, books, music

 Evaluation measure
 Mean Absolute Error (MAE)
 Baselines
 I-GP: Independent Link Prediction using GP
 CMF: Collective matrix factorization
▪ non GP, classical NMF
 M-GP: Joint Link prediction using multi-relational GP
▪ Does not consider the similarity between tasks
 Proposed method = CLP-GP
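
For reference, the evaluation measure above (MAE) is the average absolute difference between predicted and true ratings. A minimal sketch with made-up numbers (the function name is illustrative):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all rated (user, item) pairs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical ratings and predictions: errors are 0.5, 0.0 and 1.0.
mae = mean_absolute_error([4, 3, 5], [3.5, 3.0, 4.0])
```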

Note: (1) Smaller values are better.
(2) (+) and (-) denote models with and without the link function.

 Romance and Drama are very similar
 Action and Comedy are very dissimilar

 Pros:
 Elegant model and well-written paper
 Few parameters (latent space dimension k)
need to be specified
 All other parameters can be learnt
 Applicable to a wide range of tasks
 Cons:
 Computational complexity
▪ Predictions require kernel matrix inversion
▪ SGD updates might not converge
▪ The problem is non-convex...
