Transfer Learning for Collective Link Prediction

Transcript

  • 1. Transfer Learning for Collective Link Prediction in Multiple Heterogeneous Domains
    Cao et al., ICML 2010
    Presented by Danushka Bollegala.
  • 2. Link Prediction
    Predict links (relations) between entities
    Recommend items for users (MovieLens, Amazon)
    Recommend users for users (social recommendation)
    Similarity search (suggest similar web pages)
    Query suggestion (suggest related queries by other users)
    Collective Link Prediction (CLP)
    Perform multiple prediction tasks for the same set of users simultaneously
    Predict/recommend multiple item types (books and movies)
    Pros
    Prediction tasks might not be independent; one can benefit from another (books vs. movies vs. food)
    Less affected by data sparseness (cold start problem)
  • 3. Link prediction = matrix factorization
    Probabilistic Principal Component Analysis (PPCA) (Tipping & Bishop, 1999; PRML Chapter 12)
    → Probabilistic non-linear matrix factorization (Lawrence & Urtasun, ICML 2009)
    → add a task similarity matrix T, modeled with Gaussian Process Regression (GPR) (PRML Sec. 6.4)
    → Transfer Learning + Collective Link Prediction (this paper)
  • 4. Link Modeling via NMF
    Link matrix X (x_ij is the rating given by user i to item j)
    x_ij is modeled by f(u_i, v_j, ε)
    f: link function
    u_i: latent representation of user i
    v_j: latent representation of item j
    ε: noise term
    Generalized matrix approximation: X = f(UVᵀ + E)
    Assumption: E is Gaussian noise N(0, σ²I)
    Use Y = f⁻¹(X)
    Then Y follows a multivariate Gaussian distribution.
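    A minimal numpy sketch of this generalized matrix approximation, with illustrative dimensions and tanh as a stand-in link function (the paper learns a parametric f per task; see slide 15):

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, k = 100, 50, 5      # illustrative sizes; k = latent dimension
U = rng.normal(size=(n_users, k))     # latent user representations u_i (rows)
V = rng.normal(size=(n_items, k))     # latent item representations v_j (rows)
E = rng.normal(scale=0.1, size=(n_users, n_items))  # Gaussian noise N(0, sigma^2 I)

Y = U @ V.T + E                       # Gaussian layer: Y = U V^T + E

# Elementwise link function maps the Gaussian layer to observed ratings.
# tanh is a stand-in, NOT the paper's learned link.
X = np.tanh(Y)

# Recover the Gaussian layer: Y = f^{-1}(X) (clip to keep arctanh finite).
Y_rec = np.arctanh(np.clip(X, -1 + 1e-9, 1 - 1e-9))
```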
  • 5. Gaussian Process Regression
    Revision (PRML Section 6.4)
  • 6. Functions as Vectors
    We can view a function as an infinite dimensional vector
    f ↦ (f(x₁), f(x₂), ...)ᵀ
    Each point in the domain is mapped by f to a dimension in the vector
    In machine learning we must find functions (e.g. linear predictors) that map input values to their corresponding output values
    We must also avoid over-fitting
    This can be visualized as sampling from a distribution over functions with certain properties
    Preference bias (cf. restriction bias)
  • 7. Gaussian Process (GP) (1/2)
    Linear regression model: y(x) = wᵀφ(x)
    We get different output functions y for different weight vectors w.
    Let us impose a Gaussian prior over w: p(w) = N(w | 0, α⁻¹I)
    Training dataset: {(x₁,y₁),...,(x_N,y_N)}
    Targets: y = (y₁,...,y_N)ᵀ = Φw
    Φ: design matrix, with Φ_nk = φ_k(x_n)
  • 8. Gaussian Process (2/2)
    When we impose a Gaussian prior over the weight vector, the target vector y is also Gaussian: y ~ N(0, K)
    K: kernel matrix (Gram matrix), K = α⁻¹ΦΦᵀ
    k: kernel function, K_nm = k(x_n, x_m) = α⁻¹φ(x_n)ᵀφ(x_m)
  • 9. Gaussian Process: Definition
    A Gaussian process is defined as a probability distribution over functions y(x) such that the set of values y(x) evaluated at an arbitrary set of points x₁,...,x_N jointly has a Gaussian distribution.
    p(y(x₁),...,y(x_N)) is Gaussian.
    Often the mean is set to zero (a non-informative prior).
    Then the kernel function fully defines the GP.
    Gaussian kernel: k(x, x′) = exp(-‖x - x′‖² / (2σ²))
    Exponential kernel: k(x, x′) = exp(-θ|x - x′|)
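    As a quick illustration of "the kernel fully defines the GP", here is a small sketch that draws sample functions from a zero-mean GP prior with the Gaussian kernel (the grid and kernel width are illustrative choices):

```python
import numpy as np

def gaussian_kernel(xa, xb, sigma=1.0):
    # k(x, x') = exp(-(x - x')^2 / (2 sigma^2)) for 1-D inputs
    d = xa[:, None] - xb[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)             # evaluation points x_1,...,x_N
K = gaussian_kernel(x, x)             # Gram matrix K_nm = k(x_n, x_m)

# y(x_1),...,y(x_N) are jointly Gaussian: draw 3 functions from y ~ N(0, K).
# The small jitter keeps K numerically positive definite.
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```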
  • 10. Gaussian Process Regression (GPR)
    Predict outputs with noise: t_n = y(x_n) + ε_n, where ε_n ~ N(0, β⁻¹)
    Predictive distribution: p(t_{N+1} | t) = N(kᵀC⁻¹t, c - kᵀC⁻¹k), with C = K + β⁻¹I
    [Figure: noisy targets t scattered around the latent function y(x)]
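    A sketch of these textbook GPR predictive equations (the kernel width and noise precision β are illustrative values, not values from the paper):

```python
import numpy as np

def rbf(xa, xb, sigma=1.0):
    d = xa[:, None] - xb[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def gp_predict(x_train, t_train, x_test, sigma=1.0, beta=25.0):
    # C = K + beta^{-1} I, where beta is the noise precision
    C = rbf(x_train, x_train, sigma) + np.eye(len(x_train)) / beta
    k = rbf(x_train, x_test, sigma)           # cross-covariances k(x_n, x*)
    c = rbf(x_test, x_test, sigma) + np.eye(len(x_test)) / beta
    C_inv = np.linalg.inv(C)
    mean = k.T @ C_inv @ t_train              # predictive mean  k^T C^{-1} t
    cov = c - k.T @ C_inv @ k                 # predictive cov   c - k^T C^{-1} k
    return mean, cov

# Toy usage: noisy samples of sin(x), predictions on a dense grid.
rng = np.random.default_rng(0)
x_tr = rng.uniform(0, 5, size=20)
t_tr = np.sin(x_tr) + rng.normal(scale=0.2, size=20)
mean, cov = gp_predict(x_tr, t_tr, np.linspace(0, 5, 100))
```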
  • 11. Probabilistic Matrix Factorization
    PMF can be seen as a Gaussian process with latent variables (GP-LVM) [Lawrence & Urtasun, ICML 2009]
    Generalized matrix approximation model: X = f(UVᵀ + E)
    Y = f⁻¹(X) follows a multivariate Gaussian distribution
    A Gaussian prior is set on U and marginalized out
    With a linear kernel this recovers the probabilistic PCA model of Tipping & Bishop (1999)
    A non-linear kernel over the latent variables gives the non-linear version
    The link function f maps Y back to X
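    A small numerical check of the marginalization step, under the assumption that each user latent vector has prior u ~ N(0, I): integrating u out of y = Vu + e yields y ~ N(0, VVᵀ + σ²I), i.e., a GP over items with a linear kernel in the latent item coordinates (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m_items, k, sigma = 8, 3, 0.1
V = rng.normal(size=(m_items, k))       # fixed item latent matrix

# Monte Carlo: draw many users u ~ N(0, I), form y = V u + e, and compare
# the empirical covariance of y with the analytic V V^T + sigma^2 I.
n_draws = 200_000
U = rng.normal(size=(n_draws, k))
E = rng.normal(scale=sigma, size=(n_draws, m_items))
Y = U @ V.T + E

empirical = np.cov(Y, rowvar=False)
analytic = V @ V.T + sigma**2 * np.eye(m_items)
print(np.max(np.abs(empirical - analytic)))   # close to 0 (sampling error only)
```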
  • 12. Ratings are not Gaussian!
  • 13. Collective Link Prediction
    Naive approach: a separate GP model for each task
    This paper: a single joint model for all tasks, coupled through the task similarity matrix T
  • 14. Tensor Product
    Known as the Kronecker product for two matrices (e.g., numpy.kron(a, b))
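    A sketch of how the Kronecker product combines kernels in a multi-task GP: the covariance over (task, item) pairs is the product of a task-similarity matrix T and an item kernel K (both matrices below are made up for illustration):

```python
import numpy as np

T = np.array([[1.0, 0.7],
              [0.7, 1.0]])              # task similarity matrix (2 tasks)
K = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.5],
              [0.1, 0.5, 1.0]])         # item kernel (3 items)

# Joint covariance over all (task, item) pairs: entry ((s,i),(t,j)) equals
# T[s,t] * K[i,j], so similar tasks share statistical strength.
K_joint = np.kron(T, K)
print(K_joint.shape)                    # (6, 6)
```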
  • 15. Generalized Link Functions
    Each task might have a different rating distribution.
    c, α, b are parameters that must be estimated from the data.
    The constraint α > 0 can be relaxed if we have no prior knowledge about the sign of the skewness of the rating distribution.
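    The exact parametric family is given in the paper; as a clearly hypothetical stand-in that reuses the same parameter names (c, α, b), here is one monotone link whose inverse is cheap to evaluate:

```python
import numpy as np

def link(y, c=1.0, alpha=1.0, b=0.0):
    # Hypothetical monotone link, NOT the paper's exact form:
    # b shifts (inducing skew through the nonlinearity), alpha stretches, c scales.
    return c * np.sinh(alpha * y + b)

def link_inv(x, c=1.0, alpha=1.0, b=0.0):
    # Inverse link: maps observed ratings back to the Gaussian layer.
    return (np.arcsinh(x / c) - b) / alpha
```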
  • 16. Predictive distribution
    Similar to GPR prediction
    Predicting y = g(x)
    Predicting x
  • 17. Parameter Estimation
    Compute the likelihood of the dataset
    Use Stochastic Gradient Descent for optimization
    Non-convex optimization
    Sensitive to initial conditions
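    A generic SGD skeleton for this step (the gradient callback grad_fn is a hypothetical placeholder, not the paper's implementation); since the objective is non-convex, results depend on the initial θ, so several random restarts are advisable:

```python
import numpy as np

def sgd(theta, data, grad_fn, lr=0.01, epochs=10, seed=0):
    # grad_fn(theta, example) -> gradient of the negative log-likelihood
    # contributed by one observed rating.
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):   # visit ratings in random order
            theta = theta - lr * grad_fn(theta, data[i])
    return theta
```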
  • 18. Experiments
    Setting
    For each dataset, predict multiple item types for the same set of users
    Datasets
    MovieLens
    100,000 ratings on a 1-5 scale, 943 users, 1,682 movies, 5 popular genres
    Book-Crossing
    56,148 ratings on a 1-10 scale, 28,503 users, 9,909 books, 4 most general Amazon book categories
    Douban
    A social network-based recommendation service
    10,000 users, 200,000 items
    Movies, books, music
  • 19. Evaluation
    Evaluation measure
    Mean Absolute Error (MAE; see the sketch after this slide)
    Baselines
    I-GP: Independent Link Prediction using GP
    CMF: Collective matrix factorization
    non-GP, classical NMF
    M-GP: Joint Link prediction using multi-relational GP
    Does not consider the similarity between tasks
    Proposed method = CLP-GP
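    A minimal sketch of the MAE metric mentioned above:

```python
import numpy as np

def mae(predicted, actual):
    # Mean Absolute Error over the rated (observed) entries.
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.mean(np.abs(predicted - actual))

print(mae([4.2, 3.1, 5.0], [4, 3, 4]))   # 0.4333...
```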
  • 20. Results
    [Table: MAE of I-GP, CMF, M-GP, and CLP-GP on the three datasets]
    Note: (1) smaller values are better; (2) (+)/(-) marks with/without the link function.
  • 21. Total data sparseness
    [Plot: prediction error as a function of total data sparseness; lower is better]
  • 22. Target task data sparseness
  • 23. Task similarity matrix (T)
    Romance and Drama are very similar
    Action and Comedy are very dissimilar
  • 24. My Comments
    Elegant model and well-written paper
    Only a few parameters (the latent space dimension k) need to be specified
    All other parameters can be learnt
    Applicable to a wide range of tasks
    Cons:
    Computational complexity
    Predictions require kernel matrix inversion
    SGD updates might not converge
    The problem is non-convex...