Recommender Systems in the Linked Data era

The ultimate goal of a recommender system is to suggest interesting, non-obvious items (e.g., products to buy, people to connect with, movies to watch) to users, based on their preferences.
The advent of the Linked Open Data (LOD) initiative in the Semantic Web gave birth to a variety of open knowledge bases freely accessible on the Web. They provide a valuable source of information that can improve conventional recommender systems, if properly exploited.
Here I present several approaches to recommender systems that leverage Linked Data knowledge bases such as DBpedia. In particular, content-based and hybrid recommendation algorithms will be discussed.
For full details about the presented approaches please refer to the full papers mentioned in this presentation.

Published in: Technology


  1. 1. Recommender Systems in the Linked Data era ROBERTO MIRIZZI, PHD roberto.mirizzi@gmail.com
  2. 2. Outline. What is a Recommender System? (a definition; types). What is Linked Data? (LOD; DBpedia). Some Recommender Systems (RS): a content-based RS (memory-based); a mobile content-based RS (memory-based); a content-based RS (model-based); a hybrid RS (model-based).
  3. 3. What is a Recommender System?
  4. 4. Recommender Systems in the Linked Data Era – HP Labs, Palo Alto, CA 7/12/2013. What is a Recommender System? Recommender Systems (RSs) are software tools and techniques providing suggestions for items to be of use to a user. [F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors. Recommender Systems Handbook. Springer, 2011.] Input data: a set of users $U = \{u_1, \ldots, u_M\}$; a set of items $I = \{i_1, \ldots, i_N\}$; the preference matrix $R = [r_{u,i}]$. Problem definition: given a user $u$ and a target item $i$, predict the preference $r_{u,i}$.
  5. 5. Recommender Systems: types. Content-based (CB): recommendations are based on the assumption that if in the past a user liked a set of items with particular features (e.g., animation, fairytale, ogre, castle), they will likely go for items having similar characteristics. Collaborative-filtering (CF): recommendations are based on the assumption that users having a similar history are more likely to have similar tastes/needs. Hybrid: recommendations combine the content-based and collaborative-filtering approaches.
  6. 6. What is Linked Data?
  7. 7. What is Linked Data? A collection of interrelated datasets on the Web. Principles: 1. Use HTTP URIs to identify things. 2. Leverage standards such as RDF and SPARQL to provide information about things. 3. Link related things by relationships. [http://linkeddata.org/]
  9. 9. DBpedia: a Nucleus for a Web of Open Data (http://dbpedia.org). DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. [Auer et al., DBpedia: A Nucleus for a Web of Open Data. ISWC+ASWC 2007] [Bizer et al., A crystallization point for the Web of Data. Journal Web Semantics, 2009]
  10. 10. Querying DBpedia: SPARQL. DBpedia exposes a SPARQL endpoint (http://dbpedia.org/sparql) to query the dataset. Results can be provided in several formats (e.g., JSON, XML, N-Triples). SPARQL is an RDF query language. Its queries consist of triple patterns, conjunctions, disjunctions and optional patterns.
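As a rough illustration of the querying step, the sketch below builds (but does not send) a request URL for the endpoint named on the slide. The query itself, the film resource URI, the `dbo:starring` and `dbo:birthDate` properties, and the `format` parameter are assumptions made for the example and may not match what the live endpoint expects. It combines a plain triple pattern with an optional pattern.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide

# Hypothetical query: who stars in a given film. The resource and property
# URIs below are illustrative assumptions, not taken from the slides.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?actor ?born WHERE {
  <http://dbpedia.org/resource/Ocean's_Eleven> dbo:starring ?actor .
  OPTIONAL { ?actor dbo:birthDate ?born . }
}
LIMIT 20
"""


def build_request_url(query, result_format="application/sparql-results+json"):
    """Return a GET URL carrying the query; the 'format' key is an assumption."""
    params = {"query": query.strip(), "format": result_format}
    return ENDPOINT + "?" + urlencode(params)


url = build_request_url(QUERY)
# Decode the URL again to check that the query survives the round trip.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Keeping URL construction separate from the network call makes the query easy to inspect before it is sent with whatever HTTP client is at hand.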
  11. 11. A graph of knowledge. [Figure: a DBpedia graph connecting Ocean’s Eleven and Ocean’s Twelve to George Clooney, Brad Pitt, Steven Soderbergh and Catherine Zeta-Jones, and to the categories 2000s crime films, American criminal comedy films, Crime films and Crime.] Why don’t we use all this information to foster recommender systems?
  12. 12. A graph of knowledge. [Figure: the DBpedia graph around Ocean’s Eleven and Ocean’s Twelve (George Clooney, Brad Pitt, Steven Soderbergh, Catherine Zeta-Jones; 2000s crime films, American criminal comedy films, Crime films, Crime), now with two "likes" edges added.] Why don’t we use all this information to foster recommender systems?
  13. 13. A content-based RS (memory-based)
  14. 14. The good old Vector Space Model. The Vector Space Model is an algebraic model for representing both text documents and queries as vectors of index-term weights $w_{t,d}$ that are positive and non-binary: $v_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$, with $w_{t,d} = tf_{t,d} \cdot idf_t$, $tf_{t,d} = \frac{n_{t,d}}{\sum_k n_{k,d}}$ and $idf_t = \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$. The similarity between a document $d_j$ and a query $q$ is the cosine of the angle between their vectors: $sim(d_j, q) = \frac{d_j \cdot q}{\|d_j\| \, \|q\|} = \frac{\sum_{i=1}^{N} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{N} w_{i,q}^2}}$. [http://en.wikipedia.org/wiki/File:Vector_space_model.jpg]
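The TF-IDF weighting and cosine similarity of the Vector Space Model can be sketched in a few lines. The three toy documents and their terms are made up for illustration; only the formulas follow the slide.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs maps a document name to its list of terms.

    Returns a sparse vector {term: tf * idf} for every document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per document
    vectors = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        total = sum(counts.values())
        vectors[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus.
docs = {
    "d1": ["crime", "heist", "casino"],
    "d2": ["crime", "heist", "europe"],
    "d3": ["ogre", "castle", "fairytale"],
}
vectors = tf_idf_vectors(docs)
print(cosine(vectors["d1"], vectors["d2"]))  # shares two terms with d1
print(cosine(vectors["d1"], vectors["d3"]))  # shares no term with d1
```

Storing vectors as dictionaries keeps them sparse, which matters once the vocabulary grows far beyond the terms present in any single document.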
  15. 15. Semantic Vector Space Model (i). Each item is expressed as a tensor in a multi-dimensional space where each dimension corresponds to a specific property of the considered datasets (e.g., starring, subject/broader, director, genre, …). [Figure: Ocean’s Eleven and Ocean’s Twelve linked through the properties starring, director, subject/broader and genre to George Clooney, Brad Pitt, Catherine Zeta-Jones, Steven Soderbergh, 2000s crime films, Crime films, American criminal… and Crime.]
  16. 16. Semantic Vector Space Model (ii). For the starring property, the rows of the matrix are the movies Ocean’s Eleven [o11] (13 actors) and Ocean’s Twelve [o12] (15 actors), and the columns are the actors George Clooney [gc] (38 movies), Catherine Z. Jones [czj] (22 movies) and Brad Pitt [bp] (35 movies). Each entry is a TF-IDF weight, $w_{actor_x, movie_y} = tf_{actor_x, movie_y} \cdot idf_{actor_x}$, which gives the vector $(w_{gc,o11}, w_{czj,o11}, w_{bp,o11})$ for Ocean’s Eleven and $(w_{gc,o12}, w_{czj,o12}, w_{bp,o12})$ for Ocean’s Twelve. We can now compute the scalar product between the two vectors to get their similarity…
  17. 17. Semantic Vector Space Model (iii). $sim^{starring}(o12, o11) = \frac{w_{gc,o12} \, w_{gc,o11} + w_{czj,o12} \, w_{czj,o11} + w_{bp,o12} \, w_{bp,o11}}{\sqrt{w_{gc,o12}^2 + w_{czj,o12}^2 + w_{bp,o12}^2} \cdot \sqrt{w_{gc,o11}^2 + w_{czj,o11}^2 + w_{bp,o11}^2}}$ …and then combine all the similarities for each property: $sim(o12, o11) = \alpha_{starring} \cdot sim^{starring}(o12, o11) + \alpha_{director} \cdot sim^{director}(o12, o11) + \alpha_{subject} \cdot sim^{subject}(o12, o11)$. Soon we will see how to compute the $\alpha_p$ coefficients.
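A minimal sketch of the per-property similarity and its weighted combination follows. All weights are set to 1.0 and the property weights are arbitrary; both are placeholders chosen for the example, not values from the slides, and the assignment of resources to the two films is likewise illustrative.

```python
import math


def property_similarity(weights_a, weights_b):
    """Cosine similarity between two {resource: weight} vectors of one property."""
    dot = sum(w * weights_b.get(res, 0.0) for res, w in weights_a.items())
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(item_a, item_b, property_weights):
    """Weighted sum of the per-property similarities."""
    return sum(
        weight * property_similarity(item_a.get(prop, {}), item_b.get(prop, {}))
        for prop, weight in property_weights.items()
    )


# Placeholder data: every weight is 1.0 purely for illustration.
o11 = {
    "starring": {"gc": 1.0, "bp": 1.0},
    "director": {"ss": 1.0},
    "subject": {"2cf": 1.0, "acc": 1.0},
}
o12 = {
    "starring": {"gc": 1.0, "czj": 1.0, "bp": 1.0},
    "director": {"ss": 1.0},
    "subject": {"2cf": 1.0, "cf": 1.0},
}
property_weights = {"starring": 0.5, "director": 0.3, "subject": 0.2}  # assumed

starring_sim = property_similarity(o12["starring"], o11["starring"])
overall_sim = combined_similarity(o12, o11, property_weights)
print(starring_sim, overall_sim)
```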
  18. 18. Ready for our first Content-based RS. Given a user profile, defined as: $profile(u) = \{\langle m_j, r_j \rangle \mid r_j = 1 \text{ if } u \text{ likes } m_j, \; r_j = -1 \text{ otherwise}\}$ or, when the ratings are not binary, as the set of pairs $\langle m_j, r_j \rangle$ with $r_j$ the rating given by $u$ to $m_j$, we predict the rating using a Nearest Neighbor Classifier (memory-based): $r(u, m_i) = \frac{\sum_{m_j \in profile(u)} r_j \cdot \frac{\sum_p \alpha_p \cdot sim^p(m_j, m_i)}{P}}{|profile(u)|}$, where the similarity measure is a linear combination of local similarities and $P$ is the number of properties. [Tommaso Di Noia, Roberto Mirizzi, Vito Claudio Ostuni, Davide Romito, Markus Zanker. Linked Open Data to support Content-based Recommender Systems. 8th International Conference on Semantic Systems (I-SEMANTICS 2012) – best paper]
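The nearest-neighbor prediction can be sketched as below, assuming the per-property similarities have already been computed and stored in a lookup table. The movie names, similarity values and property weights are invented for the example.

```python
def predict_rating(profile, candidate, sims, property_weights):
    """Average, over the rated items, of rating times combined similarity.

    profile: {item: rating}, with +1 for liked and -1 for not liked.
    sims: {property: {(rated_item, candidate): similarity}}.
    property_weights: {property: weight}.
    """
    if not profile:
        return 0.0
    n_props = len(property_weights)
    total = 0.0
    for rated_item, rating in profile.items():
        combined = sum(
            weight * sims[prop].get((rated_item, candidate), 0.0)
            for prop, weight in property_weights.items()
        ) / n_props
        total += rating * combined
    return total / len(profile)


# Hypothetical profile and similarities.
profile = {"o11": 1, "shrek": -1}
sims = {
    "starring": {("o11", "o12"): 0.8, ("shrek", "o12"): 0.0},
    "director": {("o11", "o12"): 1.0, ("shrek", "o12"): 0.0},
}
property_weights = {"starring": 0.5, "director": 0.5}

score = predict_rating(profile, "o12", sims, property_weights)
print(score)
```

A positive score means the candidate resembles liked items more than disliked ones, so candidates can simply be sorted by this value.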
  19. 19. How do we compute the $\alpha_p$ coefficients? We need to identify the best possible values for the coefficients $\alpha_p$, that is, the weights associated with each property. There are plenty of ways to do that. Depending on the nature of the user ratings (Likert or binary), we can treat rating prediction as a regression problem (linear regression) or as a classification problem (logistic regression), and minimize a loss function $J$. In the former case we can minimize the least-squares loss, and in the latter the cross-entropy loss. In both cases we can use gradient descent: $\alpha_p \leftarrow \alpha_p - \eta \frac{\partial J}{\partial \alpha_p}$, where $\eta$ is the learning rate. Another possible approach is to use a genetic algorithm to minimize a non-smooth loss function, such as the number of misclassifications.
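For the regression case, a batch gradient-descent loop over a least-squares loss might look like the sketch below. Each training row holds the per-property similarities of one example and the target is the known rating; the data, learning rate and number of epochs are assumptions chosen so that the toy problem converges.

```python
def fit_property_weights(rows, targets, learning_rate=0.5, epochs=500):
    """Minimise J = 1/(2n) * sum((prediction - target)^2) by gradient descent."""
    n_rows, n_props = len(rows), len(rows[0])
    weights = [0.0] * n_props
    for _ in range(epochs):
        errors = [
            sum(w * x for w, x in zip(weights, row)) - target
            for row, target in zip(rows, targets)
        ]
        gradient = [
            sum(err * row[p] for err, row in zip(errors, rows)) / n_rows
            for p in range(n_props)
        ]
        # One update per coefficient: step against the gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Toy data generated from the weights (0.7, 0.3), which the fit should recover.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
targets = [0.7, 0.3, 1.0, 0.41]

learned = fit_property_weights(rows, targets)
print(learned)
```

For binary ratings the same loop applies with a sigmoid on the prediction and the cross-entropy gradient in place of the squared-error one.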
  20. 20. A mobile content-based RS (memory-based)
  21. 21. Let’s go Mobile (e.g., recommend movies in theaters). This time the user profile is context-dependent and is defined as: $profile(u, cmp) = \{\langle m_j, r_j \rangle \mid r_j = 1 \text{ if } u \text{ likes } m_j \text{ with companion } cmp, \; r_j = -1 \text{ otherwise}\}$. The prediction is made of two parts, contextual pre-filtering and contextual post-filtering: $r(u, m_i, cmp) = \beta_{preFilter} \cdot r_{preFilter}(u, m_i, cmp) + \beta_{postFilter} \cdot r_{postFilter}(u)$, with $r_{preFilter}(u, m_i, cmp) = \frac{\sum_{m_j \in profile(u, cmp)} r_j \cdot sim(m_j, m_i)}{|profile(u, cmp)|}$ and $r_{postFilter}(u) = \frac{h + c + cl + ar + ap}{5}$, where: h (hierarchy) is 1 if the theater is in the same city, 0 otherwise; c (cluster) is 1 if the theater is a multiplex, 0 otherwise; cl (co-location) is 1 if the theater is close to other POIs, 0 otherwise; ar (association-rule) is 1 if the ticket price is known, 0 otherwise; ap (anchor-point proximity) is 1 if the theater is close to the user’s home or office, 0 otherwise. [Vito Claudio Ostuni, Giosia Gentile, Tommaso Di Noia, Roberto Mirizzi, Davide Romito, Eugenio Di Sciascio. Mobile Movie Recommendations with Linked Data. Human-Computer Interaction & Knowledge Discovery @ CD-ARES’13 (HCI-KDD 2013)]
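The two-part contextual prediction can be sketched as follows. The equal mixing weights, the similarity table and the theater flags are all assumptions made for the example; the slide does not state how the two parts are weighted.

```python
def pre_filter_score(profile, candidate, sim):
    """Similarity-weighted average over the items rated with this companion."""
    if not profile:
        return 0.0
    total = sum(rating * sim(item, candidate) for item, rating in profile.items())
    return total / len(profile)


def post_filter_score(h, c, cl, ar, ap):
    """Mean of the five binary theater criteria."""
    return (h + c + cl + ar + ap) / 5


def contextual_score(profile, candidate, sim, theater_flags,
                     pre_weight=0.5, post_weight=0.5):
    """Combine both parts; the two weights are illustrative assumptions."""
    return (pre_weight * pre_filter_score(profile, candidate, sim)
            + post_weight * post_filter_score(**theater_flags))


# Hypothetical data: ratings given when watching with friends.
profile_with_friends = {"o11": 1, "shrek": -1}
similarity_table = {("o11", "o12"): 0.8, ("shrek", "o12"): 0.1}


def sim(item_a, item_b):
    return similarity_table.get((item_a, item_b), 0.0)


theater_flags = {"h": 1, "c": 1, "cl": 0, "ar": 1, "ap": 0}

pre = pre_filter_score(profile_with_friends, "o12", sim)
post = post_filter_score(**theater_flags)
final = contextual_score(profile_with_friends, "o12", sim, theater_flags)
print(pre, post, final)
```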
  22. 22. A content-based RS (model-based)
  23. 23. Time for a Model-based CB-RS. This time each item is represented by a feature vector, where each feature corresponds to a property value. The features are grouped by property: starring (George Clooney [gc], Catherine Z. Jones [czj], Brad Pitt [bp]), director (Steven Soderbergh [ss]) and subject (2000s crime films [2cf], Crime films [cf], American criminal comedy [acc]). Ocean’s Eleven [o11] is represented by $(w_{gc,o11}, w_{czj,o11}, w_{bp,o11}, w_{ss,o11}, w_{2cf,o11}, w_{cf,o11}, w_{acc,o11})$ and Ocean’s Twelve [o12] by $(w_{gc,o12}, w_{czj,o12}, w_{bp,o12}, w_{ss,o12}, w_{2cf,o12}, w_{cf,o12}, w_{acc,o12})$. The user profile is defined as: $profile(u) = \{\langle m_j, r_j \rangle \mid r_j = 1 \text{ if } u \text{ likes } m_j, \; r_j = -1 \text{ otherwise}\}$. [Tommaso Di Noia, Roberto Mirizzi, Vito Claudio Ostuni, Davide Romito. Exploiting the Web of Data in Model-based Recommender Systems. 6th ACM Conference on Recommender Systems (RecSys 2012)]
  24. 24. Training the system with an SVM classifier. Support Vector Machines (SVMs) are known to work well for text classification. Our problem of learning the user profile has a lot in common with it, such as the sparse nature of the feature vectors and the high dimensionality of the input space. Main advantages: 1. Feature selection is often not needed (SVMs are robust to over-fitting and scale up well). 2. No need to tune parameters as in the previous approach. We then fit a logistic model to the SVM output to obtain a ranked list of items. [https://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png]
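To make the idea concrete, here is a deliberately small sketch: a linear SVM trained by subgradient descent on the regularised hinge loss, followed by a sigmoid that maps decision values to scores used for ranking. This is a simplification and not the authors' setup: the sigmoid here is not fitted to the SVM output (a fitted logistic model would learn its slope and offset), and the feature vectors, labels and hyperparameters are invented for the example.

```python
import math


def train_linear_svm(rows, labels, reg=0.01, learning_rate=0.1, epochs=100):
    """Subgradient descent on the L2-regularised hinge loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for row, label in zip(rows, labels):
            margin = label * (sum(w * x for w, x in zip(weights, row)) + bias)
            # Regularisation step: shrink every weight slightly.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            if margin < 1.0:  # inside the margin or misclassified
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, row)]
                bias += learning_rate * label
    return weights, bias


def decision_value(weights, bias, row):
    return sum(w * x for w, x in zip(weights, row)) + bias


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical binary features: [gc, czj, bp, ss, other_actor].
training_rows = [
    [1, 0, 1, 1, 0],  # liked
    [1, 1, 1, 1, 0],  # liked
    [0, 0, 0, 0, 1],  # not liked
    [0, 1, 0, 0, 1],  # not liked
]
training_labels = [1, 1, -1, -1]
weights, bias = train_linear_svm(training_rows, training_labels)

candidates = {
    "candidate_a": [1, 0, 0, 1, 0],
    "candidate_b": [0, 0, 0, 0, 1],
}
scores = {name: sigmoid(decision_value(weights, bias, row))
          for name, row in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because the sigmoid is monotone, it does not change the ordering; its role in the original approach is to turn margins into comparable scores.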
  25. 25. A hybrid RS (model-based)
  26. 26. Let’s continue with a Hybrid RS. We want to recommend items $i$ to user $u$, exploiting both the LOD knowledge base and other users’ interactions. The ultimate goal of this recommender system is to rank in the top-N positions the items that are likely to be relevant for the user, in the presence of implicit feedback. Given the nature of the problem, the user profile is defined as: $profile(u) = \{i \mid i \text{ is relevant for } u\}$. [Vito Claudio Ostuni, Tommaso Di Noia, Eugenio Di Sciascio, Roberto Mirizzi. Top-N Recommendations from Implicit Feedback leveraging Linked Open Data. 7th ACM Conference on Recommender Systems (RecSys 2013)]
  27. 27. Path-based features. We define $x_{ui} \in \mathbb{R}^D$ as the feature vector encoding all the interactions between user $u$ and item $i$. Each component of this vector represents the relevance score between $u$ and $i$ with respect to a particular feature, and is defined as: $x_{ui}(j) = \frac{\#path_{ui}(j)}{\sum_{d=1}^{D} \#path_{ui}(d)}$. The paths can be content-based, collaborative or hybrid.
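The normalisation behind the path-based features is simple enough to show directly. The path-type names and counts below are placeholders; how paths are enumerated in the graph is outside the scope of this sketch.

```python
def path_features(path_counts):
    """Turn raw path counts per path type into relative frequencies.

    path_counts: {path_type: number of paths of that type between u and i}.
    """
    total = sum(path_counts.values())
    if total == 0:
        return {path_type: 0.0 for path_type in path_counts}
    return {path_type: count / total
            for path_type, count in path_counts.items()}


# Hypothetical counts for one (user, item) pair with D = 3 path types.
counts_ui = {"path_type_1": 3, "path_type_2": 1, "path_type_3": 0}
x_ui = path_features(counts_ui)
print(x_ui)
```

Each pair's components sum to one whenever at least one path exists, so pairs with many and few connecting paths stay comparable.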
  28. 28. Learning the ranking function. To predict the ranking and form the top-N recommendation lists, we treat the task as a learning-to-rank problem and adopt a point-wise approach. In particular, we use a combination of Random Forests and Gradient Boosted Regression Trees (GBRT).
  29. 29. Thank you!
