Personalization Tutorial at ACM Compute 2008
Tutorial on Personalization techniques. Covers user profile creation, document modeling techniques (including LSI and PLSI) and the use of semantics in personalization.


Usage Rights

© All Rights Reserved

Presentation Transcript

  • Personalization: Techniques and applications Krishnan Ramanathan, Geetha Manjunath, Somnath Banerjee HP Labs, Bangalore © 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
  • Topics • Overview of Personalization • User Profile creation • Personalizing Search • Document modeling • Recommender system • Semantics in Personalization 2 22 January 2008
  • Overview of Personalization - Krishnan Ramanathan
  • Why Personalization ? • Scale of the web is limiting its utility • There is too much information • Consumer has to do all the work to use the web • Search engines and portals provide the same results for different personalities, intentions and contexts • Personalization can be the solution • Customize the web for individuals by • Filtering out irrelevant information • Identifying relevant information 4 22 January 2008
  • Some quotes from NY bits • “I am married with a house. Why do I see so many ads for online dating sites and cheap mortgages? Should I be happy that I see those ads? It means Internet advertisers still have no idea who I am.” 5 22 January 2008
  • Personalization • Goal – Provide users what they need without requiring them to ask for it explicitly • Steps • Generate useful, actionable knowledge about users • Use this knowledge for personalizing an application • User centric data model – Data must be attributable to specific user • Two kinds • Business Centric : Amazon, Ebay • Consumer Centric • Personalization requires User Profiling 6 22 January 2008
  • Applications of Personalization • Interface Personalization • E.g. Go directly to the web page of interest instead of site home page • Content personalization • Filtering (News, blog articles, videos etc) • Ratings based recommendations • Amazon, Stumbleupon • Search • Text, images, stories, research papers • Ads • Service Personalization 7 22 January 2008
  • Why is personalization hard ? • Server side personalization – Sites do not see all data • E.g. A user might visit Expedia and Orbitz, Expedia doesn’t know what the user did on Orbitz • Difficult to get user context • User needs to agree to cookies or login • Site profiles are not portable • Some standards are emerging (Attention profile markup language) • Privacy 8 22 January 2008
  • Personalization example 1 (Routing queries) Google alerts Google news page routing queries 9 22 January 2008
  • Personalization example 2 - Amazon 10 22 January 2008
  • Personalization Example 3 – Google news 11 22 January 2008
  • Personalization Example 4 – Yahoo MyWeb 12 22 January 2008
  • The future … 13 22 January 2008
  • User Profile Creation - Krishnan Ramanathan
  • Outline • User Profile creation • Profile Privacy • Evaluating and managing user profiles • Personalizing search 15 22 January 2008
  • User profile information • Two kinds of information • Factual (Explicit) • Behavioral (Implicit) • Factual – Geographic, Demographic, Psychographic information • Eg. Age is 25 years, searched for Lexus, lives in Bangalore • Behavioral – Describes behavioral activities (visits finance sites, buys gadgets) 16 22 January 2008
  • Client side versus Server side profiles • Server side: has queries and clickstreams from multiple users; doesn't see all the user data; gives users no way to aggregate and reuse the profiles that different websites (Google, Yahoo, ..) build using their data; privacy is a big problem; server cycles have to be shared, though some computations can be done once and reused • Client side: no access to clickstreams of multiple users; sees all user data; possible for the user to aggregate and reuse their attentional information; strong privacy model; can access the full compute power at the client 17 22 January 2008
  • Desired profile characteristics • Represent multiple interests • Adapt to changing user interests • Incorporate contextual information 18 22 January 2008
  • Using User profiles to personalize services [Architecture diagram: explicit and implicit user info → Data Collection → Profile Constructor → User Profile → Profile-to-Content Matching against content (search queries, news, video, …) → personalized services for the user] Diagram adapted from Gauch et al., Chapter 2, The Adaptive Web, Springer LNCS 4321 19 22 January 2008
  • User Profiling approaches • Broadly two approaches • IR approach • User interests derived from text (documents/search queries) • Machine learning approach • Model user based on positive and negative examples of his interests • Problems • Getting labeled samples • High dimensional feature space 20 22 January 2008
  • Profile building Steps • Authenticate the user • Select information to build profile from and archive the information if necessary (eg. Web pages might get flushed from IE cache) • Build/Refresh/Expand/Prune the profile • Use it in an application • Evaluate the profile 21 22 January 2008
  • Authenticating the user • Users need to be authenticated in order to attribute data to a particular user for profile creation • Identifying single user • Login • Cookies • IP address (when it is static) • Identifying different users on same machine • Login • Biometrics 22 22 January 2008
  • Explicit user information collection • Ask the user for • static information • Name, age, residence location, hobbies, interests etc • Google personalization - found explicit information to be noisy • People specified literature as one of their interests but did not make a single related search • Matchmine - presents examples (movies, TV shows, music, blog topics) and asks the users to explicitly rate them • Ratings • Netflix, Stumbleupon (thumbs up/down) • In general, people do not like to give explicit information frequently • Recent research (Jian Hu WWW 2007) showed good results for gender and age prediction based on users' browsing behavior 23 22 January 2008
  • Explicit information collection: Matchmine interface 24 22 January 2008
  • Implicit user information collection • Data sources • Web pages, documents, search queries, location • Information from applications (Media players, Games) • Data collection techniques • Desktop based • Browser cache • Proxy servers • Browser plugins • Server side • Web logs • Search logs 25 22 January 2008
  • How much implicit info to use? • Teevan (SIGIR 2005) constructed two profiles • One with only search queries • Other using all information on desktop • Findings • Richer information => better profile • All docs better than only recent docs better than only web pages better than only search queries better than no personalization • Drawback with implicit info - cannot collect info about user dislikes 26 22 January 2008
  • Stereotypes • Generalizations from communities of users • Characteristics of group of users • Stereotypes alleviate the bootstrap problem • Construction of stereotypes • Manual – e.g. Bangalore user will be interested in IT • Automatic method • Clustering – Similar profiles are clustered and common characteristics extracted 27 22 January 2008
  • How Acxiom delivers personalized ads (source - WSJ) • Acxiom has accumulated a database of about 133 million households and divided it into 70 demographic and lifestyle clusters based on information available from public sources. • A person gives one of Acxiom’s Web partners his address by buying something, filling out a survey or completing a contest form on one of the sites. • Acxiom checks the address against its database and places a “cookie,” or small piece of tracking software, embedded with a code for that person’s demographic and behavioral cluster on his computer hard drive. • When the person visits an Acxiom partner site in the future, Acxiom can use that code to determine which ads to show • Through another cookie, Acxiom tracks what consumers do on partner Web sites 28 22 January 2008
  • Profile representation • Bag of words (BOW) • Use words in user documents to represent user interests • Issues • Words appear independent of page content (“Home”, “page”) • Polysemy (word has multiple meanings e.g. bank) • Synonymy (multiple words have same meanings e.g. joy, happiness) • Large profile sizes • Concepts (e.g. DMOZ) • Use existing ontology maintained for free • Issues • Too large (about 6 lakh DMOZ nodes), ontology has to be drastically pruned for use • Need to build classifiers for each DMOZ node 29 22 January 2008
  • Word based term vector profiles • Profile represented as sets of words tf*idf weighted • Could use one long profile vector or different vectors for different topics (sports, health, finance) • Documents converted to same representation, matched with keyword vectors using cosine similarity • Should take structure of the document into account (ignore html tags, email header vs body) 30 22 January 2008
  • Word based hierarchical profiles [Figure: interest hierarchy with support values - User Profile:10 → Research:5 (IR:3 → Search:2, DB:2), Sports:3.5 (Soccer:2, Others:1.5), Sex:1.5; support decreases from high to low level, and from left to right] We are thankful to Yabo Arber-Xu from Simon Fraser University for kindly allowing us to use slides numbered 31,37,38,39 from his WWW 07 presentation. 31 22 January 2008
  • Building word based hierarchical profiles • Build a (word, document) map for each word occurring in the corpus • Order words by amount of support • Support of a word = number of documents in which the word appears • For each word • Decide whether to merge with another word (using some measure of similarity) • Decide whether to make one word the child of the other 32 22 January 2008
  • 33 22 January 2008
  • Term similarity and Parent-child terms • Words that cover the same document sets are similar • Jaccard measure: Sim(w1, w2) = |D(w1) ∩ D(w2)| / |D(w1) ∪ D(w2)| • Parent-child terms • A specific term is a child of a more general term if it frequently occurs with the general term (but the reverse is not true) • Word w2 is taken as a child of term w1 if P(w1|w2) > some_threshold • e.g. the terms “Soccer” and “Badminton” might co-occur with the term “Sport” but not the other way around 34 22 January 2008
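Both tests above fit in a few lines. Below is a minimal sketch on a toy word-document map; the document sets and the 0.8 threshold are invented for illustration, not taken from the tutorial:

```python
# D(w): the set of documents containing word w (toy data for illustration)
D = {
    "sport":     {1, 2, 3, 4, 5, 6},
    "soccer":    {1, 2, 3},
    "badminton": {5, 6},
}

def jaccard(w1, w2):
    """Similarity of two words by overlap of their document sets."""
    return len(D[w1] & D[w2]) / len(D[w1] | D[w2])

def is_child(w2, w1, threshold=0.8):
    """w2 is a child of w1 if P(w1|w2) is high while P(w2|w1) is not."""
    both = len(D[w1] & D[w2])
    return both / len(D[w2]) > threshold and both / len(D[w1]) <= threshold

print(jaccard("soccer", "sport"))   # 0.5
print(is_child("soccer", "sport"))  # True: every soccer doc also mentions sport
print(is_child("sport", "soccer"))  # False: the reverse does not hold
```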
  • Personalization and Privacy • Studies have shown that • People are comfortable sharing preferences (favourite TV show, snack etc.), demographic and lifestyle information • People not comfortable sharing financial and purchase related information • Facebook fiasco because of reporting “Your friends bought …” • Financial rewards (even small amounts) encourage disclosure • People parted with valuable information for Singapore $15 35 22 January 2008
  • Privacy related attitudes (Teltzrow/Kobsa 2003) 36 22 January 2008
  • What and How much to Reveal? - 1 [Figure: the profile tree - User Profile:10 → Research:5 (IR:3 → Search:2, DB:2), Sports:3.5 (Soccer:2, Others:1.5), Sex:1.5 - annotated “more sensitive” toward topics like Sex and “more specific” toward deeper levels] Manual option - absolute privacy guarantee, but requires a lot of user intervention 37 22 January 2008
  • What and How much to Reveal? - 2 • User Profile U → indicator of a user's possible interests • Term t → indicator of a possible interest, P(t) = Sup(t)/|D| • The amount of information for an interest t: I(t) = log(1/P(t)) = log(|D|/Sup(t)) → an indication of the specificity and sensitivity of an interest • H(U), the amount of information carried by U: H(U) = Σ_t P(t) × I(t) • Two privacy parameters: MinDetail - protect t with P(t) < MinDetail; ExpRatio - H(U_exp)/H(U) • The more detail we expose, the higher the ExpRatio 38 22 January 2008
  • What and How much to Reveal? - 3 [Figure: the same profile tree shown at two exposure levels - minDetail=0.5 gives expRatio=44%; minDetail=0.3 gives expRatio=69%] The minDetail and expRatio parameters allow a balance between privacy and personalization. 39 22 January 2008
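Working from the definitions two slides back, the computation can be sketched as below. The support values come from the example profile tree; the exposure rule (expose terms with P(t) >= minDetail) is one plausible reading of the slides, so the printed ratios are illustrative and do not reproduce the 44%/69% figures:

```python
import math

TOTAL_DOCS = 10  # |D|: support of the profile root
support = {"Research": 5, "Sports": 3.5, "Sex": 1.5, "IR": 3,
           "DB": 2, "Soccer": 2, "Others": 1.5, "Search": 2}

def p(t):
    return support[t] / TOTAL_DOCS             # P(t) = Sup(t) / |D|

def info(t):
    return math.log(TOTAL_DOCS / support[t])   # I(t) = log(|D| / Sup(t))

def entropy(terms):
    return sum(p(t) * info(t) for t in terms)  # H = sum of P(t) * I(t)

def exp_ratio(min_detail):
    """Share of profile information exposed when hiding terms below min_detail."""
    exposed = [t for t in support if p(t) >= min_detail]
    return entropy(exposed) / entropy(support)

print(round(exp_ratio(0.5), 2), round(exp_ratio(0.3), 2))  # 0.13 0.41
```

Lowering minDetail exposes more terms, so the ratio can only grow as the threshold drops.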
  • Profile portability • Move the profile to a central server • Claria PersonalWeb, Google-Yahoo-Microsoft • Provision to delete search queries, visited pages • No control over which part of the profile can be used • Have a client side component that reconstructs the profile on the client using server side info (Matchmine) • Attention Profile markup language • Allows explicit and implicit information to be stored (as XML) and provided to web services 40 22 January 2008
  • Attention Profile Markup (http://www.apml.org) 41 22 January 2008
  • Application-independent evaluation of the profile • Stability • Number of profile elements that do not change over the evaluation cycle • Precision • How many items in the profile does the user agree with as representative of his interests? • Does the user agree with the strength of the interest? • Do interests at deeper levels of the hierarchy have less precision compared to interests at higher levels? • Which data sources (bookmarks, search keywords, web pages) are better? • Bookmarks were not very representative of user interests in our study 42 22 January 2008
  • Profile evaluation [Sample evaluation of one profiling algorithm - Figure 1: stability vs. number of web pages in cache; Figure 2: precision vs. support; Figure 3: precision by hierarchy level] • Profiles are stable (fig 1) • Profile elements with high support have high precision (fig 2) • Profile elements at all levels of the hierarchy have similar precision (fig 3) 43 22 January 2008
  • Managing the profile • Profiles may need to be expanded (bootstrapped) or pruned • Allowing users to manually edit their profiles to add/delete topics of interest was found to make performance worse (Jae-wook Ahn, WWW 2007) • Adding and deleting topics to profile harmed system performance • Deleting topics harmed performance four times more compared to adding topics • Some agents learn short term and long term profiles separately using different techniques (K-NN for short term interests, Naïve Bayes for long term interests) 44 22 January 2008
  • Personalizing Search - Krishnan Ramanathan
  • Personalized search • Search can be personalized based on • User profile • Current working context • Past search queries • Server side clickstreams • Personalized Pagerank • Determining user intent is hard (e.g. the query “Visa”) 46 22 January 2008
  • A generic personalized search algorithm using a user profile • Inputs- User profile, Search query • Output – A results vector reordered by the user’s preference • Steps • Send the query to a search engine • Results[] = A vector of the search engine’s results • For each item i in Results[] calculate the preference Pref [i] = α *Similarity(Results[i] , User Profile) + (1- α)*SearchEngineRank • Sort Results[] using Pref [i] as the comparator 47 22 January 2008
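The steps above can be sketched as follows; the word-overlap similarity function and the conversion of a rank position into a 0-1 score are illustrative choices, not prescribed by the tutorial:

```python
def similarity(doc, profile):
    """Word-overlap (Jaccard) similarity between a document and profile keywords."""
    words = set(doc.lower().split())
    return len(words & profile) / len(words | profile)

def personalize(results, profile, alpha=0.7):
    """Re-rank a search engine's result list (best first) using the user profile."""
    n = len(results)
    def pref(item):
        pos, doc = item
        engine_score = (n - pos) / n  # normalize the engine rank to a 0-1 score
        return alpha * similarity(doc, profile) + (1 - alpha) * engine_score
    return [doc for _, doc in sorted(enumerate(results), key=pref, reverse=True)]

profile = {"cricket", "bangalore"}
results = ["python snake facts", "bangalore cricket scores", "python programming"]
print(personalize(results, profile)[0])  # bangalore cricket scores
```

With alpha = 0 the engine's order is kept; with alpha = 1 only profile similarity matters.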
  • Current working context - JIT retrieval • Context includes time, location, applications currently running, documents currently opened, IM status • Use profile and current context to provide relevant (and just-in-time) information • Blinkx toolbar - provides relevant news, video and Wikipedia articles within different applications (Microsoft Word, IE browser) • Intersect interests from the overall profile with current context to get the contextual profile • Context can also be used in query expansion 48 22 January 2008
  • Personalization based on Search history • Use query-to-query similarity to suggest results that satisfied past queries • Create user profiles from past queries/snippets from search results clicked • Misearch (Gauch et al. 2004) creates weighted concept hierarchies based on ODP as the reference concept hierarchy • Compute degree of similarity between search engine result snippets (title and text summaries) and user profile as sim(user_i, doc_j) = Σ_{k=1..n} wp_ik × wd_jk, where wp_ik = weight of concept k in profile i and wd_jk = weight of concept k in document j 49 22 January 2008
  • Personalization by clickthrough data analysis – CubeSVD (Jian-Tao Sun, WWW 2005) • Search engine has tuples of the form (User, Query, Visited page) • Multiple tuples constitute a tensor (generalization of matrix to higher dimensions) • Higher order SVD (HOSVD) performs SVD on tensor • The reconstructed tensor is a tuple of the form (User, Query, web page, p) • Where p is the probability that the user posing the query will visit the web page • Recommend pages with highest value of p • Computationally intensive but HOSVD can be done offline • Need to recompute to account for new clickthrough data 50 22 January 2008
  • Topic sensitive pagerank (Haveliwala 2002) • For the top 16 ODP categories, create a pagerank vector • Each web page/document d has multiple ranks depending on what the topic of interest j is • For a query q, compute P(Cj | q) ∝ P(Cj) × P(q | Cj) • Intuition: if a topic is more probable given a query, the topic specific rank should have more say in the final rank • Compute the query sensitive rank as Σ_j P(Cj | q) × rank_jd 51 22 January 2008
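A small numeric sketch of the final combination step; the topic probabilities and per-topic ranks below are made-up values for illustration, not numbers from Haveliwala's paper:

```python
# P(Cj | q): how likely each topic is given the query (hypothetical values)
topic_given_query = {"Finance": 0.8, "Travel": 0.2}

# rank_jd: precomputed topic-specific PageRank of each document (hypothetical)
topic_rank = {
    "doc1": {"Finance": 0.9, "Travel": 0.1},
    "doc2": {"Finance": 0.2, "Travel": 0.7},
}

def query_sensitive_rank(doc):
    """Combine topic-specific ranks, weighted by the query's topic distribution."""
    return sum(p * topic_rank[doc][c] for c, p in topic_given_query.items())

print(round(query_sensitive_rank("doc1"), 2))  # 0.74 = 0.8*0.9 + 0.2*0.1
```

Because the query leans toward Finance, doc1's Finance-specific rank dominates and it outranks doc2.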
  • Topics • Overview of Personalization • User Profile creation • Personalizing Search • Document modeling • Recommender System • Semantics in Personalization 52 22 January 2008
  • Document modeling - Somnath Banerjee
  • Under this topic • Document representation • Document analysis using • Latent Semantic Analysis (LSA) • Probabilistic Latent Semantic Analysis (PLSA) • Document Classification • Support Vector Machine (SVM): A machine learning algorithm 54 22 January 2008
  • Document representation • Term vector • Document is represented as a vector of terms • Each dimension corresponds to a separate term • Several methods of computing the weights of the terms • Binary weighting: 1 if the word appears in the document • Most well known is TF*IDF: tf_ij = n_ij / Σ_k n_kj; idf_i = log(|D| / |{d_j : t_i ∈ d_j}|); tfidf_ij = tf_ij × idf_i 55 22 January 2008
  • Computing similarity • sim(A, B) = cosine(θ) = A·B / (‖A‖ × ‖B‖) = Σ A_i × B_i / (√(Σ A_i²) × √(Σ B_i²)) • Jaccard coefficient: J(A, B) = |A ∩ B| / |A ∪ B| • Dice's coefficient: D(A, B) = 2|A ∩ B| / (|A| + |B|) 56 22 January 2008
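A minimal sketch of TF*IDF weighting plus cosine similarity, using three of the example headlines from the following slide (raw lowercase tokens, no stemming or stop-word removal):

```python
import math
from collections import Counter

docs = {
    "g1": "google gets green light from ftc for doubleclick acquisition",
    "g2": "google closes in on doubleclick acquisition",
    "e1": "eu agrees to reduce aviation emissions",
}

def tfidf(doc_id):
    """TF*IDF weight for each term of one document."""
    counts = Counter(docs[doc_id].split())
    total = sum(counts.values())
    vec = {}
    for term, n in counts.items():
        df = sum(1 for text in docs.values() if term in text.split())
        vec[term] = (n / total) * math.log(len(docs) / df)
    return vec

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm(a) * norm(b))

g1, g2, e1 = tfidf("g1"), tfidf("g2"), tfidf("e1")
print(cosine(g1, g2) > cosine(g1, e1))  # True: the two Google stories are closer
```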
  • Example • g1: Google Gets Green Light from FTC for DoubleClick Acquisition • g2: Google Closes In on DoubleClick Acquisition • g3: FTC clears Google DoubleClick deal • g4: US regulator clears DoubleClick deal • g5: DoubleClick deal brings greater focus on privacy • e1: EU Agrees to Reduce Aviation Emissions • e2: Aviation to be included in EU emissions trading • e3: EU wants tougher green aviation laws • Underlined words appeared in more than one document 57 22 January 2008
  • Term Document Matrix (X) g1 g2 g3 g4 g5 e1 e2 e3 google 1 1 1 0 0 0 0 0 green 1 0 0 0 0 0 0 1 ftc 1 0 1 0 0 0 0 0 doubleclick 1 1 1 1 1 0 0 0 acquisition 1 1 0 0 0 0 0 0 clear 0 0 1 1 0 0 0 0 deal 0 0 1 1 1 0 0 0 eu 0 0 0 0 0 1 1 1 aviation 0 0 0 0 0 1 1 1 emission 0 0 0 0 0 1 1 0 58 22 January 2008
  • Retrieval example • Query (or Profile) q = “Google Acquisition” • Query vector q = [1 0 0 0 1 0 0 0 0 0]' • Cosine similarity of the query to the documents g1 g2 g3 g4 g5 e1 e2 e3 S= 0.634 0.816 0.447 0 0 0 0 0 • What about the documents g4 and g5? • Problem of data sparsity 59 22 January 2008
  • Under this topic • Document representation • Document analysis using • Latent Semantic Analysis (LSA) • Probabilistic Latent Semantic Analysis (PLSA) • Document Classification • Support Vector Machine (SVM): A machine learning algorithm 60 22 January 2008
  • Latent Semantic Analysis (LSA) • If you are searching for “Tata Nano”, are not the documents containing “People’s Car” also relevant? • How can a machine understand that? • Analyze the collection of documents • Documents that contain “Tata Nano” generally contain “People’s Car” as well • Covariance of these two dimensions is high • LSA finds such correlations using a technique from linear algebra 61 22 January 2008
  • LSA • Transforms the term document matrix into a relation between the • terms and some concepts, • relation between those concepts and the documents • Concepts are the dimensions of maximum variance • Removes the dimensions with low variance • Reduction in feature space • Term document matrix becomes denser 62 22 January 2008
  • Singular Value Decomposition • X (terms × documents, t × d) = T S D', where T is t × m, S is m × m and D' is m × d • Singular values σ1 ≥ σ2 ≥ … ≥ σm > 0 • m is the rank of the matrix X • T and D are orthonormal matrices • S is a diagonal matrix of singular values 63 22 January 2008
  • Reduced SVD • Choose the largest k singular values (σ1 … σk) • Choose the corresponding k columns of T and D • Then construct X_k = T_k S_k D_k' (T_k is t × k, S_k is k × k, D_k' is k × d) • X_k is the best rank-k approximation of X in terms of the Frobenius norm 64 22 January 2008
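The reduced SVD and query folding can be reproduced with numpy on the example term-document matrix from the following slides. Signs and magnitudes of the latent dimensions depend on the SVD implementation, so the scores will not match the slide's numbers exactly, but the google-related documents (including g4 and g5, which share no term with the query) all outscore the EU-related ones:

```python
import numpy as np

terms = ["google", "green", "ftc", "doubleclick", "acquisition",
         "clear", "deal", "eu", "aviation", "emission"]
docs = ["g1", "g2", "g3", "g4", "g5", "e1", "e2", "e3"]
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],  # google
    [1, 0, 0, 0, 0, 0, 0, 1],  # green
    [1, 0, 1, 0, 0, 0, 0, 0],  # ftc
    [1, 1, 1, 1, 1, 0, 0, 0],  # doubleclick
    [1, 1, 0, 0, 0, 0, 0, 0],  # acquisition
    [0, 0, 1, 1, 0, 0, 0, 0],  # clear
    [0, 0, 1, 1, 1, 0, 0, 0],  # deal
    [0, 0, 0, 0, 0, 1, 1, 1],  # eu
    [0, 0, 0, 0, 0, 1, 1, 1],  # aviation
    [0, 0, 0, 0, 0, 1, 1, 0],  # emission
], dtype=float)

k = 2
T, s, Dt = np.linalg.svd(X, full_matrices=False)
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k, :].T  # Dk: one k-dim row per doc

q = np.zeros(len(terms))
q[terms.index("google")] = q[terms.index("acquisition")] = 1

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# compare the folded-in query with each document in the S-scaled latent space
scores = [cos(q @ Tk, d @ Sk) for d in Dk]
for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(doc, round(score, 3))
```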
  • Example • g1: Google Gets Green Light from FTC for DoubleClick Acquisition • g2: Google Closes In on DoubleClick Acquisition • g3: FTC clears Google DoubleClick deal • g4: US regulator clears DoubleClick deal • g5: DoubleClick deal brings greater focus on privacy • e1: EU Agrees to Reduce Aviation Emissions • e2: Aviation to be included in EU emissions trading • e3: EU wants tougher green aviation laws • Query (or Profile) q = “Google Acquisition” 65 22 January 2008
  • Term Document Matrix (X) g1 g2 g3 g4 g5 e1 e2 e3 google 1 1 1 0 0 0 0 0 green 1 0 0 0 0 0 0 1 ftc 1 0 1 0 0 0 0 0 doubleclick 1 1 1 1 1 0 0 0 acquisition 1 1 0 0 0 0 0 0 clear 0 0 1 1 0 0 0 0 deal 0 0 1 1 1 0 0 0 eu 0 0 0 0 0 1 1 1 aviation 0 0 0 0 0 1 1 1 emission 0 0 0 0 0 1 1 0 66 22 January 2008
  • LSA Example [Slide shows the matrices T (10×7), S (7×7) and D' (7×8) for the example term-document matrix] 67 22 January 2008
  • LSA Example • Rank 2 approximation of X [matrix shown on slide, terms × documents] 68 22 January 2008
  • LSA Example • Query (or Profile) q = “Google Acquisition” • Query vector q = [1 0 0 0 1 0 0 0 0 0]' • Representation of the query: Dq = q' T2 S2⁻¹ = [-0.204 0.005] • Query to document similarity: Sim = Dq S2² D2' 69 22 January 2008
  • LSA Example [Slide shows the matrix computation Sim = Dq × S2² × D2'] 70 22 January 2008
  • Example • g1: Google Gets Green Light from FTC for DoubleClick Acquisition [1.284] • g2: Google Closes In on DoubleClick Acquisition [0.936] • g3: FTC clears Google DoubleClick deal [1.426] • g4: US regulator clears DoubleClick deal [0.891] • g5: DoubleClick deal brings greater focus on privacy [0.697] • e1: EU Agrees to Reduce Aviation Emissions [0.035] • e2: Aviation to be included in EU emissions trading [0.035] • e3: EU wants tougher green aviation laws [0.152] • Underlined words appeared in more than one document 71 22 January 2008
  • Under this topic • Document representation • Document analysis using • Latent Semantic Analysis (LSA) • Probabilistic Latent Semantic Analysis (PLSA) • Document Classification • Support Vector Machine (SVM): A machine learning algorithm 72 22 January 2008
  • Probabilistic Latent Semantic Analysis (PLSA) • If we know the document collection contains two topics, can we do better? • Can we estimate • Probability( topic | document )? • Probability( word | topic )? • If we can also estimate Probability( topic | query ) then we can compute the document to query similarity • PLSA is a statistical technique to estimate those probabilities from a collection of documents 73 22 January 2008
  • Probabilistic Latent Semantic Analysis (PLSA) • Dyadic data: Two (abstract) sets of objects, X ={x1, ..,xm} and Y ={y1, … ,yn} in which observations are made of dyads(x,y) • Simplest case: observation of co-occurrence of x and y • Other cases may involve scalar weight for each observation • Examples: • X = Documents, Y =Words • X = Users, Y =Purchased Items • X = Pixels, Y =Values 74 22 January 2008
  • PLSA • Document consists of topics and words in the document are generated based on those topics • Generative model (asymmetric): (di, wj) is generated as follows • pick a document with probability P(di) • pick a topic zk with probability P(zk | di) • generate a word wj with probability P(wj | zk) • P(di, wj) = P(di) P(wj | di), where P(wj | di) = Σ_{k=1..K} P(wj | zk) P(zk | di) [graphical model: D → Z → W with P(di), P(zk | di), P(wj | zk)] 75 22 January 2008
  • PLSA • Parameters P(di), P(zk | di), P(wj | zk) • P(di) is proportional to the number of times the document is observed and can be computed independently • P(zk | di), P(wj | zk) can be estimated using the Expectation Maximization (EM) algorithm • P(D, W) = Π_{i=1..M} Π_{j=1..N} P(di, wj)^n(di, wj) • Log-likelihood: L = Σ_{i=1..M} Σ_{j=1..N} n(di, wj) ln P(di, wj) • M = number of documents; N = number of distinct words 76 22 January 2008
  • PLSA: EM steps • E-Step: P(zk | di, wj) = P(wj | zk) P(zk | di) / Σ_{l=1..K} P(wj | zl) P(zl | di) • M-Step: P(wj | zk) = Σ_{i=1..M} n(di, wj) P(zk | di, wj) / Σ_{m=1..N} Σ_{i=1..M} n(di, wm) P(zk | di, wm), and P(zk | di) = Σ_{j=1..N} n(di, wj) P(zk | di, wj) / n(di) 77 22 January 2008
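The E- and M-steps above can be implemented in a few lines of numpy. This sketch runs them on the example dyadic data from the next slide; the initialization is random (so the learned topic labels are arbitrary), and the check is the one EM guarantees: the log-likelihood never decreases.

```python
import numpy as np

pairs = [("g1", w) for w in "google green ftc doubleclick acquisition".split()] + \
        [("g2", w) for w in "google doubleclick acquisition".split()] + \
        [("g3", w) for w in "ftc clear google doubleclick deal".split()] + \
        [("g4", w) for w in "clear doubleclick deal".split()] + \
        [("g5", w) for w in "doubleclick deal".split()] + \
        [("e1", w) for w in "eu aviation emission".split()] + \
        [("e2", w) for w in "aviation eu emission".split()] + \
        [("e3", w) for w in "eu green aviation".split()]

docs = sorted({d for d, _ in pairs})
words = sorted({w for _, w in pairs})
n = np.zeros((len(docs), len(words)))      # co-occurrence counts n(d, w)
for d, w in pairs:
    n[docs.index(d), words.index(w)] += 1

K = 2
rng = np.random.default_rng(0)
Pwz = rng.random((len(words), K)); Pwz /= Pwz.sum(axis=0)  # P(w|z)
Pzd = rng.random((K, len(docs))); Pzd /= Pzd.sum(axis=0)   # P(z|d)

def log_likelihood():
    return float((n * np.log((Pwz @ Pzd).T + 1e-12)).sum())

ll_start = log_likelihood()
for _ in range(20):
    # E-step: responsibilities P(z|d,w), shape docs x words x K
    post = Pzd.T[:, None, :] * Pwz[None, :, :]
    post /= post.sum(axis=2, keepdims=True)
    # M-step: re-estimate P(w|z) and P(z|d) from the weighted counts
    nz = n[:, :, None] * post
    Pwz = nz.sum(axis=0); Pwz /= Pwz.sum(axis=0, keepdims=True)
    Pzd = nz.sum(axis=1).T; Pzd /= Pzd.sum(axis=0, keepdims=True)
ll_end = log_likelihood()
print(ll_end > ll_start)  # True: EM never decreases the likelihood
```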
  • PLSA Example • Dyadic data in our example: g1-google, g1-green, g1-ftc, g1-doubleclick, g1-acquisition; g2-google, g2-doubleclick, g2-acquisition; g3-ftc, g3-clear, g3-google, g3-doubleclick, g3-deal; g4-clear, g4-doubleclick, g4-deal; g5-doubleclick, g5-deal; e1-eu, e1-aviation, e1-emission; e2-aviation, e2-eu, e2-emission; e3-eu, e3-green, e3-aviation 78 22 January 2008
  • PLSA Example • After 20 iterations of the EM algorithm [slide shows the estimated tables P(zk | di) and P(wj | zk)] 79 22 January 2008
  • PLSA Example • Query q = “Google Acquisition” • Steps • Keep P(wj | zk) fixed • Estimate P(zk | q) using the EM steps • Then compute the cosine similarity of the vector P(Z | q) to each P(Z | d) • Estimated here: P(z1 | q) = 1, P(z2 | q) = 0 80 22 January 2008
  • Example • g1: Google Gets Green Light from FTC for DoubleClick Acquisition [1.0] • g2: Google Closes In on DoubleClick Acquisition [1.0] • g3: FTC clears Google DoubleClick deal [1.0] • g4: US regulator clears DoubleClick deal [1.0] • g5: DoubleClick deal brings greater focus on privacy [1.0] • e1: EU Agrees to Reduce Aviation Emissions [0.0] • e2: Aviation to be included in EU emissions trading [0.0] • e3: EU wants tougher green aviation laws [0.0] • Underlined words appeared in more than one document 81 22 January 2008
  • Under this topic • Document representation • Document analysis using • Latent Semantic Analysis (LSA) • Probabilistic Latent Semantic Analysis (PLSA) • Document Classification • Support Vector Machine (SVM): A machine learning algorithm 82 22 January 2008
  • Document Classification 83 22 January 2008
  • Document classification with SVM • We will concentrate on binary classification • {sports, not sports}, {interesting, not interesting} etc • In general {+1,-1} also called {positive, negative} • SVM is a supervised machine learning technique. It learns the pattern from a training set • Training set • A set of documents with labels belonging to {+1, -1} • SVM tries to draw a hyperplane that best separates the positive and negative data in the training set 84 22 January 2008
  • Support Vector Machine (SVM) • A machine learning algorithm • SVM was introduced in COLT-92 by Boser, Guyon and Vapnik • Initially popularized in the NIPS community, now an important and active area of machine learning research • Successful applications in many fields (text, bioinformatics, handwriting, image recognition etc.) 85 22 January 2008
  • SVM - Maximum margin separation [Figure: maximum-margin separating hyperplane; SVM illustration by Bülent Üstün, Radboud Universiteit] 86 22 January 2008
  • Mapping to higher dimension for non-separable data • In 2-D: P1 (0,0) → {+1}, P2 (0,1) → {-1}, P3 (1,1) → {+1}, P4 (1,0) → {-1} • Feature map: x → φ(x) = (x1², x2², x1x2) • In the mapped space: P1 (0,0,0) → {+1}, P2 (0,1,0) → {-1}, P3 (1,1,1) → {+1}, P4 (1,0,0) → {-1} 87 22 January 2008
  • The XOR example SVM uses kernel trick to map data to higher dimensional feature space without incurring much computational overhead 88 22 January 2008
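The XOR mapping from the previous slide can be checked in a few lines. The separating hyperplane below was found by inspection (it is one of many valid planes), not by running an SVM solver, which would instead find the maximum-margin one:

```python
# XOR labels are not linearly separable in 2-D, but become separable after the
# feature map phi(x) = (x1^2, x2^2, x1*x2)
points = {(0, 0): +1, (0, 1): -1, (1, 1): +1, (1, 0): -1}

def phi(x1, x2):
    return (x1 ** 2, x2 ** 2, x1 * x2)

# one separating hyperplane w.z + b in the mapped space, found by inspection
w, b = (-2.0, -2.0, 4.0), 1.0

def predict(x1, x2):
    score = sum(wi * zi for wi, zi in zip(w, phi(x1, x2))) + b
    return 1 if score > 0 else -1

print(all(predict(*x) == y for x, y in points.items()))  # True
```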
  • Recommender System • Select top N items for a user - Somnath Banerjee
  • Example 90 22 January 2008
  • Example 91 22 January 2008
  • Classification • Broadly three approaches • Content Based Recommendation • Collaborative Filtering • Hybrid approach 92 22 January 2008
  • Content based recommendation • Utility of an item for a user is determined based on the items preferred by the user in the past • Applies similar techniques as introduced in the document modeling part 93 22 January 2008
  • Basic Approach • Create and represent the user profile from the items rated by the user in the past • A popular choice of profile representation is a vector of terms weighted based on TF*IDF • Represent the item in the same format • A news item can be represented using a (TF*IDF) term vector • For movies, books one needs to get sufficient metadata to represent the item in vector format • Define a similarity measure to compute the similarity between the profile and the item • Popular choice is cosine similarity • Advanced machine learning techniques can also be applied to do the matching • Recommend most similar items 94 22 January 2008
  • Problems with content based recommendation • Knowledge engineering problem • How do you describe multimedia, graphics, movies, songs? • Recommendation shows limited diversity • New user problem • It requires a large number of ratings from the user to generate quality recommendations 95 22 January 2008
  • Collaborative filtering • Recommends items that are liked in the past by other users with similar tastes • Quite popular in e-commerce sites, like Amazon, eBay • Can recommend various media types, text, video, audio, Ads, products 96 22 January 2008
  • 97 22 January 2008
  • Advantages • Does not have the knowledge engineering problem • Both user and items can be represented using just ids • Often recommendation shows good amount of diversity 98 22 January 2008
  • Let's learn C.F. with an example

             Ran   Casablanca   Ben Hur   Tomb Raider   MI-II   Air Force One
    Jane      5        5                                  ?          2
    Bill               2           3                      4
    Tom       2                                2          5          5
    Cathy     3        3                                  1          1

    What rating will Jane possibly give to MI-II? 99 22 January 2008
  • Normalizing the ratings • All users won't give equal ratings even if they all equally liked/disliked an item • Normalize each rating: r = r_{u,i} - r̄_u

             Ran   Casablanca   Ben Hur   Tomb Raider   MI-II   Air Force One
    Jane      1        1                                             -2
    Bill              -1           0                      1
    Tom      -1.5                             -1.5       1.5         1.5
    Cathy     1        1                                 -1          -1

    100 22 January 2008
  • Similarity between users • Who are the other users with tastes similar to Jane's? • Each row of the matrix is a vector representing the user • Compute cosine similarity between the users: sim(Jane, Bill) = -0.289, sim(Jane, Tom) = -0.612, sim(Jane, Cathy) = 0.816 101 22 January 2008
  • Compute probable rating • The predicted rating is the (mean-adjusted) rating given by the other users, weighted by their similarity • Sometimes only the top N similar users are taken

  r̂_{u,i} = r̄_u + ( Σ_{v∈V} sim(u, v) × (r_{v,i} − r̄_v) ) / ( Σ_{v∈V} |sim(u, v)| )

  • Jane will rate MI-II as 4 + [ (−0.289 × 1) + (−0.612 × 1.5) + (0.816 × −1) ] / (0.289 + 0.612 + 0.816) ≈ 2.82 102 22 January 2008
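The full user-based CF pipeline — mean-centering, cosine similarity over co-rated items, similarity-weighted prediction — can be sketched end to end. The movie-to-rating assignment below is a reconstruction (the slide's table layout is not fully legible), chosen so that it reproduces the similarities −0.289, −0.612, 0.816 and the predicted rating ≈ 2.82:

```python
import math

ratings = {
    "Jane":  {"Ran": 5, "Casablanca": 5, "TombRaider": 2},
    "Bill":  {"Ran": 2, "BenHur": 3, "MI2": 4},
    "Tom":   {"Casablanca": 2, "BenHur": 2, "TombRaider": 5, "MI2": 5},
    "Cathy": {"Ran": 3, "Casablanca": 3, "TombRaider": 1, "MI2": 1},
}

def normalize(user):
    """Subtract the user's mean rating; returns (deviation vector, mean)."""
    r = ratings[user]
    mean = sum(r.values()) / len(r)
    return {i: v - mean for i, v in r.items()}, mean

def sim(u, v):
    """Cosine similarity between the users' mean-centered rating vectors."""
    nu, _ = normalize(u)
    nv, _ = normalize(v)
    dot = sum(nu[i] * nv[i] for i in nu if i in nv)
    return dot / (math.sqrt(sum(x * x for x in nu.values())) *
                  math.sqrt(sum(x * x for x in nv.values())))

def predict(user, item):
    """Similarity-weighted average of the other users' deviations."""
    _, mean = normalize(user)
    num = den = 0.0
    for other in ratings:
        if other != user and item in ratings[other]:
            s = sim(user, other)
            num += s * normalize(other)[0][item]
            den += abs(s)
    return mean + num / den

jane_mi2 = round(predict("Jane", "MI2"), 2)  # → 2.82
```

Note the denominator sums |sim(u, v)|, matching the worked example's 0.289 + 0.612 + 0.816.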
  • Remarks • There is another popular version of the above technique where, instead of user-to-user similarity, item-to-item similarity is computed • Rating prediction is then based on the similarity to the items rated by the user • The above methods are known as memory based techniques • They have the disadvantage of requiring more online computation 103 22 January 2008
  • Model based technique • A model is learnt using the collection of ratings as training set • Prediction is done using the model • More offline computing and less online computing 104 22 January 2008
  • Model based technique • A simple model: predict the expected rating r̂_{u,i} = E(r_{u,i}) = Σ_{r∈R} r × Pr(r_{u,i} = r | r_{u,s′}, s′ ∈ I_u), where I_u is the set of items already rated by user u and R is the set of rating values 105 22 January 2008
  • Model based technique • Recent research models the recommendation process with more complex probabilistic models, e.g. a latent class model u → z → (i, r): P(r | u, i) = Σ_z P(r | z, i) × P(z | u) • Parameters P(r|z,i) and P(z|u) can be estimated using the EM algorithm 106 22 January 2008
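A minimal sketch of prediction with the latent class model: given already-estimated tables for P(z|u) and P(r|z,i) (the parameter values below are hypothetical, not EM output), the expected rating is Σ_r r × Σ_z P(r|z,i) P(z|u):

```python
def expected_rating(user, item, P_z_given_u, P_r_given_zi, ratings=(1, 2, 3, 4, 5)):
    """E[r | u, i] under the latent class model P(r|u,i) = sum_z P(r|z,i) P(z|u)."""
    zs = {z for (_, z, _) in P_r_given_zi}
    def p(r):
        return sum(P_r_given_zi[(r, z, item)] * P_z_given_u[(z, user)] for z in zs)
    return sum(r * p(r) for r in ratings)

# Hypothetical parameters: two latent classes; class z0 always rates 5, z1 rates 1
P_z_given_u = {("z0", "jane"): 0.7, ("z1", "jane"): 0.3}
P_r_given_zi = {(r, "z0", "mi2"): (1.0 if r == 5 else 0.0) for r in (1, 2, 3, 4, 5)}
P_r_given_zi.update({(r, "z1", "mi2"): (1.0 if r == 1 else 0.0) for r in (1, 2, 3, 4, 5)})

rating = expected_rating("jane", "mi2", P_z_given_u, P_r_given_zi)  # 5*0.7 + 1*0.3 = 3.8
```

In a real system the two tables would be fit with EM over the observed rating triples; the prediction step itself stays this simple.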
  • Problems of C.F. • New user problem • New Item problem • Sparsity problem • A user rates only a few items • Unusual user • User whose tastes are unusual compared to the rest of the population 107 22 January 2008
  • Hybrid approaches - Combining Collaborative and Content based methods • Combining predictions of the content based method and C.F. • Implement separate content based and collaborative filtering methods • Combine their predictions using • Linear combination • Voting schemes • Alternatively, select a prediction method based on some confidence measure on the recommendation 108 22 January 2008
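Both combination strategies are a few lines each; the weight alpha and the neighbor-count confidence threshold below are illustrative choices, not values from the tutorial:

```python
def hybrid_score(content_score, cf_score, alpha=0.6):
    """Linear combination of the two predictors (alpha is a tunable weight)."""
    return alpha * content_score + (1 - alpha) * cf_score

def select_prediction(content_pred, cf_pred, n_similar_raters, min_raters=3):
    """Confidence-based selection: fall back to the content-based prediction
    when too few similar users rated the item (low confidence in the CF estimate)."""
    return cf_pred if n_similar_raters >= min_raters else content_pred
```

For example, `select_prediction(3.0, 4.5, n_similar_raters=1)` returns the content-based 3.0 because only one neighbor rated the item.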
  • Hybrid Approaches • Adding content based characteristics into a C.F. based method • Maintain a content based profile for each user • Use these content based profiles (not the commonly rated items) to compute the similarity between users • Then do C.F. • Helps to overcome sparsity related problems, as generally not many items are commonly rated by two users 109 22 January 2008
  • Hybrid approaches • Adding C.F. characteristics into a content based method • The most popular technique in this category is dimensionality reduction on a group of content based profiles • A dimensionality reduction technique like LSA can improve prediction quality by producing a compact representation of the profile 110 22 January 2008
  • Future directions of research (Adomavicius et al.) • Incorporating richer user and item profiles in a unified framework of different methods • Using contextual information in recommendation • Example: when recommending a vacation package, the system should consider • The user • The time of the year • With whom the user plans to travel • Traveling conditions and restrictions at the time • Multi-criteria ratings • E.g. restaurant ratings on three criteria: food, décor and service 111 22 January 2008
  • Future directions of research • Non-intrusiveness • Flexibility • Enabling end-users to customize recommendation • Evaluation • Empirical evaluation on test data that users choose to rate • Items that users choose to rate are likely to be biased • Economics-oriented measures 112 22 January 2008
  • References (Recommender System) • Adomavicius, G., and Tuzhilin, A., “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions”, IEEE Transactions on Knowledge and Data Engineering, 2005 113 22 January 2008
  • Semantics in Personalization Geetha Manjunath Hewlett Packard Labs India
  • Topic Outline • Why use semantic information? • Introduction to Ontology • Formal Specification of an Ontology • A Quick Overview of Semantic Web • Techniques and Approaches • Word Sense Disambiguation • Semantic Profiles • Constrained Spreading Activation • Semantic Similarity • Looking Ahead 115 22 January 2008
  • News Example Revisited • g1: Google Gets Green Light from FTC for DoubleClick Acquisition • g2: Google Closes In on DoubleClick Acquisition • g3: FTC clears Google DoubleClick deal • g4: US regulator clears DoubleClick deal • g5: DoubleClick deal brings greater focus on privacy • e1: EU Agrees to Reduce Aviation Emissions • e2: Aviation to be included in EU emissions trading • e3: EU wants tougher green aviation laws • 116 22 January 2008
  • News Example Modified • g1: Apple Gets Green Light from FTC for TripleClick Acquisition • g2: Apple Closes In on TripleClick Acquisition • g3: FTC clears Apple TripleClick deal • g4: US regulator clears TripleClick deal • g5: TripleClick deal brings greater focus on privacy • e1: EU Agrees to Reduce Aviation Emissions • e2: Aviation to be included in EU emissions trading • e3: EU wants tougher green aviation laws • f1: Apple prices soaring high. • f2: Increased apple rates causes concern to doctors. • f3: Cost of 10 kg of apple to become Rs 1000 from 1 Feb. • (slide callouts: in g1–g5 “Apple” is the IT company — cf. Google — and the deals are acquisitions; in f1–f3 “apple” is the fruit) 117 22 January 2008
  • Semantics for Personalization • (architecture diagram: Data Collection of explicit and implicit user info → Profile Constructor → User Profile → Profile-to-Content Matching of content such as search queries, news, video → Personalized services) • Semantic enhancements at each stage: represent profiles as meaningful domain concepts; use a semantics based matching function; derive implicit info based on domain knowledge; expand the generated profile using domain info; cluster documents based on better user groups 118 22 January 2008
  • Techniques and Approaches 1. Implicit Information based on domain knowledge • Word Sense Disambiguation 2. Represent Profiles as meaningful concepts • Semantic Profiles 3. Semantics based Matching Function • Semantic Distance 4. Expand the generated profile using domain info • Constrained Spreading Activation 5. Cluster documents based on better User groups • Social Semantic Networks 119 22 January 2008
  • Word Sense disambiguation • (WordNet diagram for “Jaguar”: the animal sense follows the hypernym chain Animal → Mammal → Carnivore → Feline → Big cat → Jaguar, with meronyms tail, fur, nail and synonym Panther; the car sense follows Transport → Vehicle → Motor Vehicle → Automobile → Car → Jaguar, with parts accelerator, door, bumper, wheel) 120 22 January 2008
  • Word Sense disambiguation • (diagram for “Apple”: the company sense follows entity → group → organization → institution → company, with additional domain information such as employee, advisory board, stocks, revenue, sales tax, business acquisition; the fruit sense follows entity → substance → solid → food → plant → fruit, with parts such as skin, seed, pulp) 121 22 January 2008
  • Three level Conceptual Network • Domain Ontology • Co-occurrence • synonyms • hyponyms • .. • Hyperlinks • Order of access • Browsed together •… 122 22 January 2008
  • Introduction to Ontologies
  • Views on Ontologies • (spectrum diagram: lightweight, front-end forms — TopicMaps, Thesauri, Taxonomies — support navigation, information retrieval, query expansion and sharing of knowledge; heavyweight, back-end forms — Semantic Networks, Extended ER-Models, Predicate Logic — support queries, consistency checking, EAI mediation and reasoning; Ontologies sit in between) 124 22 January 2008
  • Structure of an Ontology Ontologies typically have two components: • Names for important concepts in the domain • Elephant is a concept whose members are a kind of animal • Herbivore is a concept whose members are exactly those animals who eat only plants or parts of plants • Background knowledge/constraints on the domain • No individual can be both a Herbivore and a Carnivore 125 22 January 2008
  • A Simple Ontology • (diagram: classes Object, Person, Topic, Document; Person writes Document, Person knows Person, Document is about Topic, Topic is described in Document; Student and Researcher are subclasses of Person, PhD Student of Student; similar Topics, e.g. Semantics and Ontology, are linked) 126 22 January 2008
  • Defining Ontology [Gruber, 1993] • An Ontology is a formal specification (→ executable) of a shared (→ by a group of persons) conceptualization (→ about concepts) of a domain of interest (→ application & “unique truth”) • Formal description of concepts and their relationships • Strong basis in the family of first order logics (Description Logics) • Deductive inference based on ground truth of the domain 127 22 January 2008
  • Formal Specification of Ontologies Semantic Web: A quick introduction
  • The Semantic Web Vision Semantic web aims to transform WWW into a global database “The semantic web is a web for computers” 129 22 January 2008
  • Semantic web • Make web resources more accessible to automated processes • Extend existing rendering markup with semantic markup • Metadata annotations that describe the content/function of web accessible resources • Use Ontologies to provide vocabulary for annotations • “Formal specification” is accessible to machines • A prerequisite is a standard web ontology language • Need to agree on a common syntax before we can share semantics • Syntactic web based on standards such as HTTP and HTML 130 22 January 2008
  • Semantic Web Layers • (layer-cake diagram callouts: URIs provide globally unambiguous identifiers; XML provides user definable, domain specific markup; ontologies provide context for the vocabulary) 131 22 January 2008
  • What is RDF ? • RDF – resource description framework • RDF is a data model • Statement-based approach • Subject/predicate/object triples – simple powerful unit • All resources identified by URIs • Triples create a directed labelled graph of • object/attribute/value • (semantic) relationships between objects • RDF model is an abstract layer independent of XML • XML serialization is supported 132 22 January 2008
  • RDF Example • (graph: resource allppt.com/presentation.ppt has properties dc:creator → people.com/../dave_reynolds, dc:date → 2005-09-23, dc:description → “Some starter slides…”; the creator resource has org:email → mailto:dave.reynolds@hp.com)

  <rdf:Description rdf:about="allppt.com/presentation.ppt">
    <dc:creator rdf:resource="people.com/person/dave_reynolds"/>
  </rdf:Description>
  <rdf:Description rdf:about="people.com/person/dave_reynolds">
    <org:email rdf:resource="mailto:dave.reynolds@hp.com"/>
  </rdf:Description>

  • Enables easy merge of information • Indirect metadata (anyone can say anything about anything) • Extensibility (open world assumption, compositional) 133 22 January 2008
  • RDF Schema • Defines a small vocabulary for RDF: • Class, subClassOf, type • Property, subPropertyOf • domain, range • The vocabulary can be used to define other vocabularies for your application domain • (example hierarchy via rdfs:subClassOf: rdfs:Resource ← Veh:MotorVehicle ← Veh:Van, Veh:Truck, Veh:PassengerVehicle ← Veh:MiniVan) 134 22 January 2008
  • OWL – Web Ontology Language • A language to express an ontology • An OWL ontology is an RDF graph • A set of RDF triples • Vocabulary Extension Domain Restrictions/Truth • Structure • Ontology headers Important Concepts of the Domain • Class Axioms • Class Descriptions, Enumeration, Membership Restrictions • Property Axioms • Property Descriptions, Property Restrictions, Functional Spec • Facts about individuals 135 22 January 2008
  • OWL Class Constructors • (figure omitted) 136 22 January 2008
  • The Syntax • Parent = Person with at least one child

  <owl:Class rdf:ID="Parent">
    <owl:intersectionOf>
      <owl:Class rdf:about="#Person"/>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#hasChild"/>
        <owl:minCardinality>1</owl:minCardinality>
      </owl:Restriction>
    </owl:intersectionOf>
  </owl:Class>

  137 22 January 2008
  • OWL Axioms • (figure omitted) 138 22 January 2008
  • SPARQL • RDF Query Language • Triples with unbound variables • Protocol • HTTP binding • SOAP binding • XML Results Format • Easy to transform (XSLT, XQuery) 139 22 January 2008
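At its core, a SPARQL basic graph pattern is triple matching with unbound variables; a toy pure-Python matcher (the `ex:`/`org:` URIs are made up for illustration) shows the idea:

```python
def match(triples, pattern):
    """Match one triple pattern against a list of triples.
    Terms starting with '?' are unbound variables; returns one
    variable-binding dict per matching triple."""
    results = []
    for t in triples:
        binding = {}
        for p, v in zip(pattern, t):
            if p.startswith("?"):
                if binding.get(p, v) != v:   # repeated variable must agree
                    break
                binding[p] = v
            elif p != v:                     # constant term must match exactly
                break
        else:
            results.append(binding)
    return results

graph = [
    ("ex:dave", "dc:creator", "ex:presentation"),
    ("ex:dave", "org:email", "mailto:dave.reynolds@hp.com"),
]
emails = match(graph, ("?s", "org:email", "?o"))
```

A real SPARQL engine joins many such patterns and adds filters, but the binding mechanism is the same.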
  • Why Ontologies? • Enable formalisation of user preferences • Common underlying, interoperable representation • Public vocabulary agreed & shared between different systems • Better content matching & sharing across applications • User interests can be matched to content meaning • Using conceptual reasoning • Richer, more precise, less ambiguous than keyword-based • Provides adequate grounding for hierarchical representation • coarse to fine-grained user interests • Formal, computer processable meaning on the concepts 140 22 January 2008
  • Semantic User Profiles
  • Semantic Profiles • User profile as concepts, inferred from the web pages visited by the user • (concept tree with interest weights: Top → Shopping (W=2), Science (W=0), Sports (W=1), …; Shopping → Books (W=1), Clothes (W=1); Sports → Soccer (W=1), Cricket (W=0)) • How do we map documents/users to concepts? 142 22 January 2008
  • Building concept profiles based on ODP — The Machine Learning Approach • Step 1: Build an ODP classifier for selected ODP categories, trained on those categories' documents • Step 2: Use the user's web pages and the ODP classifier to predict ODP concepts, and add them to the user profile 143 22 January 2008
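The two steps can be sketched with a simple centroid classifier standing in for the trained ODP classifier; the category names, training tokens and visited pages are all hypothetical:

```python
import math
from collections import Counter

def centroid(docs):
    """Average term-frequency vector of a category's training documents."""
    c = Counter()
    for d in docs:
        c.update(d)
    total = sum(c.values())
    return {t: n / total for t, n in c.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Step 1: "train" on hypothetical documents for two ODP categories
training = {
    "Sports/Soccer": [["goal", "match", "league"], ["soccer", "goal", "cup"]],
    "Shopping/Books": [["book", "novel", "price"], ["author", "book", "buy"]],
}
centroids = {cat: centroid(docs) for cat, docs in training.items()}

def classify(page_tokens):
    v = {t: 1.0 for t in page_tokens}
    return max(centroids, key=lambda c: cosine(centroids[c], v))

# Step 2: classify the user's visited pages; category counts form the profile
visited = [["goal", "cup"], ["soccer", "match"], ["book", "price"]]
profile = Counter(classify(p) for p in visited)
```

The real tutorial setup would use a proper text classifier trained on ODP/DMOZ data; the profile-building loop is the same.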
  • Topic Hierarchy from ODP / DMOZ 144 22 January 2008
  • Using Wikipedia to map documents to concepts — A Search Approach • Item: “Sony to slash PlayStation3 price” • Term vector representation: <sony:1>, <slash:1>, <playstation3:1>, <price:1> • Item: “Jittery Sony Knocks $100 Off PS3 Price Tag” • Term vector representation: <jittery:1>, <sony:1>, <knocks:1>, <ps3:1>, <price:1>, <tag:1> • Additional features: titles of the articles retrieved by issuing the item as a query against an index of a Wikipedia dump, e.g. 1. PlayStation Network Platform 2. PlayStation 2 3. Ducks demo 4. PlayStation 3 5. PlayStation 6. Ken Kutaragi 7. PlayStation Portable 8. Console manufacturer 9. Sony Group 10. Crystal Dynamics 11. PlayStation 3 accessories … 145 22 January 2008
  • Profile: Words Vs Concepts

  | TF*IDF based user profile | Wikipedia based user profile |
  |---------------------------|------------------------------|
  | Search                    | Text Retrieval Conference    |
  | Home                      | HTML element                 |
  | Help                      | Bank of America              |
  | News                      | Google search                |
  | Privacy                   | ICICI Bank                   |
  | Google                    | IDBI Bank                    |
  | Terms                     | Bank fraud                   |
  | New                       | Artificial neural network    |
  | Page                      | Web crawler                  |
  | Use                       | Web design                   |
  | Web                       | Debit card                   |
  | View                      | Extensible Markup Language   |
  | Results                   | Hewlett-Packard              |
  | Information               | Microsoft                    |
  | Account                   | XHTML                        |
  |                           | Demand account               |

  146 22 January 2008
  • Semantic Profiles • Vector of weights – representing the intensity of user interest for each concept (-1 to 1) • Content also described by a set of weighted concepts (0 to 1) • Concept profiles can express fine grained interests • Interest in athletes who have won a gold medal • Interest in IT companies which have acquired at least 3 companies in the last one year • Only movies with either Amitabh or Shahrukh 147 22 January 2008
  • Ontology-based Profile Spreading
  • Profile Expansion • Use inference mechanism to enhance personalisation • Synonym expansion • Interest in multiple subclasses implies broader interest • Transitive closure (locatedIn, subtopic) • Interest in superclass leads to potential interest in subclass • Guess changing interest over time 149 22 January 2008
  • Constrained Spreading • (example: activation spreads from Artificial Intelligence → Machine Learning → Neural Networks) 150 22 January 2008
  • Constrained Spreading Activation • Cannot take ‘all’ related data • Commonly used SA models • Distance Constraint • Fan-out Constraint • Path Constraints • App dependent inference rules • Type of relationship • Preferential paths • Activation Constraint • Threshold function at each single node level 151 22 January 2008
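A sketch combining the distance and activation constraints above — activation decays per hop, stops below a threshold, and is limited to a maximum depth. The concept graph, seed weights and constants are illustrative:

```python
def spread(graph, seeds, decay=0.5, threshold=0.2, max_depth=2):
    """Constrained spreading activation over a concept graph.
    graph: concept -> list of related concepts; seeds: concept -> initial weight.
    Activation decays per hop (distance constraint) and stops when the
    outgoing activation falls below threshold (activation constraint)."""
    activation = dict(seeds)
    frontier = list(seeds)
    for _ in range(max_depth):
        nxt = []
        for node in frontier:
            out = activation[node] * decay
            if out < threshold:        # activation constraint: prune weak signals
                continue
            for nb in graph.get(node, []):
                if out > activation.get(nb, 0.0):
                    activation[nb] = out
                    nxt.append(nb)
        frontier = nxt
    return activation

# Matches the slide's example chain AI -> Machine Learning -> Neural Networks
graph = {
    "Artificial Intelligence": ["Machine Learning"],
    "Machine Learning": ["Neural Networks"],
}
act = spread(graph, {"Artificial Intelligence": 1.0})
# act → {"Artificial Intelligence": 1.0, "Machine Learning": 0.5, "Neural Networks": 0.25}
```

Fan-out and path constraints would add per-node out-degree limits and per-relationship decay factors to the same loop.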
  • Learning preferences using semantic links • Two main ways of updating the Concept History Stack • 1. Interest Assumption Completion • Add more potential user interests • Based on hierarchical relationships • Threshold on the value of pseudo-occurrence for insertion • Nocc(C_supertype) = γ × Nocc(C_subtype), where γ < 1 is determined empirically • Based on semantic relationships • All related concepts such that there exists a property p with p(C, C_related) • Pseudo-occurrence Nocc(C_related) = α_i × Nocc(C) 152 22 January 2008
  • Learning preferences using semantic links (contd) • 2. Preference update by expansion • Re-weighting over time • W_new(C_related) = W_old(C_related) + β_i × W_new(C) • β_i – semantic factor that depends on the level of semantic proximity • Directly part of the definition (TBox) • Related through an inferred transitive relation (the number of such links matters) • Notion of semantic distance 153 22 January 2008
  • Semantic Similarity
  • Similarity/Matching • Cosine similarity • U represents the user preference • D represents the content object • Dimension: #concepts in the ontology • similarity(U, D) = cos(U, D) = (U · D) / (‖U‖₂ ‖D‖₂) = Σ_i U_i × D_i / ( √(Σ_i U_i²) × √(Σ_i D_i²) ) 155 22 January 2008
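The same formula with profiles and content as sparse concept-weight maps — user weights in [−1, 1] (negative means dislike), content weights in [0, 1]. The concept names and weights below are illustrative:

```python
import math

def similarity(U, D):
    """cos(U, D) over ontology concepts; missing concepts have weight 0."""
    dot = sum(U[c] * D.get(c, 0.0) for c in U)
    nu = math.sqrt(sum(w * w for w in U.values()))
    nd = math.sqrt(sum(w * w for w in D.values()))
    return dot / (nu * nd) if nu and nd else 0.0

user = {"Soccer": 1.0, "Cricket": -0.5, "Books": 0.5}   # user preference U
doc = {"Soccer": 0.8, "Books": 0.2}                     # content object D

score = similarity(user, doc)
```

A document about a disliked concept (e.g. `{"Cricket": 1.0}`) gets a negative score, which is the point of allowing negative interest weights.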
  • Semantic distance d(x,y,c) • Semantic distance between 2 nodes x and y is defined with respect to a concept, c • Example: a black cat and an orange cat • very similar as instances of the category Animal, since their common catlike properties would be the most significant for distinguishing them from other kinds of animals. • But in the category Cat, they would share their catlike properties with all the other kinds of cats, and the difference in color would be more significant. • In the category BlackEntity, color would be the most relevant property, and the black cat would be closer to a crow or a lump of coal than to the orange cat. 156 22 January 2008
  • Semantic Similarity Mtrl: Material Accm: Accompaniment 157 22 January 2008
  • Using Wordnet (hypernyms) 158 22 January 2008
  • Match all nodes 159 22 January 2008
  • Similarity Formula 160 22 January 2008
  • Looking Ahead
  • Contextual Personalisation • Finer, qualitative, context sensitive activation of user preferences • Notion of a semantic runtime context • Representation: vector of concept weights • Fuzzy semantic intersection between user preferences and the runtime context • Using constrained spreading activation 162 22 January 2008
  • Semantic Social Networking • Identify hidden links between users • Similarity between user preferences • Collaborative Recommender systems • Use of Global Preferences not correct • Partial & Strong Similarities are very useful • Eg: Coinciding interest in cinema but drastically different in sports 163 22 January 2008
  • Semantic Social Networks 164 22 January 2008
  • Microformats • (tag cloud: metadata, social links, geo, outline, hResume, adr, licensing, tags — http://microformats.org) • Microformats are small bits of HTML that represent things like people, events, tags, etc. in web pages. • Building blocks that enable users to own, control, move, and share their data on the Web. • Microformats enable • publishing of higher fidelity information on the Web, • the fastest and simplest way to support feeds and APIs for your website. 165 22 January 2008
  • eRDF • A subset of RDF embedded into XHTML or HTML using common idioms and attributes. • No new elements or attributes are invented, and the usages of the HTML attributes are within normal bounds. • The scheme is designed to work with CSS and other HTML support technologies. • HTML Embeddable RDF: all HTML-embeddable RDF is valid RDF, but not all RDF is embeddable RDF. 167 22 January 2008
  • GRDDL • Gleaning Resource Descriptions from Dialects of Languages • Obtaining RDF data from XHTML pages • Explicitly associated transformation algorithms (XSLT) 168 22 January 2008
  • Acknowledgements • Self-tuning Personalized Information Retrieval in an Ontology-Based Framework, Pablo Castells, Miriam Fernández, David Vallet, et al, OTM Workshop 2005 • An Approach for Semantic Search by Matching RDF Graphs, Haiping Zhu, Jiwei Zhong, Jianming Li and Yong Yu • Semantic Web Tutorials 169 22 January 2008
  • Concluding Remarks • Personalization: An upcoming area of technology • Personalization aims at faster access to information to improve user productivity • Server-side Vs Client-side personalization • Technologies • Machine Learning techniques • Semantic Web • New Markup Languages • Challenges • Understanding the user behaviour, intentions, likes, … • Relating human edited content to the profile 170 22 January 2008
  • Thank you Questions?