  1. 1. Probabilistic Logic Learning Ramya Ramakrishnan Link Mining
  2. 2. Link Mining - Overview <ul><li>Introduction </li></ul><ul><li>Background </li></ul><ul><li>Tasks </li></ul><ul><li>Authorities and Hubs </li></ul><ul><li>Challenges </li></ul><ul><li>Prob. Model of Group Membership and Link Generation </li></ul><ul><li>Summary </li></ul><ul><li>Reference </li></ul>
  3. 3. Introduction <ul><li>Traditional Data Mining – </li></ul><ul><li>Random sample of homogeneous objects from a single relation. </li></ul><ul><li>Big challenge of Data Mining – </li></ul><ul><li>Tackling the problem of mining heterogeneous datasets which are multi-relational. </li></ul><ul><li>Domain consists of a variety of object types and they are linked by – </li></ul><ul><li>- explicit link </li></ul><ul><li>- constructed link </li></ul>
  4. 4. Introduction – Contd... <ul><li>Applying traditional statistical inference procedures, which assume independent instances, to linked data can lead to inappropriate conclusions. </li></ul><ul><li>Potential correlations due to links must be handled appropriately. </li></ul><ul><li>Record linkage should be exploited, </li></ul><ul><li>i.e. link information can be used to improve the predictive accuracy of the learned models. </li></ul>
  5. 5. Link Mining <ul><li>Newly emerging research area at the intersection of work in link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining. </li></ul><ul><li>Instance of multi-relational data mining. </li></ul><ul><li>Encompasses tasks such as descriptive and predictive modelling. </li></ul><ul><li>Examples are: </li></ul><ul><li>- predicting the type of link between two objects </li></ul><ul><li>- inferring the existence of a link </li></ul>
  6. 6. Linked Data <ul><li>Represented as a graph or network </li></ul><ul><ul><li>Nodes are objects </li></ul></ul><ul><ul><ul><li>May have different kinds of objects </li></ul></ul></ul><ul><ul><ul><li>Objects have attributes </li></ul></ul></ul><ul><ul><ul><li>Objects may have labels or classes </li></ul></ul></ul><ul><ul><li>Edges are links </li></ul></ul><ul><ul><ul><li>May have different kinds of links </li></ul></ul></ul><ul><ul><ul><li>Links may have attributes </li></ul></ul></ul><ul><ul><ul><li>Links may be directed, are not required to be binary </li></ul></ul></ul>
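The graph representation described on this slide can be sketched as a small data structure. This is a minimal illustration, not the presentation's own code: the class names `Node` and `Link`, the helper `links_of`, and all attribute values (years, labels) are assumptions made up for the example. A link stores a tuple of members so that it is not restricted to two objects.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """An object in the linked data set (hypothetical structure)."""
    node_id: str
    kind: str                               # object type, e.g. "paper" or "author"
    attributes: dict = field(default_factory=dict)
    label: Optional[str] = None             # optional class label


@dataclass
class Link:
    """A link among objects; `members` may hold more than two node ids."""
    kind: str                               # link type, e.g. "cites" or "author-of"
    members: tuple
    directed: bool = True
    attributes: dict = field(default_factory=dict)


# Illustrative data only: the years and the label are invented.
nodes = {
    "P1": Node("P1", "paper", {"year": 2001}, label="databases"),
    "P2": Node("P2", "paper", {"year": 2003}),
    "A1": Node("A1", "author"),
    "I1": Node("I1", "institution"),
}
links = [
    Link("cites", ("P2", "P1")),
    Link("author-of", ("A1", "P1")),
    Link("author-of", ("A1", "P2")),
    Link("author-affiliation", ("A1", "I1")),
]


def links_of(node_id, kind=None):
    """Return all links that involve `node_id`, optionally filtered by link type."""
    return [l for l in links
            if node_id in l.members and (kind is None or l.kind == kind)]
```

For example, `links_of("A1")` returns the three links in which author A1 takes part, and `links_of("P1", "cites")` returns only the citation from P2 to P1.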
  7. 7. Example: Linked Bibliographic Data [Figure: graph linking papers P1 to P4, author A1 and institution I1.] Objects: Papers, Authors, Institutions. Links: Citation, Co-Citation, Author-of, Author-affiliation.
  8. 8. Background <ul><li>To improve information retrieval results. </li></ul><ul><li>PageRank measure and hubs and authorities scores. </li></ul><ul><li>Hypertext and web page classification. </li></ul><ul><li>Combines techniques of ILP with statistical learning algorithms. </li></ul><ul><li>Identifies certain types of hypertext regularities. </li></ul><ul><li>Identification of communities or groups based on link structure. </li></ul><ul><li>Social and collaborative filtering. </li></ul><ul><li>Probabilistic models for linked data. </li></ul>
  9. 9. Link Mining Tasks <ul><li>Link Based Classification </li></ul><ul><li>Link Based Cluster Analysis </li></ul><ul><li>Identifying Link Type </li></ul><ul><li>Predicting Link Strength </li></ul><ul><li>Link Cardinality </li></ul><ul><li>Record Linkage </li></ul><ul><li>Domains used are web page collections (web), the bibliographic domain (bib) and epidemiological studies (epi). </li></ul>
  10. 10. Link-based Object Classification <ul><li>Predicting the category of an object based on its attributes, its links and the attributes of linked objects </li></ul><ul><li>web : predict the category of a web page </li></ul><ul><li>bib : predict the topic of a paper </li></ul><ul><li>epi : predict disease type based on characteristics of the people </li></ul>
  11. 11. Link Type <ul><li>Predicting type or purpose of link </li></ul><ul><li>web : predict advertising link or navigational link </li></ul><ul><li>bib : predicting whether co-author is also an advisor; predict an advisor-advisee relationship </li></ul><ul><li>epi : predicting whether contact is familial, co-worker or acquaintance </li></ul>
  12. 12. Predicting Link Existence <ul><li>Predicting whether a link exists between two objects </li></ul><ul><li>web : predict whether there will be a link between two pages </li></ul><ul><li>bib : predicting whether a paper will cite another paper </li></ul><ul><li>epi : predicting who a patient’s contacts are </li></ul>
  13. 13. Link Cardinality Estimation I <ul><li>Predicting the number of links to an object </li></ul><ul><li>web : predict the authoritativeness of a page based on the number of in-links; identifying hubs based on the number of out-links </li></ul><ul><li>bib : predicting the impact of a paper based on the number of citations </li></ul><ul><li>epi : predicting the infectiousness of a disease based on the number of people diagnosed. </li></ul>
  14. 14. Link Cardinality Estimation II <ul><li>Predicting the number of objects reached along a path from an object </li></ul><ul><li>Important for estimating the number of objects that will be returned by a query </li></ul><ul><li>web : predicting number of pages retrieved by crawling a site </li></ul><ul><li>bib : predicting the number of citations of a particular author in a specific journal </li></ul><ul><li>epi : predicting the number of elderly contacts for a particular patient. </li></ul>
  15. 15. Object Identity <ul><li>Predicting when two objects are the same, based on their attributes and their links </li></ul><ul><li>e.g. record linkage, duplicate elimination </li></ul><ul><li>web : predict when two sites are mirrors of each other. </li></ul><ul><li>bib : predicting when two citations are referring to the same paper. </li></ul><ul><li>epi : predicting when two disease strains are the same. </li></ul>
  16. 16. Authorities and Hubs [Figure: hub pages pointing to authority pages.]
  17. 17. Authorities and Hubs <ul><li>Authoritative pages are pages that have a large in-degree. </li></ul><ul><li>Hub pages are pages that have links to multiple relevant authoritative pages. </li></ul><ul><li>A good hub is a page that points to many good authorities. </li></ul><ul><li>A good authority is a page that is pointed to by many good hubs. </li></ul><ul><li>Therefore hubs and authorities exhibit a mutually reinforcing relationship. </li></ul>
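The mutually reinforcing relationship can be turned into an iterative computation, as in Kleinberg's hubs-and-authorities (HITS) algorithm listed in the references. The following is a minimal sketch under assumptions: the function name `hits`, the fixed iteration count and the toy graph `web` are illustrative and do not come from the slides.

```python
import math


def hits(out_links, iterations=50):
    """Iteratively compute hub and authority scores.

    `out_links` maps each page to the list of pages it points to.
    """
    nodes = set(out_links) | {t for targets in out_links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {n: sum(auth[t] for t in out_links.get(n, ())) for n in nodes}
        # Normalise both score vectors to unit length so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical toy web: three hub pages pointing to two authority pages.
web = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(web)
```

In this toy graph `a1` receives the highest authority score because all three hubs point to it, and `h1` and `h2` score higher as hubs than `h3` because they point to both authorities.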
  18. 18. Related work <ul><li>For defining notions of standing, impact and influence </li></ul><ul><li>Standing – In the study of social networks </li></ul><ul><li>Impact – In bibliometrics. Also used in hypertext and WWW rankings. </li></ul><ul><li>Influence – In bibliometrics </li></ul>
  19. 19. Challenges <ul><li>Logical vs. Statistical Dependences </li></ul><ul><li>Feature Construction </li></ul><ul><li>Collective Classification </li></ul><ul><li>Effective Use of Unlabeled Data </li></ul><ul><li>Link Prediction </li></ul><ul><li>Object Identity </li></ul><ul><li>Statistical Challenges to Inductive Inference in Linked Data </li></ul><ul><li>Challenges common to any link-based statistical model (Bayesian Logic Programs, Conditional Random Fields, Probabilistic Relational Models, Relational Markov Networks, Relational Probability Trees, Stochastic Logic Programming to name a few) </li></ul>
  20. 20. Logical vs. Statistical Dependences <ul><li>A challenge in link mining and multi-relational data mining is coherently handling two different types of dependence structures: </li></ul><ul><li>- link structure : logical relationships between objects </li></ul><ul><li>- probabilistic dependency : statistical relationships between attributes of objects </li></ul><ul><li>Probabilistic dependence is typically limited to objects that are logically related. </li></ul><ul><li>In learning statistical models for multi-relational data, we must search not only over probabilistic dependencies but also over the different possible logical relationships between objects. </li></ul>
  21. 21. Feature Construction <ul><li>The attributes of an object provide a basic description of the object. </li></ul><ul><li>Traditional classification algorithms were based on these types of object features. </li></ul><ul><li>In a link-based approach, it may also make sense to use attributes of linked objects. Further, if the links themselves have attributes, these may also be used. </li></ul><ul><li>This is the idea behind propositionalisation. </li></ul><ul><li>A main issue is how to deal with relationships that are not one-to-one; it may be appropriate to compute aggregate features over the set of related objects. </li></ul>
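As an illustration of aggregate features over a one-to-many relationship, the sketch below summarises the papers cited by a given paper into a fixed-length feature dictionary. The data, the feature names and the choice of aggregates (count, mode, mean) are assumptions made for the example; the slides do not prescribe particular aggregates.

```python
from collections import Counter
from statistics import mean

# Hypothetical bibliographic data: paper attributes and a citation relation.
papers = {
    "P1": {"topic": "ML", "year": 2000},
    "P2": {"topic": "ML", "year": 2002},
    "P3": {"topic": "DB", "year": 2004},
    "P4": {"topic": "ML", "year": 2005},
}
cites = {"P4": ["P1", "P2", "P3"]}   # P4 cites three papers (one-to-many)


def aggregate_features(paper_id):
    """Summarise the set of cited papers as a flat feature dictionary."""
    cited = [papers[p] for p in cites.get(paper_id, [])]
    if not cited:
        return {"n_cited": 0, "mode_cited_topic": None, "mean_cited_year": None}
    topics = Counter(p["topic"] for p in cited)
    return {
        "n_cited": len(cited),                            # count aggregate
        "mode_cited_topic": topics.most_common(1)[0][0],  # most frequent topic
        "mean_cited_year": mean(p["year"] for p in cited),
    }


features = aggregate_features("P4")
```

The resulting dictionary can be appended to the paper's own attributes, which is what lets a traditional attribute-based classifier use information from linked objects.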
  22. 22. Collective Classification <ul><li>The challenge is classification using a learned model. </li></ul><ul><li>A learned link-based model specifies a distribution over link and content attributes, which may be correlated based on the links between them. </li></ul><ul><li>Intuitively, for linked objects, updating the category of one object can influence our inference about the categories of its linked neighbours. </li></ul><ul><li>Iterative classification algorithms have been proposed for hypertext categorization and relational learning. </li></ul><ul><li>The general idea has been studied in various fields, such as relaxation labeling in computer vision, inference in Markov random fields and loopy belief propagation in Bayesian networks. </li></ul><ul><li>Allows us to learn the notion of hubs. </li></ul>
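The idea that updating one object's category influences its neighbours can be sketched as a simple iterative procedure. This is a deliberately simplified stand-in: a real iterative classifier would apply a learned local model to attributes and neighbour labels, whereas the sketch assumes a plain majority vote over the neighbours' current labels. The function name, the graph and the labels are hypothetical.

```python
from collections import Counter


def iterative_classification(neighbours, known, max_iters=10):
    """Label unlabeled nodes by repeated majority vote over neighbours' labels.

    `neighbours` maps each node to its linked nodes; `known` holds fixed labels.
    """
    labels = dict(known)
    unlabeled = sorted(n for n in neighbours if n not in known)  # fixed visiting order
    for _ in range(max_iters):
        changed = False
        for node in unlabeled:
            votes = Counter(labels[m] for m in neighbours[node] if m in labels)
            if not votes:
                continue                     # no labeled neighbour yet
            best = votes.most_common(1)[0][0]
            if labels.get(node) != best:
                labels[node] = best          # update may affect later neighbours
                changed = True
        if not changed:
            break                            # labels are stable
    return labels


# Hypothetical graph: a, e, f are labeled; b, c, d must be inferred.
neighbours = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d"], "f": ["d"],
}
known = {"a": "ML", "e": "DB", "f": "DB"}
labels = iterative_classification(neighbours, known)
```

Here `b` and `c` inherit the label of `a`, while `d` follows its two `DB` neighbours, and the loop stops once a full pass changes no label.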
  23. 23. Effective use of Unlabeled Data <ul><li>Unique ways in which unlabeled data can be used to improve classification performance in relational domains: </li></ul><ul><li>Links among the unlabeled data (or test set) can provide information that can help with classification. </li></ul><ul><li>Links between the labeled training data and unlabeled data induce dependencies that should not be ignored. </li></ul><ul><li>Just as in the case of the classical machine learning framework, in which there are no links among the data, unlabeled data can help us learn the distribution over object descriptions. </li></ul>
  24. 24. Link Prediction <ul><li>Challenge here is link discovery, or predicting the existence of links between objects. </li></ul><ul><li>A range of tasks that we have described fall under the category of link prediction. </li></ul><ul><li>The difficulty here is that the prior probability of a link among any set of individuals is typically very low. </li></ul><ul><li>A further challenge is the discovery of common relational patterns or subgraphs; some progress has been made but this is an inherently difficult problem. </li></ul>
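A short calculation shows why the low prior probability of a link is a difficulty. The numbers below (1000 objects, 5000 links) are hypothetical, and the sketch assumes undirected binary links so that the candidate set is all unordered pairs.

```python
from math import comb


def link_prior(n_objects, n_links):
    """Fraction of all unordered object pairs that are actually linked."""
    return n_links / comb(n_objects, 2)


# Hypothetical data set: 1000 objects with 5000 observed links.
prior = link_prior(1000, 5000)

# A trivial model that always predicts "no link" is right this often.
no_link_accuracy = 1 - prior
```

With these numbers only about 1% of candidate pairs are linked, so a classifier that never predicts a link is already about 99% accurate, and accuracy alone says little about a link predictor's quality.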
  25. 25. Object Identity <ul><li>Challenge is identity detection. </li></ul><ul><li>How do we infer aliases, i.e. determine that two objects refer to the same individual? </li></ul><ul><li>A related question is whether our statistical models refer explicitly to individuals or only to classes or categories of objects. </li></ul><ul><li>We would like to model that a connection to a particular object or individual is highly predictive. </li></ul><ul><li>On the other hand, we'd like our models to generalize and be applicable to new, unseen objects. </li></ul>
  26. 26. Statistical Challenges to Inductive Inference in Linked Data <ul><li>Statistical dependences </li></ul><ul><li>Sampling density </li></ul><ul><li>Feature combinatorics. </li></ul>
  27. 27. Statistical Dependencies <ul><li>Instance Linkage </li></ul><ul><ul><li>Independent Instances </li></ul></ul><ul><ul><li>Dependent Instances </li></ul></ul>[Figure: independent versus dependent instances, drawn as links between objects A1 ... An and B1 ... Bn.]
  28. 28. Sampling Density [Figure: partial sampling of a graph with nodes A0 to A8.]
  29. 29. Feature Combinatorics <ul><li>Linked data intensify a challenge – adjusting for multiple comparisons. </li></ul><ul><li>Many induction algorithms use a procedure – </li></ul><ul><ul><li>Generate n items </li></ul></ul><ul><ul><li>Calculate a score for each item based on the training set </li></ul></ul><ul><ul><li>Select the item with the maximum score </li></ul></ul><ul><li>Linked data intensify the problem because many more candidate items can be generated. </li></ul><ul><li>Techniques to adjust for multiple comparisons include: </li></ul><ul><ul><li>new data samples </li></ul></ul><ul><ul><li>cross-validation </li></ul></ul><ul><ul><li>randomization tests and </li></ul></ul><ul><ul><li>Bonferroni adjustment. </li></ul></ul>
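The generate, score and select procedure above can be simulated to show why multiple comparisons need adjusting. The sketch below is illustrative only: it assumes n independent candidate items with no real predictive value, each scored by accuracy on coin-flip labels, and all function names and parameter values are made up for the example.

```python
import random


def family_wise_error(alpha, n_items):
    """Chance that at least one of n independent null items passes threshold alpha."""
    return 1 - (1 - alpha) ** n_items


def bonferroni_alpha(alpha, n_items):
    """Per-item significance threshold after Bonferroni adjustment."""
    return alpha / n_items


def max_score_under_null(n_items, n_examples, rng):
    """Best accuracy among n_items random features that guess a coin-flip label."""
    best = 0.0
    for _ in range(n_items):
        correct = sum(rng.random() < 0.5 for _ in range(n_examples))
        best = max(best, correct / n_examples)
    return best


rng = random.Random(0)                       # fixed seed for repeatability
best_of_many = max_score_under_null(1000, 100, rng)

unadjusted = family_wise_error(0.05, 20)
adjusted = family_wise_error(bonferroni_alpha(0.05, 20), 20)
```

Although every simulated item is pure noise, the maximum of 1000 scores lies well above the chance level of 0.5. Likewise, testing 20 items at a threshold of 0.05 each gives a much larger chance of at least one false positive, and dividing the threshold by the number of items brings that chance back under 0.05.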
  30. 30. Probabilistic Model of Group Membership and Link Generation <ul><li>Model considers both observed link evidence and demographic information about the entities. </li></ul><ul><li>Parameters of the model are learned via a maximum likelihood search. </li></ul><ul><li>System takes 2 types of input data: </li></ul><ul><ul><ul><li>A database of entities and their demographic information </li></ul></ul></ul><ul><ul><ul><li>A database of link data </li></ul></ul></ul><ul><li>Outputs a set of group memberships which is used to answer queries such as – </li></ul><ul><ul><ul><li>List all members of group G1 </li></ul></ul></ul><ul><ul><ul><li>List all the groups for which E1 and E2 are both members </li></ul></ul></ul><ul><ul><ul><li>List a set of suspected aliases (entities that are in the same group(s), but never appear in the same link). </li></ul></ul></ul>
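The three example queries can be sketched over a set of group memberships once the model has produced them. Everything here is hypothetical: the membership sets, the link list and the function names are invented for illustration, and the alias query assumes one reading of the slide, namely pairs that share at least one group but never appear together in a link.

```python
from itertools import combinations

# Hypothetical model output: group -> set of member entities.
memberships = {
    "G1": {"E1", "E2", "E3"},
    "G2": {"E2", "E4"},
    "G3": {"E1", "E2"},
}
# Hypothetical link data: each link is the set of entities that appear in it.
links = [{"E1", "E2"}, {"E2", "E4"}, {"E2", "E3"}]


def members(group):
    """List all members of a group."""
    return sorted(memberships[group])


def shared_groups(a, b):
    """List all groups of which both entities are members."""
    return sorted(g for g, m in memberships.items() if a in m and b in m)


def suspected_aliases():
    """Pairs that share a group but never co-occur in any link."""
    linked = {frozenset(p) for link in links
              for p in combinations(sorted(link), 2)}
    pairs = set()
    for group_members in memberships.values():
        for pair in combinations(sorted(group_members), 2):
            if frozenset(pair) not in linked:
                pairs.add(pair)
    return sorted(pairs)
```

With this data, `members("G1")` lists E1, E2 and E3, `shared_groups("E1", "E2")` returns G1 and G3, and `suspected_aliases()` flags the pair E1 and E3, which share G1 but never appear in the same link.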
  31. 31. Probabilistic Model of Group Membership and Link Generation [Figure: diagram of the demographic model (one classifier p(Member Gi | demographics) for each group G1 to G6), the link model, the group membership chart and the link data. Dashed borders mark hidden information; solid borders mark observed data.] Demographic data (Person, Age, Job, Nationality): Atkins, 24, Teacher, Britain; Brown, 34, Clerk, USA; Chapman, 30, Driver, USA; Dickens, 18, Student, France. Link model (Link type, P l, P R): Phone, 0.03, 0.03; Meeting, 0.20, 0.20; Money, 0.01, 0.01; Email, 0.05, 0.05. Chart (Person by Group G1 to G6): rows Atkins, Brown, Chapman, Dickens, with * marking group membership. Link data (Persons, Type): {Atkins, Chapman}, Money; {Brown, Dickens}, Meeting; {Atkins, Brown}, Email; ...
  32. 32. Summary <ul><li>Link mining </li></ul><ul><ul><li>exciting new research area </li></ul></ul><ul><ul><li>poses new statistical modeling challenges </li></ul></ul><ul><li>Link mining task should inform our choice of: </li></ul><ul><ul><li>Link-based statistical model </li></ul></ul><ul><ul><li>visualization </li></ul></ul>
  33. 33. Reference <ul><li>L. Getoor. Link Mining: A New Data Mining Challenge. SIGKDD Explorations, volume 4, issue 2, 2003. </li></ul><ul><li>P. Domingos and M. Richardson. Mining the network value of the customers. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 2001. </li></ul><ul><li>D. Jensen. Statistical challenges to inductive inferences in linked data. In Seventh International Workshop on AI and Statistics, 1999. </li></ul><ul><li>J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999. </li></ul><ul><li>J. Kubica, A. Moore, J. Schneider and Y. Yang. Stochastic link and group detection. In Proceedings of AAAI-02, 2002. </li></ul><ul><li>L. Getoor, E. Segal, B. Taskar, D. Koller. Probabilistic Models of Text and Link Structure for Hypertext Classification. IJCAI Workshop on "Text Learning: Beyond Supervision", Seattle, WA, August 2001. </li></ul>