Lise Getoor, "Link Mining"

  1. 1. Link Mining. Lise Getoor, University of Maryland, College Park. August 22, 2012
  2. 2. Alternate Title: What Machine Learning / Statistics / Data Mining can do for YOU! 1. Predict future values, 2. Fill in missing values, 3. Identify anomalies (supervised learning); 4. Find patterns, 5. Identify clusters (unsupervised learning). What are some common machine learning algorithms?
  3. 3. So, what's Link Mining? Machine learning when you have graphs (or networks). Nodes are entities: People, Places, Organizations, Text. Links are relationships: Friends, MemberOf, LivesIn, Tweeted, Posted. E.g., heterogeneous multi-relational data, multimodal data, etc.
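To make that representation concrete, here is a minimal sketch of such a heterogeneous, multi-relational graph using networkx. The node names, attribute keys, and relation labels (alice, bob, acme, tweet_1, kind, relation) are invented for illustration and are not from the talk.

```python
import networkx as nx

# Nodes are entities of different types; links are typed relationships.
# All names, attribute keys, and relation labels are invented examples.
G = nx.MultiDiGraph()
G.add_node("alice", kind="person")
G.add_node("bob", kind="person")
G.add_node("acme", kind="organization")
G.add_node("tweet_1", kind="text")

G.add_edge("alice", "bob", relation="Friends")
G.add_edge("alice", "acme", relation="MemberOf")
G.add_edge("bob", "tweet_1", relation="Tweeted")

# Which entities is alice directly related to, and through which relations?
for _, neighbor, data in G.out_edges("alice", data=True):
    print(neighbor, data["relation"])
```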
  4. 4. Ex: Social Media Relationships. User-User: Friends, Collaborators, Family, Fan/Follower, Replies, Co-Edits, Co-Mentions, etc. User-Doc: Comments, Edits, etc. User-Query-Click (user, query, clicked URL). User-Tag-Doc (user, tag, document).
  5. 5. Link Mining Tasks: Node Labeling, Link Prediction, Entity Resolution, Group Detection
  6. 6. Node Labeling: What is Harry's political persuasion? (example nodes: Harry, Natasha)
  7. 7. Link Prediction Friends?
  8. 8. Entity Resolution. Aka: deduplication, co-reference resolution, record linkage, reference consolidation, etc.
  9. 9. Abstract Problem Statement: the real world vs. the digital world of records / mentions
  10. 10. Deduplication Problem Statement: Cluster the records/mentions that correspond to the same entity
  11. 11. Deduplication Problem Statement: Cluster the records/mentions that correspond to the same entity. Intensional variant: compute a cluster representative.
  12. 12. Record Linkage Problem Statement: Link records that match across databases (A and B)
  13. 13. Reference Matching Problem: Match noisy records to clean records in a reference table
  14. 14. InfoVis Co-Author Network Fragment: before and after
  15. 15. Group Detection
  16. 16. Link Mining Algorithms: Node Labeling, Link Prediction, Entity Resolution, Group Detection
  17. 17. Link Mining Algorithms: Node Labeling (1. Relational Classifiers, 2. Collective Classifiers), Link Prediction, Entity Resolution, Group Detection
  18. 18. Relational Classifiers. Given: a graph of entities with attributes and relationships. Task: predict an attribute of some of the entities. Alternate task: predict the existence of a relationship between entities. Features include local features (an entity's own attribute values) and relational features such as same attribute value, average value of neighbors, number of shared neighbors, and number of neighbors participating in a relation.
  19. 19. Relational Classifiers. Values are represented as a fixed-length feature vector. Instances are treated independently of each other. Relational features are computed by aggregating over related entities. Any classification or regression model can be used for learning and prediction.
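A minimal sketch of such a relational classifier, assuming toy data structures (value, label, and neighbors dictionaries keyed by node, with binary 0/1 labels). The feature choices mirror the examples on the previous slide, and scikit-learn's LogisticRegression stands in for "any classification or regression model".

```python
# Minimal sketch over assumed toy data structures:
#   value[n]     - a numeric attribute of node n
#   label[n]     - its class (0/1), or None if unknown
#   neighbors[n] - the set of related nodes
import numpy as np
from sklearn.linear_model import LogisticRegression

def feature_vector(n, value, label, neighbors):
    """Fixed-length vector: local features plus aggregates over related entities."""
    nbrs = neighbors[n]
    nbr_labels = [label[m] for m in nbrs if label[m] is not None]
    return np.array([
        value[n],                                            # local attribute
        np.mean([value[m] for m in nbrs]) if nbrs else 0.0,  # avg value of neighbors
        float(len(nbrs)),                                    # degree
        np.mean(nbr_labels) if nbr_labels else 0.5,          # fraction of label-1 neighbors
    ])

def train_relational_classifier(train_nodes, value, label, neighbors):
    # Instances are treated independently; any off-the-shelf model can be used.
    X = np.vstack([feature_vector(n, value, label, neighbors) for n in train_nodes])
    y = np.array([label[n] for n in train_nodes])
    return LogisticRegression().fit(X, y)
```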
  20. 20. Application Case Studies. Two example applications that use relational classifiers; the focus is on the types of relational features used. Case Study 1: Predicting the click-through rate of search result ads. Case Study 2: Predicting friendships in a social network.
  21. 21. Case Study 1: Predicting Ad Click-Through Rate. Task: Predict the click-through rate (CTR) of an online ad, given that it is seen by the user, where the ad is described by: the URL to which the user is sent when clicking on the ad, the bid terms used to determine when to display the ad, and the title and text of the ad. Our description is based on the approach of [Richardson et al., WWW07].
  22. 22. Relational Features Used: for a new ad, the average CTR and count of existing ads reached through bid-term relations: contains-bid-term (according to the search engine), related-bid-term (containing subsets or supersets of the term), and queried-bid-term.
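As a hedged illustration of this kind of feature (not the actual features or code from [Richardson et al., WWW07]), the sketch below computes the count and average CTR of existing ads reached through a contains-bid-term style relation; ads_with_term and ctr are hypothetical lookup tables.

```python
# Hypothetical lookup tables:
#   ads_with_term[t] - set of existing ad ids whose bid terms include t
#   ctr[a]           - observed click-through rate of existing ad a
def bid_term_features(new_ad_terms, ads_with_term, ctr):
    related = set()
    for term in new_ad_terms:               # contains-bid-term style relation
        related |= ads_with_term.get(term, set())
    if not related:
        return {"related_count": 0, "related_avg_ctr": None}
    return {
        "related_count": len(related),                                   # Count
        "related_avg_ctr": sum(ctr[a] for a in related) / len(related),  # Average CTR
    }
```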
  23. 23. Case Study 2: Predicting Friendships. Task: Predict new friendships among users, based on their descriptive attributes, their existing friendships, and their family ties. Our description is based on the approach of [Zheleva et al., SNAKDD08].
  24. 24. Relational Features Used: "Petworks", social networks of pets. Features for a candidate pair include friend counts and density, counts and proportions of friends with a given attribute, the Jaccard coefficient of the friend sets, in-family, and same-breed.
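A small sketch of the pairwise features listed on the slide for a candidate friendship (u, v); the friends, family, and breed dictionaries are hypothetical stand-ins for the Petworks data.

```python
# Hypothetical data: friends[u] is a set of pet ids, family[u] and breed[u]
# are attribute values.
def friendship_features(u, v, friends, family, breed):
    fu, fv = friends[u], friends[v]
    union = fu | fv
    return {
        "common_friends": len(fu & fv),                          # count
        "jaccard": len(fu & fv) / len(union) if union else 0.0,  # Jaccard coefficient
        "in_family": int(family[u] == family[v]),
        "same_breed": int(breed[u] == breed[v]),
    }
```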
  25. 25. Key Idea: Feature Construction. Feature informativeness is key to the success of a relational classifier. Features can be: attributes of the entity/entities, match predicates on attributes of entities, attributes of related entities, encodings of structural features, or measures of overlap between sets.
  26. 26. Link Mining Algorithms: Node Labeling (1. Relational Classifiers, 2. Collective Classifiers), Link Prediction, Entity Resolution, Group Detection
  27. 27. Collective Classification [Neville & Jensen, SRL00; Lu & Getoor, ICML03; Sen et al., AI Mag 08]. Extends relational classifiers by allowing relational features to be functions of the predicted attributes/relations of neighbors. At training time, these features are computed based on observed values in the training set. At inference time, the algorithm iterates, computing relational features based on the current predictions for any unobserved attributes. In the first, bootstrap, iteration only local features are used.
  28. 28. CC: Learning. Learn models (local and relational) from a fully labeled training set.
  29. 29. CC: Inference (1). Step 1: Bootstrap using entity attributes only.
  30. 30. CC: Inference (2). Step 2: Iteratively update the category of each entity, based on related entities' categories.
  31. 31. CC Key Idea. Rather than make predictions independently, begin with a relational classifier and then 'propagate' the classification. Variations: propagate probabilities rather than the mode (related to Gibbs sampling); batch vs. incremental updates; ordering strategies. Active areas of research: active learning, semi-supervised learning, more principled joint probabilistic models, etc.
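A hedged sketch of the iterative inference loop described on the preceding slides, in the spirit of ICA-style collective classification. Here local_model and relational_model are assumed pre-trained classifiers (the local model trained on the local attribute alone), and feature_vector is the relational feature builder sketched earlier.

```python
# Hedged sketch of ICA-style collective inference, continuing the earlier
# feature_vector sketch. local_model and relational_model are assumptions.
def collective_inference(unlabeled, value, label, neighbors,
                         local_model, relational_model, n_iters=10):
    pred = dict(label)                     # start from the observed labels

    # Step 1: bootstrap using entity attributes only.
    for n in unlabeled:
        pred[n] = local_model.predict([[value[n]]])[0]

    # Step 2: iteratively update each entity's label from its neighbors'
    # current predictions (a fixed number of sweeps; ordering strategies vary).
    for _ in range(n_iters):
        for n in unlabeled:
            x = feature_vector(n, value, pred, neighbors)
            pred[n] = relational_model.predict([x])[0]
    return pred
```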
  32. 32. Link Mining Algorithms: Node Labeling, Link Prediction, Entity Resolution, Group Detection
  33. 33. The Entity Resolution Problem. Example: entities James Smith, John Smith, and Jonathan Smith, with mentions such as "John Smith", "Jim Smith", "J Smith", "James Smith", "Jon Smith", and "Jonthan Smith". Issues: 1. Identification, 2. Disambiguation.
  34. 34. Relational Identification Very similar names. Added evidence from shared co-authors
  35. 35. Relational Disambiguation Very similar names but no shared collaborators
  36. 36. Collective Entity Resolution: one resolution provides evidence for another => joint resolution
  37. 37. P1: "JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines", C. Walshaw, M. Cross, M. G. Everett, S. Johnson. P2: "Partitioning & Mapping of Unstructured Meshes to Parallel Machine Topologies", C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus. P3: "Dynamic Mesh Partitioning: A Unified Optimisation and Dynamic Load-Balancing Algorithm", C. Walshaw, M. Cross, M. G. Everett. P4: "Code Generation for Machines with Multiregister Operations", Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman. P5: "Deterministic Parsing of Ambiguous Grammars", A. Aho, S. Johnson, J. Ullman. P6: "Compilers: Principles, Techniques, and Tools", A. Aho, R. Sethi, J. Ullman.
  38. 38. (The same publication list as slide 37, repeated.)
  39. 39.-42. Relational Clustering (RC-ER): slides 39-42 show the author references from P1, P2, P4, and P5 (C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus; Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman; A. Aho, J. Ullman, S. Johnson) being progressively merged into entity clusters.
  43. 43. Cut-based Formulation of RC-ER: two alternative clusterings of the Johnson references are compared. One gives a good separation of attributes but many cluster-cluster relationships (Aho-Johnson1, Aho-Johnson2, Everett-Johnson1, Everett-Johnson2); the other is worse in terms of attributes but has fewer cluster-cluster relationships (Aho-Johnson1, Everett-Johnson2).
  44. 44. Objective Function. Minimize $\sum_{i,j} \big( w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, \mathrm{sim}_R(c_i, c_j) \big)$, where $w_A$ is the weight for attribute similarity, $\mathrm{sim}_A$ is the similarity based on attributes, $w_R$ is the weight for relations, and $\mathrm{sim}_R$ is the similarity based on the relational edges between $c_i$ and $c_j$. Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function, $\Delta(c_i, c_j) = w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, |N(c_i) \cap N(c_j)|$, i.e., similarity of attributes plus the common cluster neighborhood.
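A small sketch of the merge score above, assuming attr_sim is some attribute-similarity function and neighbors maps each cluster to the set of its neighboring clusters; the weights are placeholders, not values from the talk.

```python
# attr_sim(ci, cj) stands in for whatever name-similarity measure is used;
# neighbors[c] is the set of cluster ids related to cluster c.
def merge_score(ci, cj, attr_sim, neighbors, w_a=0.5, w_r=0.5):
    common = neighbors[ci] & neighbors[cj]          # common cluster neighborhood
    return w_a * attr_sim(ci, cj) + w_r * len(common)
```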
  45. 45. Relational Clustering Algorithm: 1. Find similar references using 'blocking'. 2. Bootstrap clusters using attributes and relations. 3. Compute similarities for cluster pairs and insert them into a priority queue. 4. Repeat until the priority queue is empty: 5. find the 'closest' cluster pair; 6. stop if the similarity is below a threshold; 7. merge to create a new cluster; 8. update similarities for 'related' clusters. This is an O(n k log n) algorithm with an efficient implementation.
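A hedged sketch of the greedy loop enumerated above (steps 3-8), using a priority queue of candidate cluster pairs. Here blocking_pairs, the cluster bookkeeping, and merge_score (assumed to be a two-argument scoring function, e.g. a bound version of the one sketched above) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the greedy relational clustering loop (steps 3-8 above).
# Python's heapq is a min-heap, so similarity scores are negated.
import heapq

def relational_clustering(clusters, blocking_pairs, merge_score, threshold):
    # Step 3: score candidate cluster pairs (produced by blocking) into a queue.
    heap = [(-merge_score(ci, cj), ci, cj) for ci, cj in blocking_pairs]
    heapq.heapify(heap)

    while heap:                                    # Step 4: repeat until empty
        neg_score, ci, cj = heapq.heappop(heap)    # Step 5: 'closest' pair
        if -neg_score < threshold:                 # Step 6: stop below threshold
            break
        if ci not in clusters or cj not in clusters:
            continue                               # skip stale entries for merged clusters
        clusters[ci] = clusters[ci] | clusters.pop(cj)   # Step 7: merge into ci
        # Step 8: re-score pairs involving the merged cluster. (A real
        # implementation would restrict this to 'related' clusters and
        # invalidate outdated queue entries.)
        for ck in clusters:
            if ck != ci:
                heapq.heappush(heap, (-merge_score(ci, ck), ci, ck))
    return clusters
```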
  46. 46. Evaluation Datasets. CiteSeer: 1,504 citations to machine learning papers (Lawrence et al.); 2,892 references to 1,165 author entities. arXiv: 29,555 publications from High Energy Physics (KDD Cup '03); 58,515 references to 9,200 authors. Elsevier BioBase: 156,156 biology papers (IBM KDD Challenge '05); 831,991 author references; keywords, topic classifications, language, country, and affiliation of the corresponding author, etc.
  47. 47. Baselines. A: pair-wise duplicate decisions with attributes only (names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler; other textual attributes: TF-IDF). A*: transitive closure over A. A+N: add attribute similarity of co-occurring references. A+N*: transitive closure over A+N. Evaluation: pair-wise decisions over references, F1-measure (harmonic mean of precision and recall).
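A sketch of the pairwise evaluation described here: every pair of references is a duplicate/non-duplicate decision, and precision, recall, and F1 are computed over those decisions. The predicted and truth arguments are hypothetical mappings from reference id to cluster id.

```python
# Sketch of pairwise precision/recall/F1 over reference pairs.
from itertools import combinations

def pairwise_prf(refs, predicted, truth):
    tp = fp = fn = 0
    for a, b in combinations(refs, 2):
        pred_same = predicted[a] == predicted[b]
        true_same = truth[a] == truth[b]
        tp += pred_same and true_same      # pair correctly placed together
        fp += pred_same and not true_same  # pair wrongly placed together
        fn += true_same and not pred_same  # pair wrongly kept apart
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```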
  48. 48. ER over Entire Dataset (pairwise F1). A: CiteSeer 0.980, arXiv 0.976, BioBase 0.568. A*: CiteSeer 0.990, arXiv 0.971, BioBase 0.559. A+N: CiteSeer 0.973, arXiv 0.938, BioBase 0.710. A+N*: CiteSeer 0.984, arXiv 0.934, BioBase 0.753. RC-ER: CiteSeer 0.995, arXiv 0.985, BioBase 0.818. RC-ER outperforms the baselines on all datasets; collective resolution is better than naive relational resolution.
  49. 49. ER over Entire Dataset (same results as the previous slide). CiteSeer: near-perfect resolution, 22% error reduction. arXiv: 6,500 additional correct resolutions, 20% error reduction. BioBase: biggest improvement over the baselines.
  50. 50. Flipside….
  51. 51. Privacy breaches in OSNs. Identity disclosure: a mapping from a record to a specific individual (who is this user?). Attribute disclosure: finding an attribute value that the user intended to keep private (is this user liberal?). Social link disclosure: participation in a sensitive relationship or communication (are these users friends?). Affiliation link disclosure: participation in a group revealing a sensitive attribute value (e.g., a group supporting gay marriage).
  52. 52. Other Linqs Projects: Key Opinion Leader Identification; Active Surveying in Social Networks; Ontology Alignment and Folksonomy Construction; Label Acquisition & Active Learning in Network Data; Inference & Search in Camera Networks; Identifying Roles in Social Networks; Group Recommendation in Social Networks; Social Search; Analysis of Dynamic Networks (loyalty, stability, diversity); Ranking and Retrieval in Biological Networks; Discourse-Level Sentiment Analysis; Bilingual Word Sense Disambiguation; Visual Analytics (D-Dupe, C-Group, G-Pare); and others. http://www.cs.umd.edu/linqs
  53. 53. (The same list of Linqs projects as the previous slide.)
  54. 54. Conclusion. Link mining algorithms can be useful tools for social media. We need algorithms that can handle the multi-modal, multi-relational, temporal nature of social media. Collective algorithms make use of structure to define features and propagate information, allowing us to improve overall accuracy. While there are important pitfalls to take into account (confidence and privacy), there are many potential benefits and payoffs (improved personalization and context-aware predictions!).
  55. 55. http://www.cs.umd.edu/linqs. Work sponsored by the National Science Foundation, Maryland Industrial Partners (MIPS), the National Geospatial Agency, the Air Force Research Laboratory, DARPA, Google, Microsoft, and Yahoo!
