Lise Getoor, "Link Mining"
    Presentation Transcript

    • Link Mining. Lise Getoor, University of Maryland, College Park. August 22, 2012
    • Alternate title: What Machine Learning/Statistics/Data Mining can do for YOU! 1. Predict future values 2. Fill in missing values 3. Identify anomalies (Supervised Learning) 4. Find patterns 5. Identify clusters (Unsupervised Learning). What are some common machine learning algorithms?
    • So, what's Link Mining? Machine learning when you have graphs (or networks). Nodes are entities: People, Places, Organizations, Text. Links are relationships: Friends, MemberOf, LivesIn, Tweeted, Posted. E.g., heterogeneous multi-relational data, multimodal data.
    • Ex: Social Media Relationships. User-User: Friends, Collaborators, Family, Fan/Follower, Replies, Co-Edits, Co-Mentions, etc. User-Doc: Comments, Edits, etc. User-Query-Click. User-Tag-Doc.
    • Link Mining Tasks: Node Labeling, Link Prediction, Entity Resolution, Group Detection.
    • Node Labeling. Example: What is Harry's political persuasion?
    • Link Prediction. Example: Friends?
    • Entity Resolution. Also known as: deduplication, co-reference resolution, record linkage, reference consolidation, etc.
    • Abstract Problem Statement: mapping between the Real World and the Digital World (records / mentions).
    • Deduplication Problem Statement: Cluster the records/mentions that correspond to the same entity. Intensional variant: compute a cluster representative.
    • Record Linkage Problem Statement: Link records that match across databases A and B.
    • Reference Matching Problem: Match noisy records to clean records in a reference table.
    • InfoVis Co-Author Network Fragment: before and after resolution.
    • Group Detection
    • Link Mining Algorithms: Node Labeling, Link Prediction, Entity Resolution, Group Detection.
    • Link Mining Algorithms: Node Labeling (1. Relational Classifiers, 2. Collective Classifiers), Link Prediction, Entity Resolution, Group Detection.
    • Relational Classifiers. Given: a graph of entities and relationships. Task: predict an attribute of some of the entities. Alternate task: predict the existence of a relationship between entities. Features include local features (e.g., same-attribute-value) and relational features (e.g., average value of neighbors, number of shared neighbors, number of neighbors participating in a relation).
    • Relational Classifiers Values are represented as a fixed-length feature vector Instances are treated independently of each other Relational features are computed by aggregating over related entities Any classification or regression model can be used for learning and prediction
    • Application Case Studies. Two example applications that use relational classifiers; the focus is on the types of relational features used. Case Study 1: predicting the click-through rate of search-result ads. Case Study 2: predicting friendships in a social network.
    • Case Study 1: Predicting Ad Click-Through Rate. Task: predict the click-through rate (CTR) of an online ad, given that it is seen by the user, where the ad is described by: the URL to which the user is sent when clicking on the ad; the bid terms used to determine when to display the ad; the title and text of the ad. Our description is based on the approach of [Richardson et al., WWW07].
    • Relational Features Used: average CTR and count over related ads, where relatedness is via contains-bid-term (according to the search engine), related-bid-term (containing subsets or supersets of the term), and queried-bid-term.
    • Case Study 2: Predicting Friendships. Task: predict new friendships among users, based on their descriptive attributes, their existing friendships, and their family ties. Our description is based on the approach of [Zheleva et al., SNAKDD08].
    • Relational Features Used “Petworks” - social networks of pets count, density P3 P8 count, proportion P6 P9 P4 count count t P7 P5 P10 P1 P2 Friends? P11 F1 Jaccard coeff in-family F2 same-breed same breed
    • Key Idea: Feature Construction. Feature informativeness is key to the success of a relational classifier. Features can be: attributes of an entity or entities; a match predicate on attributes of entities; attributes of related entities; encodings of structural features; based on overlap in sets.
    • Link Mining Algorithms: Node Labeling (1. Relational Classifiers, 2. Collective Classifiers), Link Prediction, Entity Resolution, Group Detection.
    • Collective Classification [Neville & Jensen, SRL00; Lu & Getoor, ICML03; Sen et al., AI Mag08]. Extends relational classifiers by allowing relational features to be functions of predicted attributes/relations of neighbors. At training time, these features are computed based on observed values in the training set. At inference time, the algorithm iterates, computing relational features based on the current prediction for any unobserved attributes. In the first (bootstrap) iteration, only local features are used.
    • CC: Learning. Learn models (local and relational) from a fully labeled training set.
    • CC: Inference (1). Step 1: Bootstrap using entity attributes only.
    • CC: Inference (2). Step 2: Iteratively update the category of each entity, based on related entities' categories.
    • CC Key Idea. Rather than make predictions independently, begin with a relational classifier, and then 'propagate' the classification. Variations: propagate probabilities rather than the mode (related to Gibbs sampling); batch vs. incremental updates; ordering strategies. Active area of research: active learning, semi-supervised learning, more principled joint probabilistic models, etc.
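The bootstrap-then-iterate scheme can be sketched in a few lines. The "classifier" here, a majority vote over neighbors' current binary labels with a fallback to the local prediction, is an illustrative stand-in for the learned relational model; the graph and labels are toy data:

```python
# Minimal sketch of iterative collective classification: bootstrap labels
# from local attributes only, then repeatedly re-predict each node from
# its neighbors' *current* predicted labels until nothing changes.
# Labels are assumed binary (0/1); the voting rule is illustrative.

def collective_classify(graph, local_pred, n_iters=10):
    """graph: node -> set of neighbors; local_pred: node -> bootstrap label."""
    labels = dict(local_pred)          # step 1: bootstrap from local features
    for _ in range(n_iters):           # step 2: iterative updates
        new = {}
        for node, nbrs in graph.items():
            votes = [labels[n] for n in nbrs]
            if votes and votes.count(1) != votes.count(0):
                # Majority label among neighbors.
                new[node] = 1 if votes.count(1) > votes.count(0) else 0
            else:
                # Tie or no neighbors: keep the local prediction.
                new[node] = local_pred[node]
        if new == labels:              # converged
            break
        labels = new
    return labels

graph = {"p1": {"p2", "p3"}, "p2": {"p1", "p3"}, "p3": {"p1", "p2"}}
local = {"p1": 1, "p2": 1, "p3": 0}
print(collective_classify(graph, local))  # {'p1': 1, 'p2': 1, 'p3': 1}
```

In this toy run, p3's local prediction (0) is overturned because both of its neighbors are predicted 1, which is exactly the "propagate classification" behavior the slide describes; replacing hard labels with probabilities gives the Gibbs-sampling-flavored variant mentioned above.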
    • Link Mining Algorithms: Node Labeling, Link Prediction, Entity Resolution, Group Detection.
    • The Entity Resolution Problem. Mentions such as "John Smith", "Jim Smith", "J Smith", "James Smith", "Jon Smith", and "Jonthan Smith" may refer to James Smith, John Smith, or Jonathan Smith. Issues: 1. Identification. 2. Disambiguation.
    • Relational Identification: very similar names; added evidence from shared co-authors.
    • Relational Disambiguation: very similar names, but no shared collaborators.
    • Collective Entity Resolution: one resolution provides evidence for another => joint resolution.
    • P1: "JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines", C. Walshaw, M. Cross, M. G. Everett, S. Johnson. P2: "Partitioning & Mapping of Unstructured Meshes to Parallel Machine Topologies", C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus. P3: "Dynamic Mesh Partitioning: A Unified Optimisation and Dynamic Load-Balancing Algorithm", C. Walshaw, M. Cross, M. G. Everett. P4: "Code Generation for Machines with Multiregister Operations", Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman. P5: "Deterministic Parsing of Ambiguous Grammars", A. Aho, S. Johnson, J. Ullman. P6: "Compilers: Principles, Techniques, and Tools", A. Aho, R. Sethi, J. Ullman.
    • Relational Clustering (RC-ER). P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson. P2: C. Walshaw, M. Cross, M. Everett, S. Johnson, K. McManus. P4: Alfred V. Aho, Jefferey D. Ullman, Stephen C. Johnson. P5: A. Aho, J. Ullman, S. Johnson.
    • Cut-based Formulation of RC-ER. One clustering gives good separation of attributes but many cluster-cluster relationships; the alternative is worse in terms of attributes but has fewer cluster-cluster relationships.
    • Objective Function. Minimize: Σ_{i,j} [ w_A · sim_A(c_i, c_j) + w_R · sim_R(c_i, c_j) ], where w_A weights the similarity of attributes, w_R weights the similarity of relations, and sim_R is based on the edges between c_i and c_j. Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function, Δ(c_i, c_j) = w_A · sim_A(c_i, c_j) + w_R · |N(c_i) ∩ N(c_j)| (similarity of attributes plus common cluster neighborhood).
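The greedy merge score Δ(c_i, c_j) can be computed directly from two cluster records. The weights, the cluster representation, and the token-Jaccard attribute similarity below are illustrative stand-ins (the talk's baselines use Soft-TFIDF and edit-distance measures for names):

```python
# Sketch of the greedy merge score: attribute similarity of two clusters
# plus the size of their common cluster neighborhood.
# Weights and the token-Jaccard similarity are illustrative choices.

def attr_sim(ci, cj):
    """Jaccard similarity over the name tokens appearing in two clusters."""
    ti = {tok for name in ci["names"] for tok in name.split()}
    tj = {tok for name in cj["names"] for tok in name.split()}
    return len(ti & tj) / len(ti | tj)

def merge_score(ci, cj, w_attr=0.5, w_rel=0.5):
    """Delta(c_i, c_j) = w_A * sim_A(c_i, c_j) + w_R * |N(c_i) & N(c_j)|."""
    common = len(ci["neighbors"] & cj["neighbors"])
    return w_attr * attr_sim(ci, cj) + w_rel * common

# Two candidate clusters of author references (toy data).
c1 = {"names": {"S. Johnson"}, "neighbors": {"Walshaw", "Cross"}}
c2 = {"names": {"S. Johnson", "Stephen C. Johnson"},
      "neighbors": {"Aho", "Ullman"}}
print(merge_score(c1, c2))  # 0.25: token overlap 2/4, no shared neighbors
```

Note how the relational term captures the slide's intuition: with no shared cluster neighborhood, two "S. Johnson" clusters score low despite similar names, which is exactly the Walshaw-vs-Aho disambiguation shown in the paper examples.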
    • Relational Clustering Algorithm: 1. Find similar references using 'blocking'. 2. Bootstrap clusters using attributes and relations. 3. Compute similarities for cluster pairs and insert into a priority queue. 4. Repeat until the priority queue is empty: 5. Find the 'closest' cluster pair. 6. Stop if similarity is below threshold. 7. Merge to create a new cluster. 8. Update similarity for 'related' clusters. O(nk log n) algorithm with an efficient implementation.
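Steps 3 through 8 of the loop above can be sketched with a heap and lazy invalidation of stale entries. The `sim` function and max-linkage cluster score below are illustrative stand-ins for the RC-ER score (and blocking is omitted), so this is a sketch of the control flow, not the full algorithm:

```python
import heapq

# Sketch of the priority-queue loop: repeatedly pop the closest cluster
# pair, stop when the best pair falls below threshold, merge, and refresh
# scores of related pairs by re-inserting them (stale entries are skipped).

def cluster(items, sim, threshold):
    clusters = {i: {x} for i, x in enumerate(items)}

    def score(a, b):
        # Max-linkage similarity between two clusters (a stand-in score).
        return max(sim(x, y) for x in clusters[a] for y in clusters[b])

    heap = []
    ids = list(clusters)
    for i in ids:                      # step 3: seed the priority queue
        for j in ids:
            if i < j:
                heapq.heappush(heap, (-score(i, j), i, j))
    while heap:                        # step 4
        neg, a, b = heapq.heappop(heap)          # step 5: closest pair
        if a not in clusters or b not in clusters:
            continue                   # stale entry for an already-merged cluster
        if -neg < threshold:
            break                      # step 6: stop below threshold
        clusters[a] |= clusters.pop(b)           # step 7: merge
        for c in clusters:                       # step 8: update related pairs
            if c != a:
                heapq.heappush(heap, (-score(a, c), min(a, c), max(a, c)))
    return list(clusters.values())

names = ["S. Johnson", "Stephen Johnson", "A. Aho"]

def sim(x, y):
    tx, ty = set(x.split()), set(y.split())
    return len(tx & ty) / len(tx | ty)

print(cluster(names, sim, threshold=0.3))
```

The lazy-invalidation trick (pop, then check whether both clusters still exist) is a standard way to get the O(nk log n) behavior the slide claims without rebuilding the queue after every merge.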
    • Evaluation Datasets. CiteSeer: 1,504 citations to machine learning papers (Lawrence et al.); 2,892 references to 1,165 author entities. arXiv: 29,555 publications from High Energy Physics (KDD Cup '03); 58,515 references to 9,200 authors. Elsevier BioBase: 156,156 Biology papers (IBM KDD Challenge '05); 831,991 author references; keywords, topic classifications, language, country, and affiliation of the corresponding author, etc.
    • Baselines A: Pair-wise duplicate decisions w/ attributes only  Names: Soft-TFIDF with Levenstein, Jaro, Jaro-Winkler  Other textual attributes: TF-IDF A*: Transitive closure over A A+N: Add attribute similarity of co-occurring refs A+N*: Transitive closure over A+N Evaluate pair-wise decisions over references F1-measure F1 measure (harmonic mean of precision and recall)
    • ER over Entire Dataset.
      Algorithm | CiteSeer | arXiv | BioBase
      A         | 0.980    | 0.976 | 0.568
      A*        | 0.990    | 0.971 | 0.559
      A+N       | 0.973    | 0.938 | 0.710
      A+N*      | 0.984    | 0.934 | 0.753
      RC-ER     | 0.995    | 0.985 | 0.818
      RC-ER outperforms the baselines on all datasets. Collective resolution is better than naïve relational resolution.
    • ER over Entire Dataset (results as above). CiteSeer: near-perfect resolution; 22% error reduction. arXiv: 6,500 additional correct resolutions; 20% error reduction. BioBase: biggest improvement over the baselines.
    • Flipside….
    • Privacy Breaches in OSNs. Identity disclosure: a mapping from a record to a specific individual ("Who is …?"). Attribute disclosure: finding an attribute value that the user intended to keep private ("Is … liberal?"). Social link disclosure: participation in a sensitive relationship or communication ("Friends?"). Affiliation link disclosure: participation in a group revealing a sensitive attribute value ("Support gay marriage?").
    • Other Linqs Projects: Key Opinion Leader Identification; Active Surveying in Social Networks; Ontology Alignment and Folksonomy Construction; Label Acquisition & Active Learning in Network Data; Inference & Search in Camera Networks; Identifying Roles in Social Networks; Group Recommendation in Social Networks; Social Search; Analysis of Dynamic Networks (loyalty, stability, diversity); Ranking and Retrieval in Biological Networks; Discourse-Level Sentiment Analysis; Bilingual Word Sense Disambiguation; Visual Analytics (D-Dupe, C-Group, G-Pare); and others. http://www.cs.umd.edu/linqs
    • Conclusion. Link mining algorithms can be useful tools for social media. We need algorithms that can handle the multi-modal, multi-relational, temporal nature of social media. Collective algorithms make use of structure to define features and propagate information, allowing us to improve overall accuracy. While there are important pitfalls to take into account (confidence and privacy), there are many potential benefits and payoffs (improved personalization and context-aware predictions!).
    • http://www.cs.umd.edu/linqs. Work sponsored by the National Science Foundation, Maryland Industrial Partners (MIPS), National Geospatial Agency, Air Force Research Laboratory, DARPA, Google, Microsoft, and Yahoo!