
Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF

Abstract:
One of the challenges in big data analytics lies in being able to reason collectively about extremely large, heterogeneous, incomplete, noisy, interlinked data. We need data science techniques that can represent and reason effectively with this form of rich, multi-relational graph data. In this presentation, I will describe some common collective inference patterns needed for graph data, including: collective classification (predicting missing labels for nodes in a network), link prediction (predicting potential edges), and entity resolution (determining when two nodes refer to the same underlying entity). I will describe three key capabilities required: relational feature construction, collective inference, and scaling. Finally, I will briefly describe some of the cutting-edge analytic tools being developed within the machine learning, AI, and database communities to address these challenges.


  1. Big Graph Data Science. Lise Getoor, University of California, Santa Cruz. MLconf SF, November 14, 2014
  2. BIG Data is not flat (image © 2004–2013 lonnitaylor)
  3. Data is multi-modal, multi-relational, spatio-temporal, multi-media
  4. NEED: Data Science for Graphs. New V: Vinculate
  5. Patterns · Key Ideas · Tools
  6. Patterns · Key Ideas · Tools
  7. ☐ Collective Classification ☐ Link Prediction ☐ Entity Resolution
  8. Collective Classification: inferring the labels of nodes in a graph
  9. Collective Classification [figure: a social network with spouse, friend, and colleague edges; Question: which of two labels does each node take?]
  10. Collective Classification [figure: the same network, with some node labels observed]
  11. Collective Classification [figure: the same network, with three node labels still unknown]
  12. ☑ Collective Classification ☐ Link Prediction ☐ Entity Resolution
  13. Link Prediction: inferring the existence of edges in a graph
  14. Link Prediction. Entities: people, emails. Observed relationships: communications, co-location. Predicted relationships: supervisor, subordinate, colleague. [figure: email communication network]
  15. ☑ Collective Classification ☑ Link Prediction ☐ Entity Resolution
  16. Entity Resolution: determining which nodes refer to the same underlying entity
  17. Ironically, Entity Resolution has many duplicate names: identity uncertainty, doubles, record linkage, duplicate detection, deduplication, reference reconciliation, object identification, object consolidation, coreference resolution, entity clustering, household matching, householding, reference matching, fuzzy match, approximate match, merge/purge, hardening soft databases
  18. Entity Resolution & Network Analysis. See the tutorial on Entity Resolution in Big Data with Ashwin Machanavajjhala at KDD 2013 (slides available). [figure: a network before and after entity resolution]
  19. ☑ Collective Classification ☑ Link Prediction ☑ Entity Resolution
  20. Graph Identification. Goal: given an input graph, infer an output graph. Three major components: Entity Resolution (ER) infers the set of nodes; Link Prediction (LP) infers the set of edges; Collective Classification (CC) infers the node labels. Challenge: the components are intra- and inter-dependent. [Namata, Kok, Getoor, KDD 2011]
  21. Patterns · Key Ideas · Tools
  22. ☐ Feature Construction ☐ Collective Reasoning ☐ Blocking
  23. Key Idea #1: Relational Feature Construction
  24. Key Idea: Feature Construction. Feature informativeness is key to the success of a relational classifier. Relational feature construction covers node-specific measures (aggregates that summarize attributes of relational neighbors; structural properties that capture characteristics of the relational structure) and node-pair measures that summarize properties of (potential) edges (attribute-based, edge-based, and neighborhood-similarity measures); see the sketch below.
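For concreteness, here is a minimal sketch of these feature families in Python with networkx; the "label" attribute and the label value "A" are hypothetical stand-ins, not anything from the talk.

```python
# Minimal sketch of relational feature construction (illustrative only).
import networkx as nx
from collections import Counter

def node_features(G, node):
    # Aggregates: summarize attributes of relational neighbors.
    labels = [G.nodes[n].get("label") for n in G.neighbors(node)]
    labels = [l for l in labels if l is not None]
    counts = Counter(labels)
    return {
        "frac_label_A": counts.get("A", 0) / max(len(labels), 1),
        # Structural properties: characteristics of the relational structure.
        "degree": G.degree(node),
        "clustering": nx.clustering(G, node),
    }

def pair_features(G, u, v):
    # Neighborhood similarity for a (potential) edge: Jaccard overlap.
    nu, nv = set(G.neighbors(u)), set(G.neighbors(v))
    return {"jaccard": len(nu & nv) / max(len(nu | nv), 1)}
```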
  25. Relational Classifiers. Pros: efficient, can handle large amounts of data (features can often be pre-computed ahead of time); flexible, can take advantage of well-understood classification/regression algorithms; one of the most commonly used ways of incorporating relational information. Cons: features are based on observed values/evidence and cannot be based on attributes or relations that are being predicted; makes incorrect independence assumptions; hard to impose global constraints on joint assignments (for example, when inferring a hierarchy of individuals, we may want to enforce the constraint that it is a tree). A minimal pipeline appears below.
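A hedged sketch of that pipeline, using networkx's built-in karate-club graph as a stand-in dataset and logistic regression as the off-the-shelf classifier (in practice one would hold out test nodes rather than fit on everything):

```python
# Pre-compute relational features, then train a standard classifier.
import networkx as nx
from sklearn.linear_model import LogisticRegression

G = nx.karate_club_graph()  # each node carries an observed "club" attribute

def feats(n):
    nbrs = list(G.neighbors(n))
    # Aggregate of observed neighbor labels (evidence only, per the cons above).
    frac = sum(G.nodes[m]["club"] == "Mr. Hi" for m in nbrs) / len(nbrs)
    return [G.degree(n), nx.clustering(G, n), frac]

nodes = list(G.nodes)
X = [feats(n) for n in nodes]
y = [G.nodes[n]["club"] for n in nodes]
clf = LogisticRegression().fit(X, y)
```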
  26. ☑ Feature Construction ☐ Collective Reasoning ☐ Blocking
  27. Collective Classification [figure: a single person whose label is unknown]
  28. Collective Classification. Donates(A, [candidate]) => Votes(A, [candidate]) : 5.0; Mentions(A, "Affordable Health") => Votes(A, [candidate]) : 0.3 [figure: donation, tweet, and status-update evidence about one person]
  29. Collective Classification [figure: the social network with spouse, friend, and colleague edges]
  30. Collective Classification. vote(A,P) ∧ friend(B,A) → vote(B,P) : 0.3; vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8 [figure: the same network]
  31. Collective Classification [figure: the same network with all node labels inferred jointly]
  32. Link Prediction. People, emails, words, communications, relations. Use the model to express dependencies: "If email content suggests type X, it is of type X"; "If A sends deadline emails to B, then A is the supervisor of B"; "If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues". [figure: email network]
  33. Link Prediction (cont.). HasWord(E, "due") => Type(E, deadline) : 0.6 [figure: an email asking to complete something by a due date]
  34. Link Prediction (cont.). Sends(A,B,E) ∧ Type(E, deadline) => Supervisor(A,B) : 0.8
  35. Link Prediction (cont.). Supervisor(A,B) ∧ Supervisor(A,C) => Colleague(B,C) : ∞ (a hard constraint); a sketch of how such rules are relaxed follows below
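How a weighted rule is scored is easiest to see in the Łukasiewicz relaxation that PSL uses (slide 53 gives the formal grounding). A small sketch with hypothetical truth values: a rule's weighted "distance to satisfaction" becomes a hinge penalty, and weight ∞ makes it a hard constraint.

```python
# Łukasiewicz relaxation over truth values in [0, 1] (illustrative values).
def land(x, y):
    # Łukasiewicz conjunction: max(0, x + y - 1).
    return max(0.0, x + y - 1.0)

def distance(body, head):
    # Distance to satisfaction of body -> head: max(0, body - head).
    return max(0.0, body - head)

# Sends(A,B,E) ∧ Type(E, deadline) => Supervisor(A,B) : 0.8, with
# hypothetical truth values Sends = 0.9, Type = 0.8, Supervisor = 0.4.
body = land(0.9, 0.8)                 # 0.7
penalty = 0.8 * distance(body, 0.4)   # 0.8 * 0.3 = 0.24
```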
  36. Entity Resolution. Entities: people references. Attributes: name. Relationships: friendship. Goal: identify references that denote the same person. [figure: references A–H, two named "John Smith" and "J. Smith", with friend edges; = marks matching references]
  37. Entity Resolution. References, names, friendships. Use the model to express dependencies: "If two people have similar names, they are probably the same"; "If two people have similar friends, they are probably the same"; "If A=B and B=C, then A and C must also denote the same person". [same figure]
  38. Entity Resolution (cont.). A.name ≈{str_sim} B.name => A≈B : 0.8
  39. Entity Resolution (cont.). {A.friends} ≈ {B.friends} => A≈B : 0.6
  40. Entity Resolution (cont.). A≈B ∧ B≈C => A≈C : ∞ (transitivity as a hard constraint); see the sketch below
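A plain-Python illustration of the three dependencies (not the PSL model itself): difflib's ratio stands in for the unspecified str_sim, Jaccard overlap for friend-set similarity, and union-find enforces the transitivity constraint on hard matches. Treating similarities as cutoffs is a deliberate simplification of weighted probabilistic inference.

```python
# Illustrative entity-resolution rules as thresholded similarities.
from difflib import SequenceMatcher

def name_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def friend_sim(fa, fb):
    return len(fa & fb) / max(len(fa | fb), 1)

parent = {}
def find(x):
    parent.setdefault(x, x)
    if parent[x] != x:
        parent[x] = find(parent[x])   # path compression
    return parent[x]

def match(x, y):
    parent[find(x)] = find(y)         # A≈B ∧ B≈C => A≈C via union-find

sim = name_sim("John Smith", "J. Smith")   # ≈ 0.78
if sim > 0.7:                              # illustrative cutoff
    match("A", "B")
```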
  41. Challenges. Collective classification (labeling nodes in a graph): irregular structure, not a chain, not a grid; one large, partially labeled graph. Link prediction (predicting edges in a graph): dependencies among edges; we don't want to reason about all possible edges; scaling and extremely skewed probabilities. Entity resolution (determining which nodes refer to the same entities): dependencies between clusters; enforcing constraints such as transitive closure.
  42. ☑ Feature Construction ☑ Collective Reasoning ☐ Blocking
  43. Blocking: Motivation. Naïve pairwise matching needs |N|² comparisons: with 1,000 business listings from each of 1,000 different cities (one million mentions), that is 1 trillion comparisons, or 11.6 days if each comparison takes 1 μs. Mentions from different cities are unlikely to be matches, so with city as the blocking criterion only 1 billion comparisons remain, about 16 minutes at 1 μs each.
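The arithmetic is easy to sanity-check with a trivial snippet (1 μs per comparison, as on the slide):

```python
mentions = 1_000 * 1_000            # 1,000 listings in each of 1,000 cities
naive = mentions ** 2               # all-pairs: 10**12 comparisons
blocked = 1_000 * 1_000 ** 2        # only within-city pairs: 10**9
print(naive / 1e6 / 86_400)         # ≈ 11.6 days at 1 μs each
print(blocked / 1e6 / 60)           # ≈ 16.7 minutes at 1 μs each
```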
  44. Blocking: Motivation. Mentions from different cities are unlikely to be matches, but any blocking criterion may miss potential matches.
  45. Blocking: Motivation [figure: Venn diagram of the set of all pairs of nodes, the matching pairs, and the pairs satisfying the blocking criterion]
  46. Blocking Algorithms (1): hash-based blocking. Each block Ci is associated with a hash key hi; mention x is hashed to Ci if hash(x) = hi; within a block, all pairs are compared; each hash function results in disjoint blocks (sketch below). What hash function? A deterministic function of attribute values; Boolean functions over attribute values [Bilenko et al., ICDM '06; Michelson et al., AAAI '06; Das Sarma et al., CIKM '12]; minHash (min-wise independent permutations) [Broder et al., STOC '98]; locality-sensitive hashing.
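A minimal sketch of hash-based blocking; the block key here (first three characters of the lowercased name) is an arbitrary illustrative choice, not a recommendation from the talk.

```python
# Hash-based blocking: disjoint blocks, all-pairs comparison inside each.
from collections import defaultdict
from itertools import combinations

def hash_blocking(mentions, key=lambda m: m["name"].lower()[:3]):
    blocks = defaultdict(list)
    for m in mentions:
        blocks[key(m)].append(m)               # hash(x) decides x's block
    for block in blocks.values():
        yield from combinations(block, 2)      # compare within blocks only

mentions = [{"name": "John Smith"}, {"name": "Johnny S."}, {"name": "J. Smith"}]
pairs = list(hash_blocking(mentions))  # pairs only "John Smith" / "Johnny S."
```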
  47. Blocking Algorithms (2): pairwise-similarity / neighborhood-based blocking. Nearby nodes according to a similarity metric are clustered together, resulting in non-disjoint canopies. Techniques: the sorted-neighborhood approach [Hernandez et al., SIGMOD '95] (sketch below); canopy clustering [McCallum et al., KDD '00].
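And a correspondingly small sketch of the sorted-neighborhood approach, where the sorting key and the window size w=3 are illustrative assumptions:

```python
# Sorted-neighborhood blocking: sort by a key, compare within a window.
def sorted_neighborhood(mentions, key=lambda m: m["name"].lower(), w=3):
    ordered = sorted(mentions, key=key)
    for i, m in enumerate(ordered):
        for n in ordered[i + 1 : i + w]:       # sliding window of size w
            yield m, n
```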
  48. ☑ Feature Construction ☑ Collective Reasoning ☑ Blocking
  49. Patterns · Key Ideas · Tools
  50. http://psl.umiacs.umd.edu. Contributors: Stephen Bach, Matthias Broecheler, Alex Memory, Lily Mihalkova, Stanley Kok, Bert Huang, Angelika Kimmig, Ben London, Arti Ramesh, Jay Pujara, Shobeir Fakhraei, Hui Miao
  51. Probabilistic Soft Logic (PSL), http://psl.umiacs.umd.edu. A declarative language based on logic for expressing collective probabilistic inference problems: a predicate is a relationship or property; an atom is a (continuous) random variable; a rule captures a dependency or constraint; a set defines aggregates. PSL program = rules + input DB.
  52. Collective Classification in PSL. vote(A,P) ∧ friend(B,A) → vote(B,P) : 0.3; vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8 [figure: the same social network]; see the sketch below
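For completeness, here are the two voting rules written against the pslpython package, a later Python front end to PSL; the talk itself used the Groovy-based release at psl.umiacs.umd.edu, so take this interface as an assumption rather than the talk's tooling. Data loading and the inference call are omitted.

```python
# Sketch: the slide's two rules as a pslpython model (requires the
# pslpython package and a JVM; API details are an assumption here).
from pslpython.model import Model
from pslpython.predicate import Predicate
from pslpython.rule import Rule

model = Model('voter-cc')
model.add_predicate(Predicate('Friend', closed=True, size=2))   # observed
model.add_predicate(Predicate('Spouse', closed=True, size=2))   # observed
model.add_predicate(Predicate('Vote', closed=False, size=2))    # inferred

model.add_rule(Rule('0.3: Vote(A, P) & Friend(B, A) -> Vote(B, P) ^2'))
model.add_rule(Rule('0.8: Vote(A, P) & Spouse(B, A) -> Vote(B, P) ^2'))
```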
  53. PSL Foundations. PSL makes large-scale reasoning scalable by mapping logical rules to convex functions. Three lines of work justify this mapping: linear-programming relaxations of MAX SAT with approximation guarantees [Goemans and Williamson '94]; pseudomarginal LP relaxations of Boolean Markov random fields [Wainwright et al. '02]; Łukasiewicz logic, a logic for reasoning about continuous values [Klir and Yuan '95].
  54. Hinge-loss Markov Random Fields. P(Y|X) = (1/Z) exp( −∑_j λ_j φ_j(Y, X) ), where each φ_j(Y, X) = (max{ℓ_j(Y, X), 0})^p_j is a hinge-loss potential over continuous variables in [0, 1], the λ_j are rule weights, and Z is the normalization constant.
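To read the formula, here is one hinge-loss potential evaluated on hypothetical numbers; p_j = 2 gives the squared hinge that PSL's ^2 rules correspond to.

```python
import math

def phi(y, x, p=2):
    # Hinge-loss potential for the linear function l(y, x) = x - y.
    return max(x - y, 0.0) ** p

lam, x, y = 0.8, 0.9, 0.4
density = math.exp(-lam * phi(y, x))   # unnormalized: missing the 1/Z factor
```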