Neo4j Data Science Presentation

Presentation of Neo4j and Graph Algorithms.

Published in: Engineering

  1. 1. About Me: Max De Marzi, Neo4j Field Engineer • github.com/maxdemarzi (about 200 public repositories) • maxdemarzi.com • @maxdemarzi
  2. 2. Experience Technical Doesn’t Matter 75 % 50% 95%
  3. 3. You go home, thinking about graphs All that matters You
  4. 4. Property Graph Model It’s super simple. All you get is: Nodes • Relationships • Properties
  5. 5. What you (probably) already know:
  6. 6. The Problem • Joins are executed every time you query the relationship • Executing a Join means to search for a key • B-Tree Index: O(log(n)), so when your data grows by 10x, your time goes up by one step on each Join • More Data = More Searches = Slower Performance
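The "one step per 10x" claim can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a hypothetical B-tree with a fanout of 10 (real indexes use much larger fanouts) and counts how many levels a key search descends. The function names are made up for this example.

```python
def btree_lookup_steps(rows: int, fanout: int = 10) -> int:
    """Approximate number of levels searched to find one key.

    Assumes a perfectly balanced tree with the given (hypothetical) fanout.
    """
    steps = 0
    capacity = 1
    while capacity < rows:
        capacity *= fanout
        steps += 1
    return steps


def join_cost(rows: int, joins: int, fanout: int = 10) -> int:
    """Every join pays for one index search, so the cost adds up per join."""
    return joins * btree_lookup_steps(rows, fanout)


if __name__ == "__main__":
    for rows in (1_000, 10_000, 100_000, 1_000_000):
        print(rows, btree_lookup_steps(rows), join_cost(rows, joins=3))
```

Each 10x growth in rows adds one step per search, and a query with three joins pays that extra step three times.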
  7. 7. Same Data, Different Layout No more Tables, no more Foreign Keys, no more Joins
  8. 8. Relational Databases can’t handle Relationships • Wrong Model: They cannot model or store relationships without complexity • Degraded Performance: Speed plummets as data grows and as the number of joins grows • Wrong Language: SQL was built with Set Theory in mind, not Graph Theory • Not Flexible: New types of data and relationships require schema redesign
  9. 9. NoSQL Databases can’t handle Relationships • Wrong Model: They cannot model or store relationships without complexity • Degraded Performance: Speed plummets as you try to join data together in the application • Wrong Languages: Lots of wacky “almost SQL” languages terrible at “joins” • Not ACID: Eventually Consistent means Eventually Corrupt
  10. 10. What does this mean?
  11. 11. Double Linked List Relationships
  12. 12. What’s Our Secret Sauce?
  13. 13. Fixed Sized Records • “Joins” on Creation • Spin Spin Spin through this data structure • Pointers instead of Lookups
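A rough way to picture "pointers instead of lookups" is a pair of record arrays in which every relationship record links to the next relationship of its start node and of its end node. The sketch below is a simplified illustration, not Neo4j's actual store format: the class and field names are assumptions, and the "previous" pointers of the doubly linked list are left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterator, List

NO_REL = -1  # sentinel meaning "end of chain"


@dataclass
class NodeRecord:
    first_rel: int = NO_REL  # id of the most recently added relationship


@dataclass
class RelRecord:
    start: int
    end: int
    start_next: int = NO_REL  # next relationship in the start node's chain
    end_next: int = NO_REL    # next relationship in the end node's chain


class TinyGraph:
    def __init__(self) -> None:
        self.nodes: List[NodeRecord] = []
        self.rels: List[RelRecord] = []

    def add_node(self) -> int:
        self.nodes.append(NodeRecord())
        return len(self.nodes) - 1

    def add_rel(self, start: int, end: int) -> int:
        # The "join" happens once, at creation: the new record is spliced
        # onto the front of both nodes' relationship chains.
        rel_id = len(self.rels)
        self.rels.append(RelRecord(start, end,
                                   start_next=self.nodes[start].first_rel,
                                   end_next=self.nodes[end].first_rel))
        self.nodes[start].first_rel = rel_id
        self.nodes[end].first_rel = rel_id
        return rel_id

    def neighbors(self, node: int) -> Iterator[int]:
        # Spin through the chain by following record ids; no index search.
        rel_id = self.nodes[node].first_rel
        while rel_id != NO_REL:
            rel = self.rels[rel_id]
            if rel.start == node:
                yield rel.end
                rel_id = rel.start_next
            else:
                yield rel.start
                rel_id = rel.end_next
```

Because records sit at fixed positions, following a pointer is an array access by id, so the cost of visiting a node's neighbors depends on how many relationships that node has, not on the total size of the store.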
  14. 14. Real Time Query Performance: remains steady as database grows. Chart: Response Time vs. Connectedness and Size of Data Set (from 0 to 2 hops, 0 to 3 degrees, thousands of connections up to tens to hundreds of hops, thousands of degrees, billions of connections), comparing Relational and Other NoSQL Databases with Neo4j. Neo4j is 1000x faster • Reduces minutes to milliseconds
  15. 15. But not for every query: I don’t know the average height of all Hollywood actors, but I do know the Six Degrees of Kevin Bacon
  16. 16. Reimagine your Data as a Graph • Right Model: Graphs simplify how you think • Better Performance: Query relationships in real time • Right Language: Cypher was purpose built for Graphs • Flexible and Consistent: Evolve your schema seamlessly while keeping transactions • Agile, High Performance and Scalable without Sacrifice
  17. 17. Modeling
  18. 18. Graphs are Whiteboard Friendly: Just draw stuff and “voilà”, there is your data model
  19. 19. Movie Property Graph Some Models are Easy
  20. 20. Should Roles be their own Node? Some Models are Easy but not for all Questions
  21. 21. How do you model Flight Data?
  22. 22. Airports Nodes with Flying To Relationships How do you model Flight Data?
  23. 23. Maybe Flight should be its own Node? How do you model Flight Data?
  24. 24. Don’t we care about Flights only on particular Days? How do you model Flight Data?
  25. 25. What is this trick with the date in the relationship type? How do you model Flight Data?
  26. 26. We don’t need Airports if we model this way! How do you model Flight Data?
  27. 27. Let’s get Creative
  28. 28. Group Destinations together! How do you model Flight Data?
  29. 29. OMG WAT! How do you model Flight Data?
  30. 30. Do not try and bend the data. That’s impossible.
  31. 31. If they can do it, you can do it! How do you model Comic Books?
  32. 32. More Modeling
  33. 33. Cloning Twitter Building a News Feed 9:00 am @hipster This is what I had for breakfast! <Insert Image of squirrel food> 8:30 am @neo4j Automated tweet telling me about Graph Connect 2017 in NYC on Oct 23-24 8:12 am @ex-coworker Stuff I no longer care about. 8:03 am @someguy Inspirational Quote of the Day
  34. 34. How do others do it? Cloning Twitter
  35. 35. How do others do it? Cloning Twitter
  36. 36. The Wrong Way Modeling a Twitter Feed
  37. 37. A Better Way Modeling a Twitter Feed
  38. 38. Bigger Model Modeling a Twitter Feed
  39. 39. Neo4j Secret Sauce Yet Again • Fixed Sized Records • “Joins” on Creation • Spin Spin Spin through this data structure • Pointers instead of Lookups
  40. 40. MAKE THE QUERIES SCALE …and the database scales with them. …and that’s why we don’t make any money.
  41. 41. SCALING OUT IS IN FASHION But when your model and your query match you don’t have to.
  42. 42. getDegree is your Friend
  43. 43. This is Java. What happened to Cypher?
  44. 44. Java Core API
  45. 45. Easy to Learn (no really) Java Core API • Step by Step from GraphDatabaseService • Start a transaction (reads and writes) • findNode(Label, Property, Value) • findNodes(Label, Property, Value) • findNodes(Label) • getNodeById(Long) • getRelationships(Direction, Type) • getProperty(Property, (optional) Default Value)
  46. 46. Get friends of a User Java Core API
  47. 47. Traversal API
  48. 48. Interesting to Learn Traversal API • Start with the Simple Defaults (order, relationships, depth, uniqueness, etc) • Custom Expanders • Where should I go next • Custom Evaluators • I’ve gone there… should I accept this path?
  49. 49. Example Traversal API
  50. 50. Cypher
  51. 51. Cypher: Powerful and Expressive Query Language MATCH (:Person {name:"Dan"})-[:LOVES]->(:Person {name:"Ann"}) Diagram: node Dan, LOVES relationship, node Ann; each node pattern carries a Label (Person) and a Property (name)
  52. 52. Express Complex Queries Easily with Cypher: find all direct reports and how many people they manage, up to 3 levels down. MATCH (boss)-[:MANAGES*0..3]->(sub),
 (sub)-[:MANAGES*1..3]->(report)
 WHERE boss.name = "John Doe"
 RETURN sub.name AS Subordinate, count(report) AS Total
 (The slide shows this Cypher Query next to the equivalent SQL Query.)
  53. 53. Cypher Stored Procedures
  54. 54. Combine any APIs Cypher Stored Procedures
  55. 55. Switch to Decision Tree Deck
  56. 56. Use Cases
  57. 57. Understanding User Behavior • Events • Metrics • Targeting • Searching • Purchase History
  58. 58. Learn from the Experts • Alex Beutel, CMU • Leman Akoglu, Stony Brook • Christos Faloutsos, CMU • Graph-Based User Behavior Modeling: From Prediction to Fraud Detection • http://www.cs.cmu.edu/~abeutel/kdd2015_tutorial/
  59. 59. User Behavior Challenges • How can we understand normal user behavior?
  60. 60. User Behavior Challenges • How can we understand normal user behavior? • How can we find suspicious behavior?
  61. 61. User Behavior Challenges • How can we understand normal user behavior? • How can we find suspicious behavior? • How can we distinguish the two?
  62. 62. Does your little girl like Rambo?
  63. 63. Demographics: Age
  64. 64. Demographics: Gender
  65. 65. Do Little Girls like Movies other Little Girls Like?
  66. 66. Yes! Little Girls like Movies other Little Girls Like
  67. 67. What do Little Girls Like? MATCH (u:User)-[r:RATED]->(m:Movie)
 WHERE u.age = 1 AND u.gender = "F" AND r.stars > 3
 RETURN m.title, COUNT(r) AS cnt
 ORDER BY cnt DESC
 LIMIT 10
  68. 68. What do Little Girls Like?
  69. 69. What do Men 25-34 Like? MATCH (u:User)-[r:RATED]->(m:Movie)
 WHERE u.age = 25 AND u.gender = "M" AND r.stars > 3
 RETURN m.title, COUNT(r) AS cnt
 ORDER BY cnt DESC
 LIMIT 10
  70. 70. What do Men 25-34 Like?
  71. 71. Modeling “Normal” Behavior • Predict Edges
 (Similar Users)
  72. 72. Modeling “Normal” Behavior • Predict Edges
 (Movies I should Watch)
  73. 73. What Rating should I give 101 Dalmatians? MATCH (me:User {id:1})-[r1:RATED]->(m:Movie)
 <-[r2:RATED]-(:User)-[r3:RATED]->
 (m2:Movie {title:"101 Dalmatians"})
 WHERE ABS(r1.stars-r2.stars) <=1
 RETURN AVG(r3.stars)
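To make the logic of the query above easier to follow, here is a minimal Python sketch of the same idea: find users who rated one of my movies within one star of my own rating, then average what they gave the target movie. The `ratings` dictionary and the function name are hypothetical stand-ins for the graph, and the sample data is invented.

```python
from typing import Dict, Optional

Ratings = Dict[str, Dict[str, int]]  # user -> {movie title -> stars}


def predict_rating(ratings: Ratings, me: str, target: str) -> Optional[float]:
    """Average the target's stars over every (shared movie, user) match.

    Like the Cypher query, a user who agrees with me on several movies
    contributes once per agreeing movie.
    """
    mine = ratings[me]
    samples = []
    for user, theirs in ratings.items():
        if user == me or target not in theirs:
            continue
        for movie, my_stars in mine.items():
            if movie == target or movie not in theirs:
                continue
            if abs(my_stars - theirs[movie]) <= 1:
                samples.append(theirs[target])
    return sum(samples) / len(samples) if samples else None


if __name__ == "__main__":
    demo = {
        "me": {"Toy Story": 5, "Bambi": 4},
        "ann": {"Toy Story": 5, "Bambi": 2, "101 Dalmatians": 4},
        "bob": {"Toy Story": 2, "101 Dalmatians": 1},
        "cat": {"Bambi": 5, "101 Dalmatians": 5},
    }
    print(predict_rating(demo, "me", "101 Dalmatians"))
```

In the demo data, "ann" agrees with me on Toy Story and "cat" on Bambi, so their ratings of 4 and 5 are averaged; "bob" disagrees and is ignored.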
  74. 74. Modeling “Normal” Behavior • Predict Edges • Predict Node Attributes • Predict Edge Attributes • Clustering and Community Detection
  75. 75. Predict a Star Rating purely on Demographics MATCH (u:User)-[r:RATED]->(m:Movie {title:"Toy Story"})
 WHERE u.age = 1 AND u.gender = "F" 
 RETURN AVG(r.stars)
  76. 76. Modeling “Normal” Behavior • Predict Edges • Predict Node Attributes • Predict Edge Attributes • Clustering and Community Detection • Fraud Detection
  77. 77. First-Party Fraud
  78. 78. First-Party Fraud • Fraudster’s aim: apply for lines of credit, act normally, extend credit, then…run off with it • Fabricate a network of synthetic IDs, aggregate smaller lines of credit into substantial value • Often a hidden problem since only banks are hit • Whereas third-party fraud involves customers whose identities are stolen • More on that later…
  79. 79. So what? • Tens of billions of dollars lost by banks every year • 25% of the total consumer credit write-offs in the USA • Around 20% of unsecured bad debt in E.U. and N.A. is misclassified • In reality it is first-party fraud
  80. 80. Fraud Ring
  81. 81. Then the fraud happens… • Revolving doors strategy • Money moves from account to account to provide legitimate transaction history • Banks duly increase credit lines • Observed responsible credit behavior • Fraudsters max out all lines of credit and then bust out
  82. 82. … and the Bank loses • Collections process ensues • Real addresses are visited • Fraudsters deny all knowledge of synthetic IDs • Bank writes off debt • Two fraudsters can easily rack up $80k • Well organized crime rings can rack up many times that
  83. 83. Probable Cohabiters Query MATCH (p1:Person)-[:HOLDS|LIVES_AT]->()
 <-[:HOLDS|LIVES_AT]-(p2:Person) WHERE p1 <> p2 RETURN DISTINCT p1
  84. 84. Probably Non-Fraudulent Cohabiters
  85. 85. Probable Cohabiters Query MATCH (p1:Person)-[:HOLDS|LIVES_AT*]->()
 <-[:HOLDS|LIVES_AT*]-(p2:Person) WHERE p1 <> p2 RETURN DISTINCT p1 The Star (*) means keep going.
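One way to picture what "keep going" finds is to group people who are linked, directly or through a chain of other people, by shared identifiers such as an address or an account. The Python sketch below is a related illustration rather than a translation of the query: the `holdings` dictionary, the identifier strings and the function name are all assumptions made for the example.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def linked_groups(holdings: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group people who are transitively connected by shared identifiers."""
    by_identifier: Dict[str, Set[str]] = defaultdict(set)
    for person, identifiers in holdings.items():
        for identifier in identifiers:
            by_identifier[identifier].add(person)

    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for person in holdings:
        if person in seen:
            continue
        # Breadth-first walk: person -> identifier -> other person -> ...
        group: Set[str] = set()
        queue = deque([person])
        seen.add(person)
        while queue:
            current = queue.popleft()
            group.add(current)
            for identifier in holdings[current]:
                for other in by_identifier[identifier]:
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
        if len(group) > 1:  # a lone person is not a chain
            groups.append(group)
    return groups


if __name__ == "__main__":
    demo = {
        "p1": {"addr:1", "phone:1"},
        "p2": {"addr:1", "ssn:2"},
        "p3": {"ssn:2"},
        "p4": {"addr:9"},
    }
    print(linked_groups(demo))
```

Here p1 and p3 share nothing directly, yet both land in the same group because each shares an identifier with p2. A group is only a lead for investigation, since real cohabiters produce the same shape.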
  86. 86. Dodgy-Looking Chain
  87. 87. How does Neo4j fit with traditional fraud prevention? http://www.gartner.com/newsroom/id/1695014 Gartner’s Layered Fraud Prevention Approach
  88. 88. Two Sides of the Same Coin Recommendations • Add the relationship that does not exist Fraud Detection • Find the relationships that should not exist
  89. 89. Modeling User Behavior • Modeling normal users and detecting anomalies are two sides of understanding user behavior
  90. 90. Recommendation Engines
  91. 91. Hello World Recommendation
  92. 92. Hello World Recommendation
  93. 93. Movie Data Model
  94. 94. Cypher Query: Movie Recommendation. What are the Top 25 Movies • that I haven't seen • with the same genres as Toy Story • given high ratings • by women under 35 who liked Toy Story MATCH (watched:Movie {title:"Toy Story"})<-[r1:RATED]-(p2)-[r2:RATED]->(unseen:Movie)
 WHERE r1.rating > 7 AND r2.rating > 7
 AND p2.gender = "female" AND p2.age < 35
 AND watched.genres = unseen.genres
 AND NOT EXISTS { MATCH (p:Person)-[:RATED|WATCHED]->(unseen)
 WHERE p.username IN ["maxdemarzi","janedoe","jamesdean"] }
 RETURN unseen.title, COUNT(*)
 ORDER BY COUNT(*) DESC
 LIMIT 25
  95. 95. Let’s try k-nearest neighbors (k-NN) Cosine Similarity
  96. 96. Cypher Query: Ratings of Two Users. What are the Movies these 2 users have both rated? MATCH (p1:Person {name:'Michael Sherman'})-[r1:RATED]->(m:Movie),
 (p2:Person {name:'Michael Hunger'})-[r2:RATED]->(m:Movie)
 RETURN m.name AS Movie,
 r1.rating AS `M. Sherman's Rating`, r2.rating AS `M. Hunger's Rating`
  97. 97. Cypher Query: Ratings of Two Users Calculating Cosine Similarity
  98. 98. Cypher Query: Cosine Similarity. Calculate it for all Person nodes with at least one Movie between them. MATCH (p1:Person)-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:Person)
 WITH SUM(x.rating * y.rating) AS xyDotProduct,
 SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.rating) | xDot + a^2)) AS xLength,
 SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.rating) | yDot + b^2)) AS yLength,
 p1, p2
 MERGE (p1)-[s:SIMILARITY]-(p2)
 SET s.similarity = xyDotProduct / (xLength * yLength)
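The arithmetic inside that query is easier to verify outside the database. The sketch below computes the same cosine similarity for two users over the movies they have both rated; the dictionaries and the function name are illustrative assumptions, not part of the deck.

```python
import math
from typing import Dict, Optional


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> Optional[float]:
    """Cosine of the angle between two rating vectors, over co-rated movies only."""
    shared = a.keys() & b.keys()
    if not shared:
        return None  # no movie between them, so no similarity is defined
    dot = sum(a[m] * b[m] for m in shared)
    a_length = math.sqrt(sum(a[m] ** 2 for m in shared))
    b_length = math.sqrt(sum(b[m] ** 2 for m in shared))
    if a_length == 0 or b_length == 0:
        return None  # all-zero ratings have no direction
    return dot / (a_length * b_length)


if __name__ == "__main__":
    sherman = {"Alien": 8, "Heat": 6}
    hunger = {"Alien": 6, "Heat": 8, "Up": 9}
    print(cosine_similarity(sherman, hunger))
```

Only the co-rated movies enter the vectors, which matches the query: its MATCH pattern only produces rows for movies that both people rated.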
  99. 99. Movie Data Model
  100. 100. Cypher Query: k-NN Recommendation. What are the Top 25 Movies • that Zoltan Varju has not seen • using the average rating • by his top 3 neighbors MATCH (m:Movie)<-[r:RATED]-(b:Person)-[s:SIMILARITY]-(p:Person {name:'Zoltan Varju'})
 WHERE NOT( (p)-[:RATED|WATCHED]->(m) )
 WITH m, s.similarity AS similarity, r.rating AS rating
 ORDER BY m.name, similarity DESC
 WITH m.name AS movie, COLLECT(rating)[0..3] AS ratings
 WITH movie, REDUCE(s = 0, i IN ratings | s + i)*1.0 / SIZE(ratings) AS recommendation
 ORDER BY recommendation DESC
 RETURN movie, recommendation
 LIMIT 25
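The same k-NN steps can be sketched in plain Python to show the order of operations: collect neighbors' ratings per unseen movie, keep the ratings of the k most similar neighbors, average them, then rank. The input dictionaries and the function name are hypothetical; `similarity` stands in for the SIMILARITY relationships of one user.

```python
from typing import Dict, List, Tuple


def knn_recommend(ratings: Dict[str, Dict[str, float]],
                  similarity: Dict[str, float],
                  user: str, k: int = 3, limit: int = 25) -> List[Tuple[str, float]]:
    """Rank unseen movies by the mean rating of the k most similar neighbors."""
    seen = ratings.get(user, {})
    by_movie: Dict[str, List[Tuple[float, float]]] = {}
    for neighbor, sim in similarity.items():
        for movie, rating in ratings.get(neighbor, {}).items():
            if movie not in seen:
                by_movie.setdefault(movie, []).append((sim, rating))

    scored = []
    for movie, pairs in by_movie.items():
        pairs.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
        top = [rating for _, rating in pairs[:k]]
        scored.append((movie, sum(top) / len(top)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


if __name__ == "__main__":
    demo_ratings = {
        "zoltan": {"Heat": 9},
        "a": {"Heat": 8, "Up": 10, "Cars": 4},
        "b": {"Up": 8, "Cars": 6},
        "c": {"Up": 6},
        "d": {"Up": 1},
    }
    demo_similarity = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
    print(knn_recommend(demo_ratings, demo_similarity, "zoltan"))
```

In the demo, the least similar neighbor "d" is dropped from the average for "Up" because only the top three neighbors per movie count.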
  101. 101. Graph Algorithms
  102. 102. Centralities • PageRank • ArticleRank • Betweenness Centrality (a) • Closeness Centrality (b) • Harmonic Centrality (e) • Eigenvector Centrality (c) • Degree Centrality (d)
  103. 103. Community Detection • Louvain • Label Propagation • Connected Components • Strongly Connected Components • Triangle Counting/Clustering Coefficient • Balanced Triads
  104. 104. Louvain
  105. 105. Path Finding • Minimum Weight Spanning Tree • Shortest Path • Single Source Shortest Path • All Pairs Shortest Path • A* • Yen’s K-Shortest Paths • Random Walk
  106. 106. Similarity • Jaccard Similarity • Cosine Similarity • Pearson Similarity • Euclidean Distance • Overlap Similarity
  107. 107. Link Prediction • Adamic Adar • Common Neighbors • Preferential Attachment • Resource Allocation • Same Community • Total Neighbors
  108. 108. Common Neighbors
  109. 109. Adamic Adar • Builds on Common Neighbors • Instead of just a count, compute the sum of the inverse log of the degree of each common neighbor • See “Friends and neighbors on the Web” by Lada A. Adamic and Eytan Adar
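As a small worked example of the two link prediction scores, the sketch below computes Common Neighbors and Adamic Adar on an undirected graph stored as an adjacency dictionary. The graph, the node names and the function names are made up for illustration.

```python
import math
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # undirected: every edge appears in both sets


def common_neighbors(graph: Graph, a: str, b: str) -> Set[str]:
    """Nodes adjacent to both a and b."""
    return graph[a] & graph[b]


def adamic_adar(graph: Graph, a: str, b: str) -> float:
    """Sum of 1 / log(degree) over the common neighbors of a and b.

    A neighbor shared with few others counts for more than a hub.
    """
    return sum(1.0 / math.log(len(graph[n]))
               for n in common_neighbors(graph, a, b)
               if len(graph[n]) > 1)  # guard: log(1) would divide by zero


if __name__ == "__main__":
    demo = {
        "a": {"c", "d"},
        "b": {"c", "d"},
        "c": {"a", "b"},
        "d": {"a", "b", "e"},
        "e": {"d"},
    }
    print(len(common_neighbors(demo, "a", "b")), adamic_adar(demo, "a", "b"))
```

Both c and d are common neighbors of a and b, but c has degree 2 and d has degree 3, so c contributes 1/ln 2 and d the smaller 1/ln 3.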
  110. 110. Sub Graph Features
  111. 111. Ego-net Patterns
  112. 112. Ego-net Patterns Ni: number of neighbors of ego i Ei: number of edges in egonet i Wi: total weight of egonet i λw,i: principal eigenvalue of the weighted adjacency matrix of egonet i
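The first two egonet features are simple to compute by hand. The sketch below counts Ni and Ei for an unweighted, undirected graph held in an adjacency dictionary; the weighted features Wi and λw,i are left out, and the graph and function name are assumptions made for the example.

```python
from typing import Dict, Set, Tuple

Graph = Dict[str, Set[str]]  # undirected: every edge appears in both sets


def egonet_features(graph: Graph, ego: str) -> Tuple[int, int]:
    """Return (Ni, Ei): neighbor count and edge count of the ego's egonet.

    The egonet is the ego, its neighbors, and every edge among them.
    """
    neighbors = graph[ego]
    members = neighbors | {ego}
    edges = set()
    for u in members:
        for v in graph[u]:
            if v in members and u != v:
                edges.add(frozenset((u, v)))  # frozenset: count each edge once
    return len(neighbors), len(edges)


if __name__ == "__main__":
    demo = {
        "x": {"a", "b", "c"},
        "a": {"x", "b"},
        "b": {"x", "a"},
        "c": {"x", "d"},
        "d": {"c"},
    }
    print(egonet_features(demo, "x"))
```

For intuition about how Ei relates to Ni: in a star the ego's neighbors share no edges, so Ei equals Ni, while in a clique every pair of members is connected, so Ei grows roughly with the square of Ni.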
  113. 113. Power Law Density slope=2 slope=1 slope=1.35
  114. 114. Power Law Weight
  115. 115. Power Law Eigenvalue
  116. 116. Why? Why are we doing all this?
  117. 117. Extract Features from Graph • One of the first steps in machine learning from graphs is to extract graph features. Structure of Data >> Data
  118. 118. Deep Neural Networks for Bank Fraud (2015) https://www.youtube.com/watch?v=TAer-PeIypI Fraud Detection starts about half-way (after intro)
  119. 119. What else?
  120. 120. Graph Sage http://snap.stanford.edu/graphsage/
  121. 121. Link Prediction using SEAL: Link Prediction Based on Graph Neural Networks
  122. 122. Motifs Link Prediction via Higher-Order Motif Features
  123. 123. Motifs Link Prediction via Higher-Order Motif Features
  124. 124. Knowledge Graphs Explainable Reasoning over Knowledge Graphs for Recommendation
  125. 125. Knowledge Graphs Explainable Reasoning over Knowledge Graphs for Recommendation
  126. 126. Knowledge Graphs Explainable Reasoning over Knowledge Graphs for Recommendation
  127. 127. Indirect Relationships
  128. 128. Connecting unconnected Things indirectly
  129. 129. What are the Top 10 Jobs for me • that are in the same location I’m in • for which I have the necessary qualifications
  130. 130. Partial Subgraph Search
  131. 131. Graph Search
  132. 132. Multiple Dimensions • Age • Size • Features • Property • Cost • Don’t use SOLR Facets for this!
  133. 133. Multiple Dimensions: Java Audio Book! “Left parentheses, n, right parentheses, semi-colon!” What about Publisher? What about Author? What about Publication Year? What about Java Version? What about…
  134. 134. Bucket or Group Values if you have to Discrete Values for Each Dimension
  135. 135. Nodes for Discrete Dimensional Values Dimensional Model *Use Named Relationship Types instead of HAS
  136. 136. Catalogs
  137. 137. Who remembers this? Stupid Glasses • Loud Pants • Skate Boards • Neon Colors
  138. 138. Look at how thick they were, even back in 1902! It’s a Sears Catalogue!
  139. 139. Ares Predator Street Samurai Catalog
  140. 140. With free two day shipping! Cypher Version of the Catalog
  141. 141. A tree is a simple graph A Tree of Data
  142. 142. So fast, it’s not even funny. Promotions About 2-4M Traversals per second per core Traversing a 50 level Tree UP costs practically nothing.
  143. 143. and many more use cases!
  144. 144. Thank You!
