06 neo4j gds graph algorithms

Published on

2-Day GDS Training, 20-21 January 2021

Published in: Data & Analytics

  1. 1. ● Similarity ● Path finding ● Link prediction ● Convenience & Misfits ● Workflow
  2. 2. We are still on the gameofthrones database. You can either run the following guide inside the Neo4j Browser with :play http://neo4jguides.tomgeudens.io/gdsalgorithms.html (note that this requires a neo4j.conf setting to whitelist the host), or open a regular browser session, go to https://bit.ly/neo4j-gds-algorithms and cut-and-paste the commands from there.
  3. 3. Algorithms that evaluate how alike nodes are at an individual level either based on node attributes, neighbouring nodes or relationship properties. And also the single most used way to turn bipartite graphs into monopartite graphs.
  4. 4. Supported: gds.nodeSimilarity. In development: gds.alpha.similarity.cosine, gds.alpha.similarity.euclidean, gds.alpha.similarity.jaccard, gds.alpha.similarity.overlap, gds.alpha.similarity.pearson, gds.alpha.ml.ann, gds.beta.knn
  5. 5. Why? I mean … seriously, why this fixation on the Jaccard algorithm? It's all over data science in general, not just Graph Data Science. The reason is quite obvious once you see it, but do you know it? As an incentive … explain it to the class and ● I guarantee that at least 50% didn't know and you'll look like a genius ● I will then not have to tell the joke (true story) on the next slide
  6. 6. Tom: Hello dear, how was your day? Tom's Wife: Great, how was yours? Tom: Ok. I taught Graph Data Science today. What did you do? Tom's Wife: I sell yarn, Tom, you know that … I also started a sock knitting class ... Tom: <sigh/> Tom's Wife: <sigh/> Tom: <retries> So I taught them we use Jaccard similarity ... Tom's Wife: <retries> So I'm teaching the circle the Jacquard knitting pattern ...
  7. 7. Jaccard works on sets of numbers … but what is quite unique about it is that it gives relevant results if the sets are not the same size and allows for doubles (gives them weight, as it were). Check for yourself, almost all the others require you to have equally sized, deduplicated sets ... RETURN gds.alpha.similarity.jaccard([1,2,3],[1,2,2,3])
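To make the point about set sizes and doubles concrete, here is a minimal Python sketch of two ways to compute a Jaccard score. It is an illustration only: the function names are made up for this example, the convention for two empty inputs is an assumption, and nothing here claims to mirror how gds.alpha.similarity.jaccard is implemented internally.

```python
from collections import Counter


def jaccard_set(a, b):
    """Classic Jaccard: |A ∩ B| / |A ∪ B| on deduplicated sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    # Assumption: two empty inputs are treated as identical.
    return len(sa & sb) / len(union) if union else 1.0


def jaccard_multiset(a, b):
    """Multiset variant: doubles carry weight via min/max of per-item counts."""
    ca, cb = Counter(a), Counter(b)
    keys = set(ca) | set(cb)
    total = sum(max(ca[k], cb[k]) for k in keys)
    # Assumption: two empty inputs are treated as identical.
    return sum(min(ca[k], cb[k]) for k in keys) / total if total else 1.0


print(jaccard_set([1, 2, 3], [1, 2, 2, 3]))       # 1.0
print(jaccard_multiset([1, 2, 3], [1, 2, 2, 3]))  # 0.75
```

Neither input needs padding or truncating to a common length, which is the property the slide singles out. Running the Cypher one-liner shows which behaviour the library function has for the doubled 2.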
  8. 8. Throw the bipartite subgraphs in the pot (the in-memory workspace) ... CALL gds.graph.create('got-character-related-entities', ['Person', 'Book', 'House', 'Culture'], '*');
  12. 12. CALL gds.nodeSimilarity.stream('got-character-related-entities', {degreeCutoff: 20, similarityCutoff: 0.45}) YIELD node1, node2, similarity RETURN gds.util.asNode(node1).name AS character1, gds.util.asNode(node2).name AS character2, similarity ORDER BY similarity DESC; Bring to room temperature (around 20 degrees Celsius)
  13. 13. I'll let you in on a little secret. Jaccard similarity is slow. I mean, it's a simple enough calculation, but depending on the approach you are comparing the set of numbers for a given node against the sets of numbers of all other nodes: in the worst case an O(n²) problem.
  14. 14. Given that you'd expect most nodes not to be similar, there are two approaches to improve things ● Reduce the number of computations. degreeCutoff falls under that approach: we treat nodes with a small degree (less than 20 in the example) as not relevant. ● Reduce the output. Most of what comes back is noise. The similarityCutoff is a threshold between 0 and 1 (read it as a percentage) that indicates what similarity actually means for us. In the example two Persons need to be at least 45% similar or the pair is not relevant at all.
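The two approaches can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `node_similarity`, its parameter names and the toy data are invented for this example, each unordered pair is listed once, and the real procedure is certainly smarter about how it prunes comparisons.

```python
from itertools import combinations


def node_similarity(neighbours, degree_cutoff=1, similarity_cutoff=0.0):
    """Pairwise Jaccard over neighbour sets, with two pruning knobs.

    neighbours: dict mapping a node to the set of nodes it points at.
    """
    # Fewer computations: drop low-degree nodes before any comparison.
    candidates = {n: s for n, s in neighbours.items() if len(s) >= degree_cutoff}
    results = []
    for a, b in combinations(sorted(candidates), 2):
        sa, sb = candidates[a], candidates[b]
        union = sa | sb
        if not union:
            continue  # two nodes without neighbours have nothing to compare
        score = len(sa & sb) / len(union)
        # Less output: keep only pairs that are similar enough.
        if score >= similarity_cutoff:
            results.append((a, b, score))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical toy data, not taken from the gameofthrones database.
toy = {
    "p1": {"house_a", "culture_a", "book_1"},
    "p2": {"house_a", "culture_a", "book_1", "book_2"},
    "p3": {"house_b", "book_1"},
    "p4": {"house_a"},
}
print(node_similarity(toy, degree_cutoff=2, similarity_cutoff=0.45))
# [('p1', 'p2', 0.75)]
```

With the degree cutoff at 2, p4 never enters a comparison, so three pairs are scored instead of six. The similarity cutoff then discards the two weak pairs from the output.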
  15. 15. Yes, of course, and that is asked so often that there are APOC convenience functions for it ... MATCH (p:Person {name: "Gregor Clegane"}) RETURN apoc.node.degree(p), apoc.node.degree.in(p), apoc.node.degree.out(p);
  16. 16. It's a data scientist thing, knowing what topN (global limit) and topK (limit per node) are, but both are much more efficient than a regular LIMIT (which is what they are). CALL gds.nodeSimilarity.stream('got-character-related-entities', {degreeCutoff: 20, topN: 10, topK: 1}) YIELD node1, node2, similarity RETURN gds.util.asNode(node1).name AS character1, gds.util.asNode(node2).name AS character2, similarity ORDER BY similarity DESC; Note that topK applies to a node as source, not to it appearing as target. Check out Loras Tyrell in these and the previous results.
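A small Python sketch of the difference between the two limits, assuming the similarity results arrive as (source, target, score) rows. The helper names and the scores are hypothetical, and applying the per-node limit before the global one is an assumption of this sketch rather than a statement about the library.

```python
import heapq
from collections import defaultdict


def top_k_per_node(pairs, k):
    """Keep the k best-scoring targets for every source node."""
    by_source = defaultdict(list)
    for source, target, score in pairs:
        by_source[source].append((source, target, score))
    kept = []
    for rows in by_source.values():
        kept.extend(heapq.nlargest(k, rows, key=lambda r: r[2]))
    return kept


def top_n_overall(pairs, n):
    """Keep the n best-scoring pairs across the whole result."""
    return heapq.nlargest(n, pairs, key=lambda r: r[2])


# Hypothetical directed scores (source, target, similarity).
pairs = [
    ("A", "B", 0.9), ("B", "A", 0.9),
    ("A", "C", 0.8), ("C", "A", 0.8),
    ("B", "C", 0.5), ("C", "B", 0.5),
]
per_node = top_k_per_node(pairs, k=1)
best = top_n_overall(per_node, n=2)
print(per_node)
print(best)
```

In `per_node` every node shows up exactly once as a source, yet A appears twice as a target (from B and from C). That is the same effect the note about Loras Tyrell points at.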
  17. 17. Enough streaming and checking done … we'll do it for real now and write back the result as a new relationship (SIMILARITY) to the graph. First we'll clean up any that are already there though: MATCH (:Person)-[s:SIMILARITY]->(:Person) DELETE s;
  18. 18. And showtime ... CALL gds.nodeSimilarity.write('got-character-related-entities', {degreeCutoff: 20, topN: 10, topK: 1, writeRelationshipType: 'SIMILARITY', writeProperty: 'character_similarity'}); Quick check: MATCH sg=(:Person {name: "Gregor Clegane"})-[:SIMILARITY]-() RETURN sg;
  19. 19. ● Most of the similarity algorithms are not very graphy … and nobody said they were ● They are a key component in turning bipartite subgraphs into monopartite subgraphs, making them a typical pre-analytics step ● Do not underestimate the immediate (in real time) value a SIMILARITY relationship can bring though! ● Use the cutoff options wisely. Raising them (especially similarityCutoff) can massively increase the relevance and massively decrease the runtime!
  20. 20. Algorithms that identify an optimal route through your network. It doesn't get more graphy than these (after all, the problem that started the whole of Graph Theory was a path finding problem) but they are not often associated with analytics.
  21. 21. ● A* ● Breadth First Search ● Depth First Search ● Shortest Path ● Yen's K-shortest paths
  22. 22. ● Minimum Weight Spanning Tree ● Single Source Shortest Path
  23. 23. ● All Pairs Shortest Path To distract you from the fact that I have no image for this algorithm … ain't she cute?
  24. 24. ● Random Walk
  25. 25. ● Which is why I'm not listing their names: some of them are in fact several algorithms in one piece of code and will definitely be refactored sometime soon. ● It's not that these aren't well understood … in fact … they are amongst the best understood algorithms on the planet. And key pieces of what they do can be found in other algorithms (take the centrality algorithms). People forget that the official GDS library has existed for less than a year and the official engineering team for it has been in place for less than 18 months. They first picked what made money already … can you blame them?
  26. 26. All connections are equal ... MATCH (start:Person{name:'Gormon Tyrell'}) MATCH (end:Person{name:'Manfrey Martell'}) CALL gds.alpha.shortestPath.stream({ nodeProjection: 'Person', relationshipProjection: { INTERACTS_SEASON4: { type: 'INTERACTS_4', orientation: 'UNDIRECTED'}}, startNode: start, endNode: end}) YIELD nodeId, cost RETURN gds.util.asNode(nodeId).name AS name, cost;
  27. 27. But some are more equal than others ... MATCH (start:Person{name:'Gormon Tyrell'}) MATCH (end:Person{name:'Manfrey Martell'}) CALL gds.alpha.shortestPath.stream({ nodeProjection: 'Person', relationshipProjection: { INTERACTS_SEASON4: { type: 'INTERACTS_4', properties: 'weight', orientation: 'UNDIRECTED'}}, startNode: start, endNode: end, relationshipWeightProperty: 'weight'}) YIELD nodeId, cost RETURN gds.util.asNode(nodeId).name AS name, cost;
  28. 28. ● Very atypical for analytics (and that's because these aren't really analytical) is the addition of a starting point and an ending point. ● Dijkstra's algorithm - which is what is used here - computes a weighted shortest path. If no weight is given we solve it by giving each hop an equal weight of one. ● Playing around with these weights is the trick here. Note that the weight I used did not actually represent a cost. And while the Dijkstra algorithm cannot use negative costs … fractions are totally fine ...
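For readers who have not looked at Dijkstra's algorithm in a while, here is a compact Python sketch of the idea described above: a start node, an end node, non-negative weights, and a weight of one per hop when none is given. The graph, the helper names and the weights are hypothetical, and this is a textbook version, not the code behind gds.alpha.shortestPath.

```python
import heapq


def undirected(edges, weighted=True):
    """Build an adjacency dict; without weights every hop costs 1."""
    graph = {}
    for a, b, w in edges:
        cost = w if weighted else 1.0
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    return graph


def dijkstra(graph, start, end):
    """Return (cost, path) of the cheapest route, or (inf, []) if unreachable.

    graph: dict node -> list of (neighbour, weight); weights must be >= 0.
    """
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == end:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weight in graph.get(node, []):
            if neighbour not in settled:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return float("inf"), []


# Hypothetical edges; fractional weights are fine, negative ones are not.
edges = [("A", "B", 0.5), ("B", "C", 0.5), ("A", "C", 2.0)]
print(dijkstra(undirected(edges, weighted=False), "A", "C"))  # (1.0, ['A', 'C'])
print(dijkstra(undirected(edges), "A", "C"))                  # (1.0, ['A', 'B', 'C'])
```

The same pair of nodes gets a different route once the weights are taken into account, which is the contrast between the two shortest path slides.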
  29. 29. Show me the way to go home ... MATCH (start:Person{name:'Gormon Tyrell'}) MATCH (end:Person{name:'Manfrey Martell'}) CALL gds.alpha.kShortestPaths.stream({ nodeProjection: 'Person', relationshipProjection: { INTERACTS_SEASON4: {type: 'INTERACTS_4', properties: 'weight', orientation: 'UNDIRECTED'}}, startNode: start, endNode: end, k: 3, path: true, relationshipWeightProperty: 'weight'}) YIELD path RETURN path;
  30. 30. ● I doubt you are still carrying a physical map around in your car. The map can't adapt the weights. Your satnav or phone can. ● Nobody cares if you get stuck somewhere and lose some heartbeats in frustration, but if a cargo plane, train, ship, truck does … it's losing money and people care. ● ...
  31. 31. ● These are as graphy as they come, but not really what most of us would call analytical. ● They underpin a lot of the analytical algorithms. ● Currently in alpha stage, still using the Anonymous Graph … be aware that both of these will change ● The economic value is undeniably huge.
  32. 32. Algorithms that, for a given pair of nodes, provide a measure of proximity based on the graph topology. In simpler words … functions that return a score indicating whether two given nodes ● will develop a connection ● should be connected ● are possibly missing a connection
  33. 33. The moment the word prediction came on screen, your level of attention immediately rose. Did I predict that correctly? It's like putting the word AI on screen. So allow me to apply the necessary medicine ... There … feeling all better now? Great!
  34. 34. Question is … can such a new/missing link in a network be predicted? Let's see what we have in the box first ...
  35. 35. ● Adamic Adar ● Common Neighbours ● Preferential Attachment ● Resource Allocation ● Same Community ● Total Neighbours
  36. 36. ● All of them are currently in alpha. ● All of them are (user defined) functions: they take two nodes and the topology and return a score. ● They are all variations on the same theme … they compute a similarity score based on the topology around the two nodes. This is, however, not a percentage but a raw score that must be compared to other such scores over variations of the topology (say over a period of time) so that a trend can be established.
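To show what "a raw score based on the topology around the two nodes" looks like, here is a Python sketch using the usual textbook definitions of five of these measures. The function name, the dictionary keys and the two toy snapshots are hypothetical. The gds.alpha.linkprediction functions are not used here and may differ in details such as direction handling.

```python
import math


def link_scores(adj, x, y):
    """Topology-only scores for the node pair (x, y).

    adj: dict node -> set of neighbours (undirected graph, x != y).
    """
    nx, ny = adj[x], adj[y]
    common = nx & ny
    return {
        "common_neighbours": len(common),
        "total_neighbours": len(nx | ny),
        "preferential_attachment": len(nx) * len(ny),
        # A common neighbour touches both x and y, so its degree is at
        # least 2 and the logarithm below is never zero.
        "adamic_adar": sum(1 / math.log(len(adj[z])) for z in common),
        "resource_allocation": sum(1 / len(adj[z]) for z in common),
    }


# Two hypothetical snapshots of the same pair (x, y) at different times.
early = {"x": {"a"}, "y": {"b"}, "a": {"x"}, "b": {"y"}}
late = {"x": {"a", "c"}, "y": {"b", "c"}, "a": {"x"}, "b": {"y"}, "c": {"x", "y"}}

for name, snapshot in (("early", early), ("late", late)):
    print(name, link_scores(snapshot, "x", "y"))
```

None of the numbers is a percentage. What carries meaning is the movement from the early to the late snapshot, for example common neighbours going from 0 to 1 and preferential attachment from 1 to 4.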
  37. 37. Over the course of the series (both book and tv) Jon and Daenerys go from not knowing each other at all to growing closer; it is one of the main story arcs. We expect to see this very evolution in the link prediction scores.
  38. 38. Data Science Solution Architect phani.dathar@neo4j.com https://www.linkedin.com/in/gopi-dathar/ Phani created all of the materials (the full two days, originally even five days). I've worked with him in the field (on a single blood cell analysis use case) and he's one of those people who remove the airquotes from around data science and put accolades in place. A true master. He loves Game Of Thrones too much though ...
  39. 39. CALL db.relationshipTypes() YIELD relationshipType as book WITH book WHERE book STARTS WITH 'INTERACTS_' MATCH (n1:Person{name:'Daenerys Targaryen'}) MATCH (n2:Person{name:'Jon Snow'}) RETURN gds.alpha.linkprediction.commonNeighbors(n1, n2, {relationshipQuery:book}) AS cn_score, gds.alpha.linkprediction.preferentialAttachment(n1, n2, {relationshipQuery:book}) AS pa_score, gds.alpha.linkprediction.adamicAdar(n1, n2, {relationshipQuery:book}) AS aa_score, gds.alpha.linkprediction.resourceAllocation(n1, n2, {relationshipQuery:book}) AS ra_score, gds.alpha.linkprediction.totalNeighbors(n1, n2, {relationshipQuery:book}) AS tn_score, book
  40. 40. In the internal video of one of these trainings Phani runs the different algorithms one by one and is very surprised by some of the results (which even contradict the expectation, as you can see for yourself). He struggles with it, comes close to dismissing the data as false (it is in many places actually) … but then the SCIENTIST inside kicks in ...
  41. 41. ● The scores are based on the topology, not on what happens in our head when we read or watch. It's clear from this data that the seeds for their relationship were there much much earlier than you would expect. ● We asked a leading/loaded question. That was a mistake! ● Both data and domain determine the best algorithms to use in a specific case. Proper experiments will show the way … it's not magic/it's not a magic pill. Also, what the Flamingo happened in book Four?
  42. 42. ● Currently in alpha stage, the functions still only run on the actual graph, not an in-memory projection! ● Definitely worth exploring, but these require the highest level of scrutiny and understanding (both of what they do and of what your data is like) ● This is what underpins things like churn prediction, disambiguation, … the potential economic value is huge.
  43. 43. (sings) I believe I can fly … I believe I can touch the sky
  44. 44. These don't fit anywhere but are worth a look … ● gds.beta.graph.generate - generates a random in-memory graph for simulation (memory estimation mostly) purposes ● gds.alpha.ml.oneHotEncoding - turns data into vectors
  45. 45. You actually used a few of these already, mainly necessary to access the in-memory graph. Pre-processing: ● gds.util.NaN ● gds.util.isFinite, gds.util.isInfinite ● gds.util.infinity
  46. 46. You actually used a few of these already, mainly necessary to access the in-memory graph. Post-processing: ● gds.util.asNode ● gds.util.asNodes ● gds.util.asPath (warning! this does not return a real path!)
  47. 47. You actually used a few of these already, mainly necessary to access the in-memory graph. Helpers: ● gds.graph.exists ● gds.version
