Optimized Graph
Algorithms in Neo4j
Use the Power of Connections to Drive Discovery
January 2018
Mark Needham
Amy Hodler
Mark Needham
Software Engineer, Neo4j
mark.needham@neo4j.com
@markhneedham
Next 50 Minutes
• Why Use Graph Analytics
• Randomness vs. Reality
• Graph Analytics Takes Off
• How to Run Graph Analytics
• Neo4j Graph Analytics and Algorithms
• Demos and Implementation
Graph
Algorithms
Real-World
Networks
Amy E. Hodler
Analytics Marketing, Neo4j
amy.hodler@neo4j.com
@amyhodler
Understand. Predict. Prescribe.
Forecast Complex Network Behavior
and Prescribe Action
Cascading Failures
Airline Congestion - 2010
Source: “Systemic delay propagation in the US airport network” – Fleurquin, Ramasco, Eguiluz
Planning and Least
Cost Routing
Bridge Points
Languages – Telecom Network
Source: “Fast unfolding of communities in large networks” – Blondel, Guillaume, Lambiotte, Lefebvre
Extract Structure and Model Processes
Real Networks Aren’t Random
Preferential
Attachment
Nodes tend to link to nodes
that already have a lot of links
Origins Debated
• Local Mechanisms
• Global Optimization
• Mixed or Other
Network Structures are Inseparable from Development
Concentrated
Distribution
Source: “How Stuff Spreads” – Pulsar Platform
NodeswithkLinks
Number of links (k)
Many nodes with only
a few links
A few hubs with a
large number of links
Power Law Distribution
“There is No Network in Nature that we
know of that would be described by the
Random network model.”
- Albert-László Barabási
Small-World
High local clustering
and short average path
lengths. Hub and spoke
architecture.
Scale-Free
Hub and spoke
architecture preserved
at multiple scales. High
power law distribution.
Random
Average distributions.
No structure or
hierarchical patterns.
Reality
The Lure of Averages
Source: Network Science - Barabasi
Art: Ulysses and the Sirens – Herbert James Draper
NodeswithkLinks
Number of Links (k)
Average Distribution
- Random -
Most nodes have the
same number of links
No highly
connected nodes
Resist The Lure
of AveragesNodeswithkLinks
Number of Links (k)
Average Distribution
- Random -
Most nodes have the
same number of links
No highly
connected nodes
NodeswithkLinks
Number of links (k)
Power Law Distribution
- Scale-Free -
Many nodes with only
a few links
A few hubs with a
large number of links
Source: Network Science - Barabasi
Resist The Lure
of AveragesNodeswithkLinks
Number of Links (k)
Average Distribution
- Random -
Art: Ulysses and the Sirens – Herbert James Draper
Most nodes have the
same number of links
No highly
connected nodes
You’ll Miss the Structure
Hidden in Your Networks
- Scale-Free -
- Small World -
Source: Network Science - Barabasi
Graph Analytics
Takes Off
#Finally!
Leonhard Euler 1707-1783
Critical Mass
• Collect, share and analyze
massive connected data
• Discovered common
principles and structures
• Existing mathematical tools
• Unfulfilled promises of
big data
Insights from Algorithms
Insights from Algorithms
Graph Algorithms
• Metrics
• Relevance
• Clustering
• Structural Insights
Machine Learning
• Classification, Regression
• NLP, Structural/Content
Predictions
• Neural Networks as Graphs
• Graph As Compute Fabric
Structures Can Hide
Source: “Communities, modules and large-scale structure in networks“ - Mark Newman
Source: “Hierarchical structure and the prediction of missing links in networks”; ”Structure and
inference in annotated networks” - A. Clauset, C. Moore, and M.E.J. Newman.
Graph of Thrones
A. Beveridge: GoT - Interaction Graph from Books
Graph of Thrones
A. Beveridge: GoT - Interaction Graph from Books
How to Run
Graph Analytics?
Existing Options (so far)
•Data Processing
•Spark with GraphX, Flink with Gelly
•Dedicated Graph Processing
• Urika, GraphLab, Giraph, Mosaic, GPS, Signal-Collect,
Gradoop
•Data Scientist Toolkit
• igraph, NetworkX, Boost(graph-tool) in Python, R, C
Drawbacks
• Manage several tools
• Selection -> learning ->
installation -> operation
• Data selection, projection and
transfer
• Tedious and time consuming
• Scalability
• Especially classic data
science tools
An Example
From Past GraphConnect
Source: John Swain - Twitter Analytics Right Relevance Talk
Many Moving Parts!
Example Workflow Pipeline
Twitter
Streaming API
Python Tweet
Collection
(includes user
data)
Rabbit
MQ
MongoDB
Neo4j
R Scripts
-Graph Stats
-Community
Detection
MySQL
Graph
.graphml
Tableau
Graph
Visualization
Moved from Twitter
Search API to
Streaming API
Replaced Python
Twitter libraries
(Tweepy) with raw API
calls
Streaming tweets in message queue
Full tweets and user data stored in
MongoDB
Built graph for analysis in Neo4j from
tweets persisted in MongoDB
Analysis in R
iGraph libraries for
algorithms
Some text analysis e.g.
LDA topics
Results published in MySQL
for Tableau
Graphml for import to Gephi
with stats precalculated
Our Goal
Twitter
Streaming API
Python Tweet
Collection
(includes user
data)
Rabbit
MQ
MongoDB
Neo4j
R Scripts
-Graph Stats
-Community
Detection
MySQL
Graph
.graphml
Tableau
Graph
Visualization
Example Workflow Pipeline
Neo4j Graph Analytics
and Algorithms
Neo4j
Native Graph
Database
Analytics
Integrations
Cypher Query
Language
Wide Range of
APOC Procedures
Optimized
Graph Algorithms
Finds the optimal path
or evaluates route
availability and quality
Evaluates how a
group is clustered
or partitioned
Determines the
importance of distinct
nodes in the network
1. Call as Cypher procedure
2. Pass in specification (Label, Prop, Query) and configuration
3. ~.stream variant returns (a lot) of results
CALL algo.<name>.stream('Label','TYPE',{conf})
YIELD nodeId, score
4. non-stream variant writes results to graph returns statistics
CALL algo.<name>('Label','TYPE',{conf})
Usage
Pass in Cypher statement for node- and relationship-lists.
CALL algo.<name>(
'MATCH ... RETURN id(n)',
'MATCH (n)-->(m)
RETURN id(n) as source,
id(m) as target', {graph:'cypher'})
Cypher Projection
• PageRank (baseline)
• Betweeness
• Closeness
• Degree
Algorithms - Centralities
Pathfinding
Centrality
Community
Detection
• Label Propagation
• Union Find / WCC
• Strongly Connected Components
• Louvain
• Triangle-Count / Clustering Coefficent
Algorithms – Communitity Detection
Pathfinding
Community
Detection
Centrality
• Single Source Short Path
• All-Nodes SSP
• Parallel BFS / DFS
Algorithms - Pathfinding
Centrality Community
Detection
Pathfinding
Iterate Quickly
• Combine data from sources into one graph
• Project to relevant subgraphs
• Enrich data with algorithms
• Traverse, collect, filter aggregate with queries
• Visualize, Explore, Decide, Export
• From all APIs and Tools
Demo Time!
Datasets
Yelp Business Graph
• 5m nodes
• 17m relationships
Bitcoin
• 1.7bn nodes,
• 2.7bn rels
DBPedia
• 11m nodes
• 116m relationships
DBpedia
DBPedia
Shallow Copy of Wikipedia: (Page) -[:Link]-> (Page)
CALL algo.pageRank.stream('Page', 'Link', {iterations:5}) YIELD node, score
WITH *
ORDER BY score DESC
LIMIT 5
RETURN node.title, score;
+--------------------------------------+
| node.title | score |
+--------------------------------------+
| "United States" | 13349.2 |
| "Animal" | 6077.77 |
| "France" | 5025.61 |
| "List of sovereign states" | 4913.92 |
| "Germany" | 4662.32 |
+--------------------------------------+
5 rows 46 seconds
DBPedia – Largest Clusters
CALL algo.labelPropagation();
// First 1M pages by Rank
MATCH (n:Page)
WITH n
ORDER BY n.pagerank DESC
LIMIT 1000000
// group by partition
WITH n.partition AS partition,
count(*) AS clusterSize,
collect(n.title) AS pages
// return most influential node for largest clusters
RETURN pages[0] AS mainPage,
pages[1..10] AS otherPages
ORDER BY clusterSize DESC
LIMIT 20
Yelp
Yelp
• Business Reviews by Users
•Businesses have Categories and Locations
•Users have Friends
•Bi-partite-Network (:User)-->(:Business)
projections (:User)<-->(:User) &
(:Business)<-->(:Business)
Yelp – Social - Statistics
MATCH (u:User) where exists ( (u)-[:FRIENDS]-() )
WITH u.average_stars as stars, u.review_count as reviews, u.funny as funny
RETURN max(stars),avg(stars),stdev(stars),max(reviews),avg(reviews),stdev(reviews),max(funny),avg(funny),stdev(funny);
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| max(stars) | avg(stars) | stdev(stars) | max(reviews) | avg(reviews) | stdev(reviews) | max(funny) | avg(funny) | stdev(funny) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 5.0 | 3.8238072950764947 | 0.8862511758625753 | 11284 | 45.81704314022204 | 120.52419266925014 | 170896 | 36.26637835535585 | 731.6024752545679 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
MATCH (u:User) where exists ( (u)-[:FRIENDS]-() )
WITH u.yelping_since as since
RETURN substring(since,0,4) as year, count(*) as total
ORDER BY year asc limit 10;
+----------------+
| year | total |
+----------------+
| "2004" | 64 |
| "2005" | 844 |
| "2006" | 4504 |
| "2007" | 11833 |
| "2008" | 20729 |
| "2009" | 33965 |
| "2010" | 53046 |
| "2011" | 70331 |
| "2012" | 62596 |
| "2013" | 57330 |
+----------------+
Yelp – Social - PageRank
call algo.pageRank.stream('User','FRIENDS')
yield node,score with node,score
order by score desc limit 10
return node {.name, .review_count, .average_stars,.useful,.yelping_since,.funny},
score,
size( (node)<-[:FRIENDS]-()<-[:FRIENDS]-()) as in,
size( (node)-[:FRIENDS]->()-[:FRIENDS]->()) as out;
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| node | score |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| {funny -> 61200, name -> "Philip", average_stars -> 3.93, review_count -> 788, useful -> 69448, yelping_since -> "2007-06-09"} | 208.31336799999994 |
| {funny -> 21432, name -> "Des", average_stars -> 3.88, review_count -> 78, useful -> 140024, yelping_since -> "2014-04-01"} | 201.28600150000003 |
| {funny -> 465, name -> "Dallas", average_stars -> 4.17, review_count -> 330, useful -> 5517, yelping_since -> "2010-11-07"} | 192.164762 |
| {funny -> 1019, name -> "Cara", average_stars -> 3.96, review_count -> 842, useful -> 11738, yelping_since -> "2010-07-21"} | 184.01898249999996 |
| {funny -> 1233, name -> "Walker", average_stars -> 3.91, review_count -> 462, useful -> 12332, yelping_since -> "2007-01-25"} | 180.48898350000005 |
| {funny -> 13432, name -> "Gabi", average_stars -> 4.05, review_count -> 1730, useful -> 20759, yelping_since -> "2007-08-10"} | 163.29424850000004 |
| {funny -> 12848, name -> "Ruggy", average_stars -> 3.92, review_count -> 2118, useful -> 72265, yelping_since -> "2007-07-31"} | 161.87635500000002 |
| {funny -> 9997, name -> "Bill", average_stars -> 3.38, review_count -> 595, useful -> 12074, yelping_since -> "2014-04-05"} | 157.0438075 |
| {funny -> 1544, name -> "Ashley", average_stars -> 3.7, review_count -> 224, useful -> 1610, yelping_since -> "2009-09-29"} | 150.21423599999997 |
| {funny -> 3599, name -> "Risa", average_stars -> 4.08, review_count -> 1044, useful -> 22121, yelping_since -> "2011-07-30"} | 138.20863199999997 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows
3236 ms
Yelp
•Inferred network of users, via jointly reviewed businesses
• (u1:User)-[:WROTE]->(review1)-[:REVIEWS]->(business)<-[:REVIEWS]-(review2)<-[:WROTE]-(u2:User)
• 1,3bn paths
• Inferred network of businesses, via jointly reviewed by user
• (b1:Business)<-[:REVIEWS]-()<-[:WROTE]-(u)-[:WROTE]->()-[:REVIEWS]->(b2:Business)
• 214m paths
• subset: (b1:Business)-[:CO_OCCURENT_REVIEWS]-(b2:Business)
Yelp
•Inferred network of users, via jointly reviewed businesses
• (u1:User)-[:WROTE]->(review1)-[:REVIEWS]->(business)<-[:REVIEWS]-(review2)<-[:WROTE]-(u2:User)
• 1.3bn paths
• Inferred network of businesses, via jointly reviewed by user
• (b1:Business)<-[:REVIEWS]-()<-[:WROTE]-(u)-[:WROTE]->()-[:REVIEWS]->(b2:Business)
• 214m paths
Yelp – Business – Co-Occurrence
•Find clusters of "similar" businesses
•Find peer groups of similar people
•Clusters of "interests"
Yelp – Business – Co-Occurrence
CALL apoc.periodic.iterate(
'MATCH (b:Business)
WHERE size((b)<-[:REVIEWS]-()) > 5 AND b.city="Las Vegas"
RETURN b',
'MATCH (b)<-[:REVIEWS]-(r1)<-[:WROTE]-(u)-[:WROTE]->(r2)-[:REVIEWS]->(b2)
WHERE id(b) < id(b2) AND b2.city="Las Vegas"
AND size((b2)<-[:REVIEWS]-()) > 5
AND r1.stars = r2.stars
WITH b, b2, count(*) AS weight, avg(r1.stars) as rating where weight > 5
MERGE (b)-[cr:B2B]-(b2)
ON CREATE SET cr.weight = weight, cr.rating = rating
SET b:Marked, b2:Marked',
{batchSize: 1});
Yelp - Clustering Union Find
CALL algo.unionFind.stream(
'MATCH (b:Business:Marked) RETURN id(b) as id’,
'MATCH (b1:Business:Marked)-[r:B2B]-(b2)
RETURN id(b1) as source,
id(b2) as target,
count(r) as value',
{graph:'cypher'}) YIELD setId as cluster, nodeId
RETURN cluster, count(*) as size
ORDER BY size DESC LIMIT 10;
+--------------+
|cluster| size |
+--------------+
| 3 | 5625 |
| 1876 | 3 |
| 155 | 2 |
| 1091 | 2 |
| 1728 | 2 |
| 1177 | 2 |
| 337 | 2 |
| 3046 | 2 |
| 674 | 2 |
| 1948 | 2 |
+--------------+
10 rows
6615 ms
Yelp - PageRank
CALL algo.pageRank.stream(
'MATCH (b:Business:Marked)
RETURN id(b) as id',
'MATCH (b1:Business:Marked)-[r:B2B]-(b2)
RETURN id(b1) as source,
id(b2) as target',
{graph:'cypher'})
YIELD node, score
RETURN node.name, score
ORDER BY score DESC
LIMIT 10;
+-------------------------------------------------------+
| node.name | score |
+-------------------------------------------------------+
| "McCarran International Airport" | 27.49973599999999 |
| "Hash House A Go Go" | 19.062398000000005 |
| "Bachi Burger" | 18.1494385 |
| "Mon Ami Gabi" | 17.720350000000003 |
| "Bacchanal Buffet" | 15.783480500000003 |
| "Yard House Town Square" | 14.427296999999998 |
| "Secret Pizza" | 13.156547 |
| "Rollin Smoke Barbeque" | 12.808718499999998 |
| "Wicked Spoon" | 12.639942499999997 |
| "Monta Ramen" | 12.3904845 |
+-------------------------------------------------------+
10 rows
6979 ms
BitCoin
BitCoin Graph
• Full Copy of the BitCoin BlockChain
• from learnmeabitcoin.com (Greg Walker)
• 1.7 billion nodes, 2.7 billion rels
• 474k blocks, 240m tx, 280m addresses, 650m outputs
• 600 GB on disk
BitCoin Graph
BitCoin Graph
Distribution of "locked" relationships for "addresses"
(participation in transactions)
call apoc.stats.degrees('<locked');
+--------------------------------------------------------------------------------------------------------------+
| type | direction | total | p50 | p75 | p90 | p95 | p99 | p999 | max | min | mean |
+--------------------------------------------------------------------------------------------------------------+
| "locked" | "INCOMING" | 654662356 | 0 | 0 | 1 | 1 | 2 | 28 | 1891327 | 0 | 0.37588608290716047 |
+--------------------------------------------------------------------------------------------------------------+
1 row
308 seconds
BitCoin Graph
Inferred network of addresses, via transaction and output
(a1)<-[:locked]-(o1)-[:in]->(tx)-[:out]->(o2)-[:locked]->(a2)
CALL algo.unionFind.stream(
'match (o:output)-[:locked]->(a) with a limit 10000000 return id(a) as id',
'match (o:output)-[:locked]->(a) with o limit 10000000
match (o)-[:in]->(tx)-[:out]->(o2)-[:locked]->(a2)
return id(a) as source, id(a2) as target, count(tx) as weight',
{graph:'cypher'})
YIELD setId as cluster, nodeId
RETURN cluster, count(*) AS size
ORDER BY size DESC
LIMIT 10;
+-------------------+
| cluster | size |
+-------------------+
| 5036 | 4409420 |
| 6295282 | 1999 |
| 5839746 | 1488 |
| 9356302 | 833 |
| 6560901 | 733 |
| 6370777 | 637 |
| 8101710 | 392 |
| 5945867 | 369 |
| 2489036 | 264 |
| 1703620 | 203 |
+-------------------+
10 rows, 296 seconds
Implementation
Design Considerations
• Ease of Use – Call as Procedures
• Parallelize everything: load, compute, write
• Efficiency: Use direct access, efficient datastructures, provide
high-level API
• Scale to billions of nodes and relationships
Use up to hundreds of CPUs and Terabytes of RAM
1. Load Data in parallel
from Neo4j
2. Store in efficient data
structures
3. Run Graph Algorithm
in parallel using
Graph API
4. Write data back in
parallel
Neo4j
1, 2
Algorithm
Datastructures
4
3
Graph API
Architecture
Scale: 144 CPU
Neo4j Graph Platform with Neo4j Algorithms
vs. Apache Spark’s GraphX
0
50
100
150
200
250
300
350
400
450
Union-Find (Connected Components) PageRank
251
Seconds
152
416
124
Neo4j is
Significantly
Faster
Spark GraphX results publicly available
• Amazon EC2 cluster running 64-bit Linux
• 128 CPUs with 68 GB of memory, 2 hard disks
Neo4j Configuration
• Physical machine running 64-bit Linux
• 128 CPUs with 55 GB RAM, SSDs
Twitter 2010 Dataset
• 1.47 Billion Relationships
• 41.65 Million Nodes
GraphX
Neo4j
Neo4j
GraphX
Compute At Scale – Payment Graph
3,000,000,000 nodes and 18,000,000,000 relationships (600G)
PageRank (20 iterations) on 1 machine, 20 threads, 900G RAM
call algo.pageRank('Account','SENT',
{graph:'huge',iterations:20,write:true,concurrency:20});
+-------------------------------------------------------------------+
| nodes | iterations | loadMillis | computeMillis | writeMillis |
+-------------------------------------------------------------------+
| 300000000 | 20 | 401404 | 6024994 | 47106 |
+-------------------------------------------------------------------+
1 row 6473526 ms -> 1h 47min
We Need Your Feedback
• neo4j.com/slack at #neo4j-graph-algorithms
• github.com/neo4j-contrib/neo4j-graph-algorithms
• Whitepaper on neo4j.com/graph-analytics
Graphs are one of
the Unifying Themes of computer science . . .
That so many different structures
can be modeled using a single formalism
is a Source of Great Power
to the educated programmer.”
- Steven S. Skiena,
The Algorithm Design Manual
“
Kudos:
Paul Horn
Martin Knobloch from Avantgarde Labs
Tomasz Bratanic (docs)
Thank You!
Questions !?

Graph Analytics: Graph Algorithms Inside Neo4j

  • 1.
    Optimized Graph Algorithms inNeo4j Use the Power of Connections to Drive Discovery January 2018 Mark Needham Amy Hodler
  • 2.
    Mark Needham Software Engineer,Neo4j mark.needham@neo4j.com @markhneedham Next 50 Minutes • Why Use Graph Analytics • Randomness vs. Reality • Graph Analytics Takes Off • How to Run Graph Analytics • Neo4j Graph Analytics and Algorithms • Demos and Implementation Graph Algorithms Real-World Networks Amy E. Hodler Analytics Marketing, Neo4j amy.hodler@neo4j.com @amyhodler
  • 3.
  • 4.
    Forecast Complex NetworkBehavior and Prescribe Action
  • 5.
    Cascading Failures Airline Congestion- 2010 Source: “Systemic delay propagation in the US airport network” – Fleurquin, Ramasco, Eguiluz
  • 6.
  • 7.
    Bridge Points Languages –Telecom Network Source: “Fast unfolding of communities in large networks” – Blondel, Guillaume, Lambiotte, Lefebvre
  • 8.
    Extract Structure andModel Processes
  • 9.
  • 10.
    Preferential Attachment Nodes tend tolink to nodes that already have a lot of links Origins Debated • Local Mechanisms • Global Optimization • Mixed or Other Network Structures are Inseparable from Development
  • 11.
    Concentrated Distribution Source: “How StuffSpreads” – Pulsar Platform NodeswithkLinks Number of links (k) Many nodes with only a few links A few hubs with a large number of links Power Law Distribution
  • 12.
    “There is NoNetwork in Nature that we know of that would be described by the Random network model.” - Albert-László Barabási
  • 13.
    Small-World High local clustering andshort average path lengths. Hub and spoke architecture. Scale-Free Hub and spoke architecture preserved at multiple scales. High power law distribution. Random Average distributions. No structure or hierarchical patterns.
  • 14.
  • 15.
    The Lure ofAverages Source: Network Science - Barabasi Art: Ulysses and the Sirens – Herbert James Draper NodeswithkLinks Number of Links (k) Average Distribution - Random - Most nodes have the same number of links No highly connected nodes
  • 16.
    Resist The Lure ofAveragesNodeswithkLinks Number of Links (k) Average Distribution - Random - Most nodes have the same number of links No highly connected nodes NodeswithkLinks Number of links (k) Power Law Distribution - Scale-Free - Many nodes with only a few links A few hubs with a large number of links Source: Network Science - Barabasi
  • 17.
    Resist The Lure ofAveragesNodeswithkLinks Number of Links (k) Average Distribution - Random - Art: Ulysses and the Sirens – Herbert James Draper Most nodes have the same number of links No highly connected nodes You’ll Miss the Structure Hidden in Your Networks - Scale-Free - - Small World -
  • 18.
  • 19.
  • 20.
  • 21.
    Critical Mass • Collect,share and analyze massive connected data • Discovered common principles and structures • Existing mathematical tools • Unfulfilled promises of big data
  • 22.
  • 23.
    Insights from Algorithms GraphAlgorithms • Metrics • Relevance • Clustering • Structural Insights Machine Learning • Classification, Regression • NLP, Structural/Content Predictions • Neural Networks as Graphs • Graph As Compute Fabric
  • 24.
    Structures Can Hide Source:“Communities, modules and large-scale structure in networks“ - Mark Newman Source: “Hierarchical structure and the prediction of missing links in networks”; ”Structure and inference in annotated networks” - A. Clauset, C. Moore, and M.E.J. Newman.
  • 25.
    Graph of Thrones A.Beveridge: GoT - Interaction Graph from Books
  • 26.
    Graph of Thrones A.Beveridge: GoT - Interaction Graph from Books
  • 27.
    How to Run GraphAnalytics?
  • 28.
    Existing Options (sofar) •Data Processing •Spark with GraphX, Flink with Gelly •Dedicated Graph Processing • Urika, GraphLab, Giraph, Mosaic, GPS, Signal-Collect, Gradoop •Data Scientist Toolkit • igraph, NetworkX, Boost(graph-tool) in Python, R, C
  • 29.
    Drawbacks • Manage severaltools • Selection -> learning -> installation -> operation • Data selection, projection and transfer • Tedious and time consuming • Scalability • Especially classic data science tools
  • 30.
  • 31.
    Source: John Swain- Twitter Analytics Right Relevance Talk
  • 32.
    Many Moving Parts! ExampleWorkflow Pipeline Twitter Streaming API Python Tweet Collection (includes user data) Rabbit MQ MongoDB Neo4j R Scripts -Graph Stats -Community Detection MySQL Graph .graphml Tableau Graph Visualization Moved from Twitter Search API to Streaming API Replaced Python Twitter libraries (Tweepy) with raw API calls Streaming tweets in message queue Full tweets and user data stored in MongoDB Built graph for analysis in Neo4j from tweets persisted in MongoDB Analysis in R iGraph libraries for algorithms Some text analysis e.g. LDA topics Results published in MySQL for Tableau Graphml for import to Gephi with stats precalculated
  • 33.
    Our Goal Twitter Streaming API PythonTweet Collection (includes user data) Rabbit MQ MongoDB Neo4j R Scripts -Graph Stats -Community Detection MySQL Graph .graphml Tableau Graph Visualization Example Workflow Pipeline
  • 34.
  • 35.
    Neo4j Native Graph Database Analytics Integrations Cypher Query Language WideRange of APOC Procedures Optimized Graph Algorithms
  • 36.
    Finds the optimalpath or evaluates route availability and quality Evaluates how a group is clustered or partitioned Determines the importance of distinct nodes in the network
  • 37.
    1. Call asCypher procedure 2. Pass in specification (Label, Prop, Query) and configuration 3. ~.stream variant returns (a lot) of results CALL algo.<name>.stream('Label','TYPE',{conf}) YIELD nodeId, score 4. non-stream variant writes results to graph returns statistics CALL algo.<name>('Label','TYPE',{conf}) Usage
  • 38.
    Pass in Cypherstatement for node- and relationship-lists. CALL algo.<name>( 'MATCH ... RETURN id(n)', 'MATCH (n)-->(m) RETURN id(n) as source, id(m) as target', {graph:'cypher'}) Cypher Projection
  • 39.
    • PageRank (baseline) •Betweeness • Closeness • Degree Algorithms - Centralities Pathfinding Centrality Community Detection
  • 40.
    • Label Propagation •Union Find / WCC • Strongly Connected Components • Louvain • Triangle-Count / Clustering Coefficent Algorithms – Communitity Detection Pathfinding Community Detection Centrality
  • 41.
    • Single SourceShort Path • All-Nodes SSP • Parallel BFS / DFS Algorithms - Pathfinding Centrality Community Detection Pathfinding
  • 42.
    Iterate Quickly • Combinedata from sources into one graph • Project to relevant subgraphs • Enrich data with algorithms • Traverse, collect, filter aggregate with queries • Visualize, Explore, Decide, Export • From all APIs and Tools
  • 43.
  • 44.
    Datasets Yelp Business Graph •5m nodes • 17m relationships Bitcoin • 1.7bn nodes, • 2.7bn rels DBPedia • 11m nodes • 116m relationships
  • 45.
  • 46.
    DBPedia Shallow Copy ofWikipedia: (Page) -[:Link]-> (Page) CALL algo.pageRank.stream('Page', 'Link', {iterations:5}) YIELD node, score WITH * ORDER BY score DESC LIMIT 5 RETURN node.title, score; +--------------------------------------+ | node.title | score | +--------------------------------------+ | "United States" | 13349.2 | | "Animal" | 6077.77 | | "France" | 5025.61 | | "List of sovereign states" | 4913.92 | | "Germany" | 4662.32 | +--------------------------------------+ 5 rows 46 seconds
  • 47.
    DBPedia – LargestClusters CALL algo.labelPropagation(); // First 1M pages by Rank MATCH (n:Page) WITH n ORDER BY n.pagerank DESC LIMIT 1000000 // group by partition WITH n.partition AS partition, count(*) AS clusterSize, collect(n.title) AS pages // return most influential node for largest clusters RETURN pages[0] AS mainPage, pages[1..10] AS otherPages ORDER BY clusterSize DESC LIMIT 20
  • 48.
  • 49.
    Yelp • Business Reviewsby Users •Businesses have Categories and Locations •Users have Friends •Bi-partite-Network (:User)-->(:Business) projections (:User)<-->(:User) & (:Business)<-->(:Business)
  • 50.
    Yelp – Social- Statistics MATCH (u:User) where exists ( (u)-[:FRIENDS]-() ) WITH u.average_stars as stars, u.review_count as reviews, u.funny as funny RETURN max(stars),avg(stars),stdev(stars),max(reviews),avg(reviews),stdev(reviews),max(funny),avg(funny),stdev(funny); +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | max(stars) | avg(stars) | stdev(stars) | max(reviews) | avg(reviews) | stdev(reviews) | max(funny) | avg(funny) | stdev(funny) | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 5.0 | 3.8238072950764947 | 0.8862511758625753 | 11284 | 45.81704314022204 | 120.52419266925014 | 170896 | 36.26637835535585 | 731.6024752545679 | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ MATCH (u:User) where exists ( (u)-[:FRIENDS]-() ) WITH u.yelping_since as since RETURN substring(since,0,4) as year, count(*) as total ORDER BY year asc limit 10; +----------------+ | year | total | +----------------+ | "2004" | 64 | | "2005" | 844 | | "2006" | 4504 | | "2007" | 11833 | | "2008" | 20729 | | "2009" | 33965 | | "2010" | 53046 | | "2011" | 70331 | | "2012" | 62596 | | "2013" | 57330 | +----------------+
  • 51.
    Yelp – Social- PageRank call algo.pageRank.stream('User','FRIENDS') yield node,score with node,score order by score desc limit 10 return node {.name, .review_count, .average_stars,.useful,.yelping_since,.funny}, score, size( (node)<-[:FRIENDS]-()<-[:FRIENDS]-()) as in, size( (node)-[:FRIENDS]->()-[:FRIENDS]->()) as out; +-----------------------------------------------------------------------------------------------------------------------------------------------------+ | node | score | +-----------------------------------------------------------------------------------------------------------------------------------------------------+ | {funny -> 61200, name -> "Philip", average_stars -> 3.93, review_count -> 788, useful -> 69448, yelping_since -> "2007-06-09"} | 208.31336799999994 | | {funny -> 21432, name -> "Des", average_stars -> 3.88, review_count -> 78, useful -> 140024, yelping_since -> "2014-04-01"} | 201.28600150000003 | | {funny -> 465, name -> "Dallas", average_stars -> 4.17, review_count -> 330, useful -> 5517, yelping_since -> "2010-11-07"} | 192.164762 | | {funny -> 1019, name -> "Cara", average_stars -> 3.96, review_count -> 842, useful -> 11738, yelping_since -> "2010-07-21"} | 184.01898249999996 | | {funny -> 1233, name -> "Walker", average_stars -> 3.91, review_count -> 462, useful -> 12332, yelping_since -> "2007-01-25"} | 180.48898350000005 | | {funny -> 13432, name -> "Gabi", average_stars -> 4.05, review_count -> 1730, useful -> 20759, yelping_since -> "2007-08-10"} | 163.29424850000004 | | {funny -> 12848, name -> "Ruggy", average_stars -> 3.92, review_count -> 2118, useful -> 72265, yelping_since -> "2007-07-31"} | 161.87635500000002 | | {funny -> 9997, name -> "Bill", average_stars -> 3.38, review_count -> 595, useful -> 12074, yelping_since -> "2014-04-05"} | 157.0438075 | | {funny -> 1544, name -> "Ashley", average_stars -> 3.7, review_count -> 224, useful -> 1610, yelping_since -> "2009-09-29"} | 150.21423599999997 | | {funny -> 3599, name -> "Risa", average_stars -> 4.08, review_count -> 1044, useful -> 22121, yelping_since -> "2011-07-30"} | 138.20863199999997 | +-----------------------------------------------------------------------------------------------------------------------------------------------------+ 10 rows 3236 ms
  • 52.
    Yelp •Inferred network ofusers, via jointly reviewed businesses • (u1:User)-[:WROTE]->(review1)-[:REVIEWS]->(business)<-[:REVIEWS]-(review2)<-[:WROTE]-(u2:User) • 1,3bn paths • Inferred network of businesses, via jointly reviewed by user • (b1:Business)<-[:REVIEWS]-()<-[:WROTE]-(u)-[:WROTE]->()-[:REVIEWS]->(b2:Business) • 214m paths • subset: (b1:Business)-[:CO_OCCURENT_REVIEWS]-(b2:Business)
  • 53.
    Yelp •Inferred network ofusers, via jointly reviewed businesses • (u1:User)-[:WROTE]->(review1)-[:REVIEWS]->(business)<-[:REVIEWS]-(review2)<-[:WROTE]-(u2:User) • 1.3bn paths • Inferred network of businesses, via jointly reviewed by user • (b1:Business)<-[:REVIEWS]-()<-[:WROTE]-(u)-[:WROTE]->()-[:REVIEWS]->(b2:Business) • 214m paths
  • 54.
    Yelp – Business– Co-Occurrence •Find clusters of "similar" businesses •Find peer groups of similar people •Clusters of "interests"
  • 55.
    Yelp – Business– Co-Occurrence CALL apoc.periodic.iterate( 'MATCH (b:Business) WHERE size((b)<-[:REVIEWS]-()) > 5 AND b.city="Las Vegas" RETURN b', 'MATCH (b)<-[:REVIEWS]-(r1)<-[:WROTE]-(u)-[:WROTE]->(r2)-[:REVIEWS]->(b2) WHERE id(b) < id(b2) AND b2.city="Las Vegas" AND size((b2)<-[:REVIEWS]-()) > 5 AND r1.stars = r2.stars WITH b, b2, count(*) AS weight, avg(r1.stars) as rating where weight > 5 MERGE (b)-[cr:B2B]-(b2) ON CREATE SET cr.weight = weight, cr.rating = rating SET b:Marked, b2:Marked', {batchSize: 1});
  • 56.
    Yelp - ClusteringUnion Find CALL algo.unionFind.stream( 'MATCH (b:Business:Marked) RETURN id(b) as id’, 'MATCH (b1:Business:Marked)-[r:B2B]-(b2) RETURN id(b1) as source, id(b2) as target, count(r) as value', {graph:'cypher'}) YIELD setId as cluster, nodeId RETURN cluster, count(*) as size ORDER BY size DESC LIMIT 10; +--------------+ |cluster| size | +--------------+ | 3 | 5625 | | 1876 | 3 | | 155 | 2 | | 1091 | 2 | | 1728 | 2 | | 1177 | 2 | | 337 | 2 | | 3046 | 2 | | 674 | 2 | | 1948 | 2 | +--------------+ 10 rows 6615 ms
  • 57.
    Yelp - PageRank CALLalgo.pageRank.stream( 'MATCH (b:Business:Marked) RETURN id(b) as id', 'MATCH (b1:Business:Marked)-[r:B2B]-(b2) RETURN id(b1) as source, id(b2) as target', {graph:'cypher'}) YIELD node, score RETURN node.name, score ORDER BY score DESC LIMIT 10; +-------------------------------------------------------+ | node.name | score | +-------------------------------------------------------+ | "McCarran International Airport" | 27.49973599999999 | | "Hash House A Go Go" | 19.062398000000005 | | "Bachi Burger" | 18.1494385 | | "Mon Ami Gabi" | 17.720350000000003 | | "Bacchanal Buffet" | 15.783480500000003 | | "Yard House Town Square" | 14.427296999999998 | | "Secret Pizza" | 13.156547 | | "Rollin Smoke Barbeque" | 12.808718499999998 | | "Wicked Spoon" | 12.639942499999997 | | "Monta Ramen" | 12.3904845 | +-------------------------------------------------------+ 10 rows 6979 ms
  • 58.
  • 59.
    BitCoin Graph • FullCopy of the BitCoin BlockChain • from learnmeabitcoin.com (Greg Walker) • 1.7 billion nodes, 2.7 billion rels • 474k blocks, 240m tx, 280m addresses, 650m outputs • 600 GB on disk
  • 60.
  • 61.
    BitCoin Graph Distribution of"locked" relationships for "addresses" (participation in transactions) call apoc.stats.degrees('<locked'); +--------------------------------------------------------------------------------------------------------------+ | type | direction | total | p50 | p75 | p90 | p95 | p99 | p999 | max | min | mean | +--------------------------------------------------------------------------------------------------------------+ | "locked" | "INCOMING" | 654662356 | 0 | 0 | 1 | 1 | 2 | 28 | 1891327 | 0 | 0.37588608290716047 | +--------------------------------------------------------------------------------------------------------------+ 1 row 308 seconds
  • 62.
    BitCoin Graph Inferred networkof addresses, via transaction and output (a1)<-[:locked]-(o1)-[:in]->(tx)-[:out]->(o2)-[:locked]->(a2) CALL algo.unionFind.stream( 'match (o:output)-[:locked]->(a) with a limit 10000000 return id(a) as id', 'match (o:output)-[:locked]->(a) with o limit 10000000 match (o)-[:in]->(tx)-[:out]->(o2)-[:locked]->(a2) return id(a) as source, id(a2) as target, count(tx) as weight', {graph:'cypher'}) YIELD setId as cluster, nodeId RETURN cluster, count(*) AS size ORDER BY size DESC LIMIT 10; +-------------------+ | cluster | size | +-------------------+ | 5036 | 4409420 | | 6295282 | 1999 | | 5839746 | 1488 | | 9356302 | 833 | | 6560901 | 733 | | 6370777 | 637 | | 8101710 | 392 | | 5945867 | 369 | | 2489036 | 264 | | 1703620 | 203 | +-------------------+ 10 rows, 296 seconds
  • 63.
  • 64.
    Design Considerations • Easeof Use – Call as Procedures • Parallelize everything: load, compute, write • Efficiency: Use direct access, efficient datastructures, provide high-level API • Scale to billions of nodes and relationships Use up to hundreds of CPUs and Terabytes of RAM
  • 65.
    1. Load Datain parallel from Neo4j 2. Store in efficient data structures 3. Run Graph Algorithm in parallel using Graph API 4. Write data back in parallel Neo4j 1, 2 Algorithm Datastructures 4 3 Graph API Architecture
  • 66.
  • 67.
    Neo4j Graph Platformwith Neo4j Algorithms vs. Apache Spark’s GraphX 0 50 100 150 200 250 300 350 400 450 Union-Find (Connected Components) PageRank 251 Seconds 152 416 124 Neo4j is Significantly Faster Spark GraphX results publicly available • Amazon EC2 cluster running 64-bit Linux • 128 CPUs with 68 GB of memory, 2 hard disks Neo4j Configuration • Physical machine running 64-bit Linux • 128 CPUs with 55 GB RAM, SSDs Twitter 2010 Dataset • 1.47 Billion Relationships • 41.65 Million Nodes GraphX Neo4j Neo4j GraphX
  • 68.
    Compute At Scale– Payment Graph 3,000,000,000 nodes and 18,000,000,000 relationships (600G) PageRank (20 iterations) on 1 machine, 20 threads, 900G RAM call algo.pageRank('Account','SENT', {graph:'huge',iterations:20,write:true,concurrency:20}); +-------------------------------------------------------------------+ | nodes | iterations | loadMillis | computeMillis | writeMillis | +-------------------------------------------------------------------+ | 300000000 | 20 | 401404 | 6024994 | 47106 | +-------------------------------------------------------------------+ 1 row 6473526 ms -> 1h 47min
  • 69.
    We Need YourFeedback • neo4j.com/slack at #neo4j-graph-algorithms • github.com/neo4j-contrib/neo4j-graph-algorithms • Whitepaper on neo4j.com/graph-analytics
  • 70.
    Graphs are oneof the Unifying Themes of computer science . . . That so many different structures can be modeled using a single formalism is a Source of Great Power to the educated programmer.” - Steven S. Skiena, The Algorithm Design Manual “
  • 71.
    Kudos: Paul Horn Martin Knoblochfrom Avantgarde Labs Tomasz Bratanic (docs)
  • 72.