Graph Analytics: Graph Algorithms Inside Neo4j

Optimized Graph
Algorithms in Neo4j
Use the Power of Connections to Drive Discovery
January 2018
Mark Needham
Amy Hodler

Mark Needham
Software Engineer, Neo4j
mark.needham@neo4j.com
@markhneedham
Next 50 Minutes
• Why Use Graph Analytics
• Randomness vs. Reality
• Graph Analytics Takes Off
• How to Run Graph Analytics
• Neo4j Graph Analytics and Algorithms
• Demos and Implementation
Graph
Algorithms
Real-World
Networks
Amy E. Hodler
Analytics Marketing, Neo4j
amy.hodler@neo4j.com
@amyhodler

Understand. Predict. Prescribe.

Forecast Complex Network Behavior
and Prescribe Action

Cascading Failures
Airline Congestion - 2010
Source: “Systemic delay propagation in the US airport network” – Fleurquin, Ramasco, Eguiluz

Planning and Least
Cost Routing

Bridge Points
Languages – Telecom Network
Source: “Fast unfolding of communities in large networks” – Blondel, Guillaume, Lambiotte, Lefebvre

Extract Structure and Model Processes

Preferential
Attachment
Nodes tend to link to nodes
that already have a lot of links
Origins Debated
• Local Mechanisms
• Global Optimization
• Mixed or Other
Network Structures are Inseparable from Development

Concentrated
Distribution
Source: “How Stuff Spreads” – Pulsar Platform
NodeswithkLinks
Number of links (k)
Many nodes with only
a few links
A few hubs with a
large number of links
Power Law Distribution

“There is No Network in Nature that we
know of that would be described by the
Random network model.”
- Albert-László Barabási

Small-World
High local clustering
and short average path
lengths. Hub and spoke
architecture.
Scale-Free
Hub and spoke
architecture preserved
at multiple scales. High
power law distribution.
Random
Average distributions.
No structure or
hierarchical patterns.

The Lure of Averages
Source: Network Science - Barabasi
Art: Ulysses and the Sirens – Herbert James Draper
NodeswithkLinks
Number of Links (k)
Average Distribution
- Random -
Most nodes have the
same number of links
No highly
connected nodes

Resist The Lure
of AveragesNodeswithkLinks
Number of Links (k)
- Random -
Most nodes have the
No highly
connected nodes
NodeswithkLinks
Number of links (k)
Power Law Distribution
- Scale-Free -
Many nodes with only
a few links
A few hubs with a
large number of links

Resist The Lure
of AveragesNodeswithkLinks
Number of Links (k)
- Random -
Art: Ulysses and the Sirens – Herbert James Draper
Most nodes have the
No highly
connected nodes
You’ll Miss the Structure
Hidden in Your Networks
- Scale-Free -
- Small World -

#Finally!
Leonhard Euler 1707-1783

Critical Mass
• Collect, share and analyze
massive connected data
• Discovered common
principles and structures
• Existing mathematical tools
• Unfulfilled promises of
big data

Insights from Algorithms
Graph Algorithms
• Metrics
• Relevance
• Clustering
• Structural Insights
Machine Learning
• Classification, Regression
• NLP, Structural/Content
Predictions
• Neural Networks as Graphs
• Graph As Compute Fabric

Structures Can Hide
Source: “Communities, modules and large-scale structure in networks“ - Mark Newman
Source: “Hierarchical structure and the prediction of missing links in networks”; ”Structure and
inference in annotated networks” - A. Clauset, C. Moore, and M.E.J. Newman.

Graph of Thrones
A. Beveridge: GoT - Interaction Graph from Books

Existing Options (so far)
•Data Processing
•Spark with GraphX, Flink with Gelly
•Dedicated Graph Processing
• Urika, GraphLab, Giraph, Mosaic, GPS, Signal-Collect,
Gradoop
•Data Scientist Toolkit
• igraph, NetworkX, Boost(graph-tool) in Python, R, C

Drawbacks
• Manage several tools
• Selection -> learning ->
installation -> operation
• Data selection, projection and
transfer
• Tedious and time consuming
• Scalability
• Especially classic data
science tools

An Example
From Past GraphConnect

Source: John Swain - Twitter Analytics Right Relevance Talk

Many Moving Parts!
Example Workflow Pipeline
Twitter
Streaming API
Python Tweet
Collection
(includes user
data)
Rabbit
MQ
MongoDB
Neo4j
R Scripts
-Graph Stats
-Community
Detection
MySQL
Graph
.graphml
Tableau
Graph
Visualization
Moved from Twitter
Search API to
Streaming API
Replaced Python
Twitter libraries
(Tweepy) with raw API
calls
Streaming tweets in message queue
Full tweets and user data stored in
MongoDB
Built graph for analysis in Neo4j from
tweets persisted in MongoDB
Analysis in R
iGraph libraries for
algorithms
Some text analysis e.g.
LDA topics
Results published in MySQL
for Tableau
Graphml for import to Gephi
with stats precalculated

Our Goal
Twitter
Streaming API
Python Tweet
Collection
(includes user
data)
Rabbit
MQ
MongoDB
Neo4j
R Scripts
-Graph Stats
-Community
Detection
MySQL
Graph
.graphml
Tableau
Graph
Visualization
Example Workflow Pipeline

Neo4j Graph Analytics
and Algorithms

Neo4j
Native Graph
Database
Analytics
Integrations
Cypher Query
Language
Wide Range of
APOC Procedures
Optimized
Graph Algorithms

Finds the optimal path
or evaluates route
availability and quality
Evaluates how a
group is clustered
or partitioned
Determines the
importance of distinct
nodes in the network

1. Call as Cypher procedure
2. Pass in specification (Label, Prop, Query) and configuration
3. ~.stream variant returns (a lot) of results
CALL algo.<name>.stream('Label','TYPE',{conf})
YIELD nodeId, score
4. non-stream variant writes results to graph returns statistics
CALL algo.<name>('Label','TYPE',{conf})
Usage

Pass in Cypher statement for node- and relationship-lists.
CALL algo.<name>(
'MATCH ... RETURN id(n)',
'MATCH (n)-->(m)
RETURN id(n) as source,
id(m) as target', {graph:'cypher'})
Cypher Projection

• PageRank (baseline)
• Betweeness
• Closeness
• Degree
Algorithms - Centralities
Pathfinding
Centrality
Community
Detection

• Label Propagation
• Union Find / WCC
• Strongly Connected Components
• Louvain
• Triangle-Count / Clustering Coefficent
Algorithms – Communitity Detection
Pathfinding
Community
Detection
Centrality

• Single Source Short Path
• All-Nodes SSP
• Parallel BFS / DFS
Algorithms - Pathfinding
Centrality Community
Detection
Pathfinding

Iterate Quickly
• Combine data from sources into one graph
• Project to relevant subgraphs
• Enrich data with algorithms
• Traverse, collect, filter aggregate with queries
• Visualize, Explore, Decide, Export
• From all APIs and Tools

Datasets
Yelp Business Graph
• 5m nodes
• 17m relationships
Bitcoin
• 1.7bn nodes,
• 2.7bn rels
DBPedia
• 11m nodes
• 116m relationships

DBPedia
Shallow Copy of Wikipedia: (Page) -[:Link]-> (Page)
CALL algo.pageRank.stream('Page', 'Link', {iterations:5}) YIELD node, score
WITH *
ORDER BY score DESC
LIMIT 5
RETURN node.title, score;
+--------------------------------------+
| node.title | score |
+--------------------------------------+
| "United States" | 13349.2 |
| "Animal" | 6077.77 |
| "France" | 5025.61 |
| "List of sovereign states" | 4913.92 |
| "Germany" | 4662.32 |
+--------------------------------------+
5 rows 46 seconds

DBPedia – Largest Clusters
CALL algo.labelPropagation();
// First 1M pages by Rank
MATCH (n:Page)
WITH n
ORDER BY n.pagerank DESC
LIMIT 1000000
// group by partition
WITH n.partition AS partition,
count(*) AS clusterSize,
collect(n.title) AS pages
// return most influential node for largest clusters
RETURN pages[0] AS mainPage,
pages[1..10] AS otherPages
ORDER BY clusterSize DESC
LIMIT 20

Yelp
• Business Reviews by Users
•Businesses have Categories and Locations
•Users have Friends
•Bi-partite-Network (:User)-->(:Business)
projections (:User)<-->(:User) &
(:Business)<-->(:Business)

Yelp – Social - Statistics
MATCH (u:User) where exists ( (u)-[:FRIENDS]-() )
WITH u.average_stars as stars, u.review_count as reviews, u.funny as funny
RETURN max(stars),avg(stars),stdev(stars),max(reviews),avg(reviews),stdev(reviews),max(funny),avg(funny),stdev(funny);
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| max(stars) | avg(stars) | stdev(stars) | max(reviews) | avg(reviews) | stdev(reviews) | max(funny) | avg(funny) | stdev(funny) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 5.0 | 3.8238072950764947 | 0.8862511758625753 | 11284 | 45.81704314022204 | 120.52419266925014 | 170896 | 36.26637835535585 | 731.6024752545679 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
MATCH (u:User) where exists ( (u)-[:FRIENDS]-() )
WITH u.yelping_since as since
RETURN substring(since,0,4) as year, count(*) as total
ORDER BY year asc limit 10;
+----------------+
| year | total |
+----------------+
| "2004" | 64 |
| "2005" | 844 |
| "2006" | 4504 |
| "2007" | 11833 |
| "2008" | 20729 |
| "2009" | 33965 |
| "2010" | 53046 |
| "2011" | 70331 |
| "2012" | 62596 |
| "2013" | 57330 |
+----------------+

Yelp – Social - PageRank
call algo.pageRank.stream('User','FRIENDS')
yield node,score with node,score
order by score desc limit 10
return node {.name, .review_count, .average_stars,.useful,.yelping_since,.funny},
score,
size( (node)<-[:FRIENDS]-()<-[:FRIENDS]-()) as in,
size( (node)-[:FRIENDS]->()-[:FRIENDS]->()) as out;
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| node | score |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| {funny -> 61200, name -> "Philip", average_stars -> 3.93, review_count -> 788, useful -> 69448, yelping_since -> "2007-06-09"} | 208.31336799999994 |
| {funny -> 21432, name -> "Des", average_stars -> 3.88, review_count -> 78, useful -> 140024, yelping_since -> "2014-04-01"} | 201.28600150000003 |
| {funny -> 465, name -> "Dallas", average_stars -> 4.17, review_count -> 330, useful -> 5517, yelping_since -> "2010-11-07"} | 192.164762 |
| {funny -> 1019, name -> "Cara", average_stars -> 3.96, review_count -> 842, useful -> 11738, yelping_since -> "2010-07-21"} | 184.01898249999996 |
| {funny -> 1233, name -> "Walker", average_stars -> 3.91, review_count -> 462, useful -> 12332, yelping_since -> "2007-01-25"} | 180.48898350000005 |
| {funny -> 13432, name -> "Gabi", average_stars -> 4.05, review_count -> 1730, useful -> 20759, yelping_since -> "2007-08-10"} | 163.29424850000004 |
| {funny -> 12848, name -> "Ruggy", average_stars -> 3.92, review_count -> 2118, useful -> 72265, yelping_since -> "2007-07-31"} | 161.87635500000002 |
| {funny -> 9997, name -> "Bill", average_stars -> 3.38, review_count -> 595, useful -> 12074, yelping_since -> "2014-04-05"} | 157.0438075 |
| {funny -> 1544, name -> "Ashley", average_stars -> 3.7, review_count -> 224, useful -> 1610, yelping_since -> "2009-09-29"} | 150.21423599999997 |
| {funny -> 3599, name -> "Risa", average_stars -> 4.08, review_count -> 1044, useful -> 22121, yelping_since -> "2011-07-30"} | 138.20863199999997 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows
3236 ms

Yelp
•Inferred network of users, via jointly reviewed businesses
• (u1:User)-[:WROTE]->(review1)-[:REVIEWS]->(business)<-[:REVIEWS]-(review2)<-[:WROTE]-(u2:User)
• 1,3bn paths
• Inferred network of businesses, via jointly reviewed by user
• (b1:Business)<-[:REVIEWS]-()<-[:WROTE]-(u)-[:WROTE]->()-[:REVIEWS]->(b2:Business)
• 214m paths
• subset: (b1:Business)-[:CO_OCCURENT_REVIEWS]-(b2:Business)

Yelp
•Inferred network of users, via jointly reviewed businesses
• (u1:User)-[:WROTE]->(review1)-[:REVIEWS]->(business)<-[:REVIEWS]-(review2)<-[:WROTE]-(u2:User)
• 1.3bn paths
• Inferred network of businesses, via jointly reviewed by user
• (b1:Business)<-[:REVIEWS]-()<-[:WROTE]-(u)-[:WROTE]->()-[:REVIEWS]->(b2:Business)
• 214m paths

Yelp – Business – Co-Occurrence
•Find clusters of "similar" businesses
•Find peer groups of similar people
•Clusters of "interests"

Yelp – Business – Co-Occurrence
CALL apoc.periodic.iterate(
'MATCH (b:Business)
WHERE size((b)<-[:REVIEWS]-()) > 5 AND b.city="Las Vegas"
RETURN b',
'MATCH (b)<-[:REVIEWS]-(r1)<-[:WROTE]-(u)-[:WROTE]->(r2)-[:REVIEWS]->(b2)
WHERE id(b) < id(b2) AND b2.city="Las Vegas"
AND size((b2)<-[:REVIEWS]-()) > 5
AND r1.stars = r2.stars
WITH b, b2, count(*) AS weight, avg(r1.stars) as rating where weight > 5
MERGE (b)-[cr:B2B]-(b2)
ON CREATE SET cr.weight = weight, cr.rating = rating
SET b:Marked, b2:Marked',
{batchSize: 1});

Yelp - Clustering Union Find
CALL algo.unionFind.stream(
'MATCH (b:Business:Marked) RETURN id(b) as id’,
'MATCH (b1:Business:Marked)-[r:B2B]-(b2)
RETURN id(b1) as source,
id(b2) as target,
count(r) as value',
{graph:'cypher'}) YIELD setId as cluster, nodeId
RETURN cluster, count(*) as size
ORDER BY size DESC LIMIT 10;
+--------------+
|cluster| size |
+--------------+
| 3 | 5625 |
| 1876 | 3 |
| 155 | 2 |
| 1091 | 2 |
| 1728 | 2 |
| 1177 | 2 |
| 337 | 2 |
| 3046 | 2 |
| 674 | 2 |
| 1948 | 2 |
+--------------+
10 rows
6615 ms

Yelp - PageRank
CALL algo.pageRank.stream(
'MATCH (b:Business:Marked)
RETURN id(b) as id',
'MATCH (b1:Business:Marked)-[r:B2B]-(b2)
RETURN id(b1) as source,
id(b2) as target',
{graph:'cypher'})
YIELD node, score
RETURN node.name, score
ORDER BY score DESC
LIMIT 10;
+-------------------------------------------------------+
| node.name | score |
+-------------------------------------------------------+
| "McCarran International Airport" | 27.49973599999999 |
| "Hash House A Go Go" | 19.062398000000005 |
| "Bachi Burger" | 18.1494385 |
| "Mon Ami Gabi" | 17.720350000000003 |
| "Bacchanal Buffet" | 15.783480500000003 |
| "Yard House Town Square" | 14.427296999999998 |
| "Secret Pizza" | 13.156547 |
| "Rollin Smoke Barbeque" | 12.808718499999998 |
| "Wicked Spoon" | 12.639942499999997 |
| "Monta Ramen" | 12.3904845 |
+-------------------------------------------------------+
10 rows
6979 ms

BitCoin Graph
• Full Copy of the BitCoin BlockChain
• from learnmeabitcoin.com (Greg Walker)
• 1.7 billion nodes, 2.7 billion rels
• 474k blocks, 240m tx, 280m addresses, 650m outputs
• 600 GB on disk

BitCoin Graph
Distribution of "locked" relationships for "addresses"
(participation in transactions)
call apoc.stats.degrees('<locked');
+--------------------------------------------------------------------------------------------------------------+
| type | direction | total | p50 | p75 | p90 | p95 | p99 | p999 | max | min | mean |
+--------------------------------------------------------------------------------------------------------------+
| "locked" | "INCOMING" | 654662356 | 0 | 0 | 1 | 1 | 2 | 28 | 1891327 | 0 | 0.37588608290716047 |
+--------------------------------------------------------------------------------------------------------------+
1 row
308 seconds

BitCoin Graph
Inferred network of addresses, via transaction and output
(a1)<-[:locked]-(o1)-[:in]->(tx)-[:out]->(o2)-[:locked]->(a2)
CALL algo.unionFind.stream(
'match (o:output)-[:locked]->(a) with a limit 10000000 return id(a) as id',
'match (o:output)-[:locked]->(a) with o limit 10000000
match (o)-[:in]->(tx)-[:out]->(o2)-[:locked]->(a2)
return id(a) as source, id(a2) as target, count(tx) as weight',
{graph:'cypher'})
YIELD setId as cluster, nodeId
RETURN cluster, count(*) AS size
ORDER BY size DESC
LIMIT 10;
+-------------------+
| cluster | size |
+-------------------+
| 5036 | 4409420 |
| 6295282 | 1999 |
| 5839746 | 1488 |
| 9356302 | 833 |
| 6560901 | 733 |
| 6370777 | 637 |
| 8101710 | 392 |
| 5945867 | 369 |
| 2489036 | 264 |
| 1703620 | 203 |
+-------------------+
10 rows, 296 seconds

Design Considerations
• Ease of Use – Call as Procedures
• Parallelize everything: load, compute, write
• Efficiency: Use direct access, efficient datastructures, provide
high-level API
• Scale to billions of nodes and relationships
Use up to hundreds of CPUs and Terabytes of RAM

1. Load Data in parallel
from Neo4j
2. Store in efficient data
structures
3. Run Graph Algorithm
in parallel using
Graph API
4. Write data back in
parallel
Neo4j
1, 2
Algorithm
Datastructures
4
3
Graph API
Architecture

Neo4j Graph Platform with Neo4j Algorithms
vs. Apache Spark’s GraphX
0
50
100
150
200
250
300
350
400
450
Union-Find (Connected Components) PageRank
251
Seconds
152
416
124
Neo4j is
Significantly
Faster
Spark GraphX results publicly available
• Amazon EC2 cluster running 64-bit Linux
• 128 CPUs with 68 GB of memory, 2 hard disks
Neo4j Configuration
• Physical machine running 64-bit Linux
• 128 CPUs with 55 GB RAM, SSDs
Twitter 2010 Dataset
• 1.47 Billion Relationships
• 41.65 Million Nodes
GraphX
Neo4j
Neo4j
GraphX

Compute At Scale – Payment Graph
3,000,000,000 nodes and 18,000,000,000 relationships (600G)
PageRank (20 iterations) on 1 machine, 20 threads, 900G RAM
call algo.pageRank('Account','SENT',
{graph:'huge',iterations:20,write:true,concurrency:20});
+-------------------------------------------------------------------+
| nodes | iterations | loadMillis | computeMillis | writeMillis |
+-------------------------------------------------------------------+
| 300000000 | 20 | 401404 | 6024994 | 47106 |
+-------------------------------------------------------------------+
1 row 6473526 ms -> 1h 47min

We Need Your Feedback
• neo4j.com/slack at #neo4j-graph-algorithms
• github.com/neo4j-contrib/neo4j-graph-algorithms
• Whitepaper on neo4j.com/graph-analytics

Graphs are one of
the Unifying Themes of computer science . . .
That so many different structures
can be modeled using a single formalism
is a Source of Great Power
to the educated programmer.”
- Steven S. Skiena,
The Algorithm Design Manual
“

Kudos:
Paul Horn
Martin Knobloch from Avantgarde Labs
Tomasz Bratanic (docs)

Graph Analytics: Graph Algorithms Inside Neo4j

More Related Content

What's hot

Similar to Graph Analytics: Graph Algorithms Inside Neo4j

More from Neo4j

Recently uploaded

Graph Analytics: Graph Algorithms Inside Neo4j