Graph Data Science at Scale

Neo4j, Inc. All rights reserved 2021
Jaimie Chung
Product Manager, Graph Data Science

“Relationships are the strongest predictors of behavior”
James Fowler
But You Can’t Analyse What You Can’t See
● Most data science techniques ignore relationships
● It’s painful to manually engineer connected features from tabular data
● Graphs are built on relationships, so…
● You don’t have to guess at the correlations: with graphs, relationships are built in

Analytics & Data Science Interest Exploding in Neo4j Community
According to Gartner, “Graphs form the foundation of modern D&A, with capabilities to enhance and improve user collaboration, ML models and explainable AI. The recent Gartner AI in Organizations Survey demonstrates that graph techniques are increasingly prevalent as AI maturity grows, going from 13% adoption when AI maturity is lowest to 48% when maturity is highest.”
Source: Top 10 Tech Trends in Data and Analytics, 16 Feb 2021
[Chart: AI Research Papers Featuring Graph. Source: Dimensions Knowledge System]
● 4x increase in traffic to Neo4j GDS page in 2H-2020
● 100k+ practicing data scientists engaged with Neo4j
● +210k downloads

Graphs & Data Science
● Knowledge Graphs: Find the patterns you’re looking for in connected data.
● Graph Algorithms: Use unsupervised machine learning techniques to identify associations, anomalies, and trends.
● Graph Native Machine Learning: Use embeddings to learn the features in your graph that you don’t even know are important yet. Train in-graph supervised ML models to predict links, labels, and missing data.

Neo4j’s Graph Data Science Framework
● Neo4j Graph Data Science Library: Scalable Graph Algorithms & Analytics Workspace
● Neo4j Database: Native Graph Creation & Persistence
● Neo4j Bloom: Visual Graph Exploration & Prototyping

The Neo4j GDS Library
Robust Graph Algorithms & ML methods
● Compute metrics about the topology and connectivity
● Build predictive models to enhance your graph
● Highly parallelized and scalable
Efficient & Flexible Analytics Workspace
● Automatically reshapes transactional graphs into an in-memory analytics graph
● Optimized for global traversals and aggregation
● Create workflows and layer algorithms
● Store and manage predictive models in the model catalog
[Diagram: Mutable In-Memory Workspace, Computational Graph, Native Graph Store]

Our Secret Sauce: The In-Memory Graph
GDS is fast and scalable because we transform your transactional graph into a custom-built data structure, optimized for parallel processing
• Neo4j automates data transformations
• Experiment with different data sets, data models
• Mutable representation to chain operations
• Production ready features, parallelization & enterprise support
• Ability to persist and version data
[Diagram: Mutable In-Memory Workspace, Computational Graph, Native Graph Store]
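The slides do not describe the in-memory structure itself. As a rough illustration of why a purpose-built layout helps analytics, here is a minimal sketch of a compressed sparse row (CSR) adjacency structure, a widely used layout for read-heavy graph workloads. It is an assumption chosen for illustration, not a description of GDS internals.

```python
def build_csr(node_count, edges):
    """Pack a directed edge list into two flat arrays (offsets, targets)."""
    degree = [0] * node_count
    for source, _ in edges:
        degree[source] += 1

    # offsets[i] .. offsets[i + 1] delimits the neighbors of node i in `targets`.
    offsets = [0] * (node_count + 1)
    for node in range(node_count):
        offsets[node + 1] = offsets[node] + degree[node]

    targets = [0] * len(edges)
    cursor = offsets[:-1]  # next free slot per node (slicing copies the list)
    for source, target in edges:
        targets[cursor[source]] = target
        cursor[source] += 1
    return offsets, targets


def neighbors(offsets, targets, node):
    """Neighbors of `node` are one contiguous slice, which is cheap to scan."""
    return targets[offsets[node]:offsets[node + 1]]


offsets, targets = build_csr(4, [(0, 1), (0, 2), (1, 2), (3, 0)])
print(neighbors(offsets, targets, 0))
```

Because each node's neighbors sit in one contiguous range, separate worker threads can scan disjoint node ranges without coordinating, which is the property a parallel analytics workspace needs.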

Outline
1. Architecture: How do you get data in? And what’s the right model?
2. Enterprise: The enterprise edition of GDS includes critical features for scale.
3. Algorithms: Some algorithms are better choices than others when you’ve got a lot of data.
4. Case studies: Let’s talk about some customers with big data sets - and how they use GDS.

Architecture

Architecture: GDS Instances
GDS runs on large, dedicated single instances that execute algorithm-only workloads
• Data is typically imported from a data lake or data warehouse
• Data is updated in batches at a regular interval
• Output is used for offline scoring or manual review by analytics teams
GDS does not run on a cluster
[Diagram: Cluster, Standalone GDS Instance]

Architecture: Database Sizing
GDS requirements:
● The amount of memory (heap) available determines if something can run
● The number of CPUs determines how fast something will run
Use estimator functions to know how much memory you need for your workflow
Don’t forget about the High Limit store format for really big datasets
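A minimal sketch of calling an estimator from Python, assuming the GDS 1.x procedure naming used elsewhere in these slides (`gds.graph.create.estimate`). The `session` argument is assumed to behave like a neo4j Python driver session, and comparing the upper bound against the heap size is a simplification for illustration, not a sizing rule.

```python
# '*' projects all node labels / relationship types in a native projection.
GRAPH_ESTIMATE = (
    "CALL gds.graph.create.estimate($nodeProjection, $relationshipProjection, {}) "
    "YIELD requiredMemory, bytesMin, bytesMax, nodeCount, relationshipCount "
    "RETURN requiredMemory, bytesMin, bytesMax, nodeCount, relationshipCount"
)


def projection_fits(session, heap_bytes, node_projection="*", relationship_projection="*"):
    """Return (fits, required_memory) for a native projection.

    `session` is assumed to expose run(query, **params) returning a result
    whose single() yields one record, as the neo4j Python driver does.
    """
    record = session.run(
        GRAPH_ESTIMATE,
        nodeProjection=node_projection,
        relationshipProjection=relationship_projection,
    ).single()
    # Compare against the upper bound so the check errs on the safe side.
    return record["bytesMax"] <= heap_bytes, record["requiredMemory"]
```

Algorithm procedures expose a matching estimate mode (for example `gds.pageRank.write.estimate`), so the same pattern can be repeated for each step of a workflow before anything is loaded.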

Architecture: Read Replicas
Read replicas for data science workflows can be used as:
• Analytics instances with dedicated capacity for querying/reporting without interrupting algorithms
• Visualization server for Bloom
• Warm backup for disaster recovery

Architecture: Data Import
Use Case
Requirements
Fastest method: Load data into an empty database using admin-import
For deltas: Consider how often you need to load data and use apoc.periodic.iterate
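A minimal sketch of a batched delta load, built as a Cypher string in Python. The `Customer` label, the `id` and `segment` properties, and the CSV file name are hypothetical; only `apoc.periodic.iterate` comes from the slide. The URL is interpolated directly for brevity, so it should only ever come from a trusted source.

```python
def delta_load_query(csv_url, batch_size=10_000):
    """Build an apoc.periodic.iterate call that merges CSV rows in batches."""
    return (
        "CALL apoc.periodic.iterate(\n"
        # First statement: produces the stream of rows to process.
        f"  \"LOAD CSV WITH HEADERS FROM '{csv_url}' AS row RETURN row\",\n"
        # Second statement: runs once per row, committed every `batch_size` rows.
        "  \"MERGE (c:Customer {id: row.id}) SET c.segment = row.segment\",\n"
        f"  {{batchSize: {batch_size}, parallel: false}}\n"
        ")\n"
        "YIELD batches, total, errorMessages\n"
        "RETURN batches, total, errorMessages"
    )


print(delta_load_query("file:///customers_delta.csv", batch_size=5_000))
```

Keeping `parallel: false` avoids lock contention when several rows may touch the same node; how large a batch to use depends on how often deltas arrive and how big they are.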

Architecture: Data Models
Choose a data model that fits the algorithms you want to run
Most graph algorithms expect monopartite graphs, but some expect multipartite graphs
A graph in the wrong shape can be manipulated using native graph loaders, e.g. collapsePath to create a monopartite graph
[Diagrams: Monopartite graph, Multipartite graph]
17
Enterprise Edition

Enterprise-Only Features for Scalability
GDS Enterprise Edition is built for scale and will maximize your odds for success!
● Enterprise Graph Compression: GDS EE uses a special graph compression technique that uses up to 75% less memory than community edition.
● Unlimited Parallelization: GDS algorithms are parallelized: EE lets you set concurrency > 4, so your algorithms compute as quickly as possible.

Our Implementations are Fast - and Getting Faster
Benchmark datasets:
● LDBC100 (LDBC Social Network Scale Factor 100): 300M+ nodes, 2B+ relationships
● LDBC100PKP (LDBC Social Network Scale Factor 100): 500k nodes, 46M+ relationships
Hardware: AWS EC2 R5D16XLarge, Intel Xeon Platinum 8000 (Skylake-SP or Cascade Lake), 64 logical cores, 512GB memory, 600GB NVMe-SSD storage
Runtimes:
● Node Similarity: 20min
● Betweenness Centrality: 10min
● Node2Vec: 2.8min
● Label Propagation: 46sec
● Weakly Connected Components: 36sec
● Triangle Counting: 24.8min
● Local Clustering Coefficient: 4.76min
● FastRP: 1.33min
● PageRank: 53sec
● Louvain: 14.66min

Parallel Processing Means Better Performance


Algorithms

Choosing Algorithms: Complexity
Consider computational complexity when choosing algorithms
Examples:
• Betweenness Centrality: traverses nodes multiple times
• All Pairs: traverses multiple paths in the graph
• Node Similarity: compares every node with every other node
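A quick back-of-the-envelope calculation shows why "compares every node with every other node" matters at scale. The sketch below counts unordered node pairs, n(n - 1)/2, for the two benchmark graph sizes quoted earlier in the deck. It is a naive upper bound for illustration, not the cost of the optimized GDS implementation.

```python
def pairwise_comparisons(node_count):
    """Number of unordered node pairs: n * (n - 1) / 2."""
    return node_count * (node_count - 1) // 2


for nodes in (1_000, 500_000, 300_000_000):
    print(f"{nodes:>11,} nodes -> {pairwise_comparisons(nodes):,} pairs")
```

Going from 1,000 to 500,000 nodes multiplies the node count by 500 but the pair count by roughly 250,000, which is why quadratic algorithms need pruning, sampling, or a smaller input graph.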

Choosing Algorithms: Substitutions
● Instead of GraphSAGE or Node2Vec, choose FastRP
Node2Vec and GraphSAGE are easy to understand but memory intensive. FastRP and FastRPExtended can calculate results for millions of nodes in seconds, and perform well!
● Instead of Node Similarity, choose KNN
Node Similarity has been optimized, but it’s still computationally intensive. KNN is an approximate nearest neighbors algorithm and you can adjust the sampling rate for speed.
● Instead of Louvain, choose Label Propagation
Everyone loves Louvain for finding fraud rings, but it doesn’t parallelize linearly. Label propagation uses a much faster algorithm - that parallelizes well - to find communities.
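To show why label propagation is cheap, here is a minimal sequential sketch in plain Python: every node repeatedly adopts the most common label among its neighbors until nothing changes. Breaking ties by the smallest label and visiting nodes in sorted order are assumptions made to keep this toy version deterministic. The GDS implementation is parallel and differs in detail.

```python
from collections import Counter


def label_propagation(adjacency, max_iterations=10):
    """adjacency: dict mapping node -> list of neighbor nodes (undirected)."""
    labels = {node: node for node in adjacency}  # every node starts alone
    for _ in range(max_iterations):
        changed = False
        for node in sorted(adjacency):
            if not adjacency[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[neighbor] for neighbor in adjacency[node])
            top = max(counts.values())
            # Tie-break on the smallest label so the result is reproducible.
            new_label = min(label for label, count in counts.items() if count == top)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


two_triangles = {
    0: [1, 2], 1: [0, 2], 2: [0, 1],
    3: [4, 5], 4: [3, 5], 5: [3, 4],
}
print(label_propagation(two_triangles))
```

Each pass only looks at a node's direct neighbors, so the work per iteration is proportional to the number of relationships, and nodes can be updated largely independently of one another.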

Running Algorithms: Native Projection
Native projections are orders of magnitude faster than Cypher projections
Techniques for native projections:
• collapsePath: updates your in-memory graph by traversing a specified pattern and creating relationships between start and end nodes
• Relationship aggregations
• Graph filtering: gds.beta.graph.create.subgraph

Running Algorithms: Named Graphs
Use named graphs, created with gds.graph.create, rather than anonymous graphs
Advantages:
• Decouples graph loading, algorithm execution, and writeback
• Can run more than one algorithm without loading each time
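A minimal sketch of a named-graph workflow, using the GDS 1.x procedure naming that appears in these slides. The graph name `customers`, the `Customer` label, the `KNOWS` relationship type, and the property names are hypothetical. The `session` argument is assumed to behave like a neo4j Python driver session.

```python
WORKFLOW = [
    # 1. Load once: project the stored graph into a named in-memory graph.
    "CALL gds.graph.create('customers', 'Customer', 'KNOWS')",
    # 2. Execute: each algorithm mutates the in-memory graph, not the database.
    "CALL gds.pageRank.mutate('customers', {mutateProperty: 'pagerank'})",
    "CALL gds.wcc.mutate('customers', {mutateProperty: 'componentId'})",
    # 3. Write back: persist all results in a single step.
    "CALL gds.graph.writeNodeProperties('customers', ['pagerank', 'componentId'])",
    # 4. Release the memory held by the named graph.
    "CALL gds.graph.drop('customers')",
]


def run_workflow(session, statements=WORKFLOW):
    """Run each statement in order and return how many were executed."""
    for statement in statements:
        # consume() discards the result rows and waits for completion.
        session.run(statement).consume()
    return len(statements)
```

The load happens once in step 1, both algorithms reuse the same projection, and nothing touches the stored graph until the single writeback in step 3.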

Running Algorithms: Pre-processing
Use subgraph filtering to preprocess your data: gds.beta.graph.create.subgraph
Use cases:
• Remove dense nodes that slow calculations
• Remove orphan nodes that are uninformative
• Isolate communities and execute algorithms on multiple subgraphs
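The effect of the first two use cases can be sketched in plain Python. GDS expresses the filter as predicates passed to the subgraph procedure. The function below only illustrates what such a filter does to the graph, and the degree thresholds are hypothetical.

```python
def filter_by_degree(adjacency, min_degree=1, max_degree=1000):
    """Keep nodes whose original degree lies within [min_degree, max_degree].

    adjacency: dict mapping node -> list of neighbor nodes (undirected).
    Relationships to removed nodes are dropped from the result.
    """
    kept = {
        node
        for node, neighbor_list in adjacency.items()
        if min_degree <= len(neighbor_list) <= max_degree
    }
    return {
        node: [neighbor for neighbor in adjacency[node] if neighbor in kept]
        for node in sorted(kept)
    }


graph = {
    "hub": ["a", "b", "c", "d"],  # dense node
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
    "d": ["hub"],
    "z": [],                       # orphan node
}
print(filter_by_degree(graph, min_degree=1, max_degree=3))
```

Note that `c` and `d` survive the filter but lose their only neighbor once the hub is removed, so a second pass may be needed if newly isolated nodes should also go.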

Running Algorithms: Concurrency
Don’t forget about concurrency!
Every algorithm supports the concurrency parameter
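A small sketch of building an algorithm configuration with an explicit concurrency value. The cap of 4 outside Enterprise Edition follows the Enterprise slide earlier in this deck. The helper name and the PageRank example are illustrative assumptions.

```python
COMMUNITY_MAX_CONCURRENCY = 4  # per the deck: EE lets you set concurrency > 4


def algorithm_config(requested_concurrency, enterprise, **options):
    """Return a GDS-style config map with a concurrency value that is valid
    for the edition in use."""
    if requested_concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    if enterprise:
        concurrency = requested_concurrency
    else:
        concurrency = min(requested_concurrency, COMMUNITY_MAX_CONCURRENCY)
    return {**options, "concurrency": concurrency}


# The map can be passed as a query parameter to any algorithm call, e.g.:
PAGERANK_STREAM = (
    "CALL gds.pageRank.stream($graphName, $config) "
    "YIELD nodeId, score RETURN nodeId, score"
)
print(algorithm_config(16, enterprise=True, maxIterations=20))
```

On the 64-core benchmark machine shown earlier, leaving concurrency at its default would leave most cores idle, so it is worth setting explicitly for every algorithm in a workflow.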

Case Studies

Identity Disambiguation
Client: Top Media Conglomerate
Graph: Cookie graph with tens of billions of nodes and hundreds of billions of relationships
● High limit store format
● Data model: simple
● Import: daily data refresh from data warehouse
● Workflow: only one algorithm (WCC) that runs daily
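The WCC step in this identity disambiguation workflow groups every cookie and device that is connected, directly or indirectly, into one component. A minimal union-find sketch in plain Python shows the idea. The node names are hypothetical, and this is a sequential illustration rather than the GDS implementation.

```python
def weakly_connected_components(nodes, edges):
    """Return a dict mapping each node to a representative of its component."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        root_a, root_b = find(source), find(target)
        if root_a != root_b:
            # Attach the larger root under the smaller one (deterministic).
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {node: find(node) for node in nodes}


nodes = ["cookie1", "cookie2", "cookie3", "device1", "device2"]
edges = [("cookie1", "device1"), ("cookie2", "device1"), ("cookie3", "device2")]
print(weakly_connected_components(nodes, edges))
```

Here `cookie1` and `cookie2` end up in the same component because they share `device1`, which is the signal used to treat them as one identity.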

Search Relevance and Product Recommendations
Client: Top Retailer
Graph: Insights graph with hundreds of millions of nodes and more than a billion relationships
● Data model: complex, heterogeneous nodes and relationship types
● Import: periodic data load from data warehouse
● Workflow:
○ Offline analysis
○ Generating graph embeddings using heterogeneous nodes and more than one relationship type, which requires a pipeline with multiple algorithms chained together

Customer Journey
Client: Top Video Streaming Platform
Graph: Customer event tracking graph with billions of nodes and tens of billions of relationships
● Data model: complex, heterogeneous nodes and relationship types
● Import: monthly data refresh
● Workflow:
○ Offline analysis
○ Requires a pipeline with multiple algorithms chained together

● Unmatched Power: Continually adding more graph algorithms, embeddings, & in-graph ML
● Extensible: Integrate with other data sources and ML platforms
● Streamlined: In-platform transformations and reshaping for fast iteration
● Scalable Data Science: Customers in production with tens of billions of nodes
● Strongest Community: 220K+ practitioners, 72K+ meetups
● Flexible Deployment: On-prem or in the Cloud

Questions?
