Neo4j, Inc. All rights reserved 2021
Graph Data Science at Scale
Jaimie Chung
Product Manager, Graph Data Science
What is Graph Data Science?
“Relationships are the strongest predictors of behavior” (James Fowler)
But You Can’t Analyse What You Can’t See
● Most data science techniques ignore relationships
● It’s painful to manually engineer connected features from tabular data
● Graphs are built on relationships, so you don’t have to guess at the correlations: with graphs, relationships are built in
According to Gartner, “Graphs form the foundation of modern D&A, with capabilities to enhance and improve user collaboration, ML models and explainable AI. The recent Gartner AI in Organizations Survey demonstrates that graph techniques are increasingly prevalent as AI maturity grows, going from 13% adoption when AI maturity is lowest to 48% when maturity is highest.”
Source: Top 10 Tech Trends in Data and Analytics, 16 Feb 2021
Analytics & Data Science Interest Exploding in Neo4j Community
● 4x increase in traffic to the Neo4j GDS page in 2H 2020
● 100k+ practicing data scientists engaged with Neo4j
● 210k+ downloads
[Chart: AI Research Papers Featuring Graph. Source: Dimensions Knowledge System]
Graphs & Data Science
Knowledge Graphs
● Find the patterns you’re looking for in connected data.
Graph Algorithms
● Use unsupervised machine learning techniques to identify associations, anomalies, and trends.
Graph Native Machine Learning
● Use embeddings to learn the features in your graph that you don’t even know are important yet.
● Train in-graph supervised ML models to predict links, labels, and missing data.
Neo4j’s Graph Data Science Framework
● Neo4j Graph Data Science Library: Scalable Graph Algorithms & Analytics Workspace
● Neo4j Database: Native Graph Creation & Persistence
● Neo4j Bloom: Visual Graph Exploration & Prototyping
The Neo4j GDS Library
Robust Graph Algorithms & ML Methods
● Compute metrics about the topology and connectivity
● Build predictive models to enhance your graph
● Highly parallelized and scalable
Efficient & Flexible Analytics Workspace
● Automatically reshapes transactional graphs into an in-memory analytics graph
● Optimized for global traversals and aggregation
● Create workflows and layer algorithms
● Store and manage predictive models in the model catalog
[Diagram labels: Mutable In-Memory Workspace, Computational Graph, Native Graph Store]
Our Secret Sauce: The In-Memory Graph
GDS is fast and scalable because we transform your transactional graph into a custom-built data structure, optimized for parallel processing
• Neo4j automates data transformations
• Experiment with different data sets and data models
• Mutable representation to chain operations
• Production-ready features, parallelization & enterprise support
• Ability to persist and version data
[Diagram labels: Mutable In-Memory Workspace, Computational Graph, Native Graph Store]
How does GDS run at scale?
Outline
1. Architecture: How do you get data in? And what’s the right model?
2. Enterprise: The enterprise edition of GDS includes critical features for scale.
3. Algorithms: Some algorithms are better choices than others when you’ve got a lot of data.
4. Case studies: Let’s talk about some customers with big data sets, and how they use GDS.
Architecture
Architecture: GDS Instances
GDS runs on large, dedicated single instances that execute algorithm-only workloads
• Data is typically imported from a data lake or data warehouse
• Data is updated in batches at a regular interval
• Output is used for offline scoring or manual review by analytics teams
GDS does not run on a cluster
[Diagram labels: Cluster, Standalone GDS Instance]
Architecture: Database Sizing
GDS requirements:
● The amount of memory (heap) available determines whether something can run
● The number of CPUs determines how fast something will run
Use estimator functions to find out how much memory you need for your workflow
Don’t forget about the High Limit store format for really big datasets
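As a rough sketch of the estimator functions mentioned above, the Python snippet below builds two Cypher calls using the GDS 1.x procedure naming that appears elsewhere in this deck: `gds.graph.create.estimate` for a projection, and the `.estimate` mode of an algorithm (PageRank here). The label `Person`, the relationship type `KNOWS`, the graph name `my-graph`, and all helper names are hypothetical. `session` is assumed to be any object with a `run(query, **params)` method, such as a Neo4j Python driver session.

```python
# Sketch only: procedure names follow GDS 1.x; label, type and graph name are made up.

def projection_estimate_query():
    """Cypher that estimates memory for a native projection without loading it."""
    return (
        "CALL gds.graph.create.estimate($nodeProjection, $relProjection, $config) "
        "YIELD requiredMemory, bytesMin, bytesMax, nodeCount, relationshipCount"
    )


def algorithm_estimate_query(algorithm="pageRank", mode="write"):
    """Cypher that estimates memory for running an algorithm on a named graph."""
    return (
        f"CALL gds.{algorithm}.{mode}.estimate($graphName, $config) "
        "YIELD requiredMemory, bytesMin, bytesMax"
    )


def estimate_workflow(session):
    """Run both estimates; `session` is assumed to expose run(query, **params)."""
    projection = session.run(
        projection_estimate_query(),
        nodeProjection="Person",   # hypothetical node label
        relProjection="KNOWS",     # hypothetical relationship type
        config={},
    )
    algorithm = session.run(
        algorithm_estimate_query(),
        graphName="my-graph",      # hypothetical named graph
        config={"writeProperty": "pagerank"},
    )
    return projection, algorithm
```

Comparing the reported `requiredMemory` against the configured heap before running anything is one way to check whether a workflow fits on the instance.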
Architecture: Read Replicas
Read replicas for data science workflows can be used as:
• Analytics instances with dedicated capacity for querying/reporting without interrupting algorithms
• Visualization server for Bloom
• Warm backup for disaster recovery
Architecture: Data Import
Use Case Requirements
Fastest method: load data into an empty database using neo4j-admin import
For deltas: consider how often you need to load data, and use apoc.periodic.iterate
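A batched delta load with `apoc.periodic.iterate` might look roughly like the sketch below. The CSV source, the `Customer` label, its properties, and the helper names are hypothetical, and `session` is again assumed to be any object with a `run(query, **params)` method.

```python
# Sketch only: the data source, label and property names are illustrative assumptions.

def delta_load_call():
    """Cypher that merges rows from a CSV file in batches via apoc.periodic.iterate."""
    return (
        "CALL apoc.periodic.iterate("
        # First statement: produces the stream of rows to process.
        "'LOAD CSV WITH HEADERS FROM $url AS row RETURN row', "
        # Second statement: applied to each row, committed once per batch.
        "'MERGE (c:Customer {id: row.id}) SET c.name = row.name', "
        "{batchSize: $batchSize, parallel: false, params: {url: $url}}) "
        "YIELD batches, total, errorMessages "
        "RETURN batches, total, errorMessages"
    )


def load_deltas(session, url, batch_size=10_000):
    """Run the batched load; `session` is assumed to expose run(query, **params)."""
    return session.run(delta_load_call(), url=url, batchSize=batch_size)
```

`parallel: false` is the conservative choice here, since concurrent `MERGE` operations on overlapping nodes can contend for locks.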
Architecture: Data Models
Choose a data model fit for the algorithms you want to run
Most graph algorithms expect monopartite graphs, but some expect multipartite graphs
A multipartite graph can be reshaped using native graph loaders, e.g. collapsePath to create a monopartite graph
[Diagram labels: Monopartite graph, Multipartite graph]
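To make the monopartite reshaping concrete, here is a small plain-Python illustration of the idea behind collapsing a two-hop path: people who bought the same product become directly linked. This is not how GDS implements collapsePath; the data and function name are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def collapse_to_monopartite(purchases):
    """Turn (person, product) pairs into person-person links via shared products."""
    buyers_by_product = defaultdict(set)
    for person, product in purchases:
        buyers_by_product[product].add(person)

    links = set()
    for buyers in buyers_by_product.values():
        # Every pair of buyers of the same product gets one undirected link.
        for a, b in combinations(sorted(buyers), 2):
            links.add((a, b))
    return sorted(links)


purchases = [
    ("alice", "book"),
    ("bob", "book"),
    ("bob", "lamp"),
    ("carol", "lamp"),
    ("dave", "pen"),
]
print(collapse_to_monopartite(purchases))
# [('alice', 'bob'), ('bob', 'carol')]
```

The resulting person-to-person graph has a single node type, which is the shape most community detection and centrality algorithms expect.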
Enterprise Edition
Enterprise-Only Features for Scalability
GDS Enterprise Edition is built for scale and will maximize your odds for success!
Enterprise Graph Compression
● GDS EE uses a special graph compression technique that uses up to 75% less memory than Community Edition.
Unlimited Parallelization
● GDS algorithms are parallelized: EE lets you set concurrency > 4, so your algorithms compute as quickly as possible.
Our Implementations are Fast - and Getting Faster
Datasets:
● LDBC100 Benchmark (LDBC Social Network Scale Factor 100): 300M+ nodes, 2B+ relationships
● LDBC100PKP (LDBC Social Network Scale Factor 100): 500k nodes, 46M+ relationships
Hardware: AWS EC2 r5d.16xlarge, Intel Xeon Platinum 8000 (Skylake-SP or Cascade Lake), 64 logical cores, 512GB memory, 600GB NVMe SSD storage
Runtimes:
● Node Similarity: 20 min
● Betweenness Centrality: 10 min
● Node2Vec: 2.8 min
● Label Propagation: 46 sec
● Weakly Connected Components: 36 sec
● Triangle Counting: 24.8 min
● Local Clustering Coefficient: 4.76 min
● FastRP: 1.33 min
● PageRank: 53 sec
● Louvain: 14.66 min
Parallel Processing Means Better Performance
Algorithms
Choosing Algorithms: Complexity
Consider computational complexity when choosing algorithms
Examples:
• Betweenness Centrality: traverses nodes multiple times
• All Pairs: traverses multiple paths in the graph
• Node Similarity: compares every node with every other node
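The cost of "every node with every other node" can be put into numbers with a short Python calculation. It counts the worst-case number of unordered node pairs, before any pruning an implementation might apply; the function name is illustrative.

```python
def all_pairs_comparisons(node_count):
    """Number of unordered node pairs: n * (n - 1) / 2."""
    return node_count * (node_count - 1) // 2


for n in (1_000, 1_000_000):
    print(f"{n:>9,} nodes -> {all_pairs_comparisons(n):,} pairs")
#     1,000 nodes -> 499,500 pairs
# 1,000,000 nodes -> 499,999,500,000 pairs
```

Growing the graph by a factor of 1,000 grows the pair count by roughly a factor of 1,000,000, which is why pairwise algorithms need extra care on large graphs.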
Choosing Algorithms: Substitutions
Instead of GraphSAGE or Node2Vec, choose FastRP
● Node2Vec and GraphSAGE are easy to understand but memory intensive.
● FastRP and FastRPExtended can calculate results for millions of nodes in seconds, and perform well!
Instead of Node Similarity, choose KNN
● Node Similarity has been optimized, but it’s still computationally intensive.
● KNN is an approximate nearest neighbors algorithm and you can adjust the sampling rate for speed.
Instead of Louvain, choose Label Propagation
● Everyone loves Louvain for finding fraud rings, but it doesn’t parallelize linearly.
● Label Propagation uses a much faster algorithm, one that parallelizes well, to find communities.
Running Algorithms: Native Projection
Native projections are orders of magnitude faster than Cypher projections
Techniques for native projections:
• collapsePath: updates your in-memory graph by traversing a specified pattern and creating relationships between the start and end nodes
• Relationship aggregations
• Graph filtering with gds.beta.graph.create.subgraph
Running Algorithms: Named Graphs
Use named graphs created with gds.graph.create, not anonymous graphs
Advantages:
• Decouples graph loading, algorithm execution, and writeback
• Can run more than one algorithm without loading each time
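The named-graph pattern could be sketched in Python as follows: project once, run several algorithms against the same in-memory graph, then drop it. Procedure names follow GDS 1.x as used in this deck. The label `Person`, relationship type `KNOWS`, property names, graph name, and the function name are hypothetical, and `session` is assumed to be any object with a `run(query, **params)` method.

```python
# Sketch only: label, relationship type, property names and graph name are made up.

def run_named_graph_workflow(session, graph_name="my-graph", concurrency=4):
    """Project once, run two algorithms on the same named graph, then free it."""
    steps = [
        # 1. Load: native projection into the graph catalog under a name.
        ("CALL gds.graph.create($graphName, $nodes, $rels)",
         {"graphName": graph_name, "nodes": "Person", "rels": "KNOWS"}),
        # 2. Execute: keep PageRank results in memory only (mutate mode).
        ("CALL gds.pageRank.mutate($graphName, "
         "{mutateProperty: 'pagerank', concurrency: $concurrency})",
         {"graphName": graph_name, "concurrency": concurrency}),
        # 3. Execute and write back: WCC component ids go to the database.
        ("CALL gds.wcc.write($graphName, "
         "{writeProperty: 'componentId', concurrency: $concurrency})",
         {"graphName": graph_name, "concurrency": concurrency}),
        # 4. Release the in-memory graph.
        ("CALL gds.graph.drop($graphName)", {"graphName": graph_name}),
    ]
    for query, params in steps:
        session.run(query, **params)
    return [query for query, _ in steps]
```

The `concurrency` value is passed to each algorithm call explicitly; values above 4 require Enterprise Edition, as noted in the Enterprise section.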
Running Algorithms: Pre-processing
Use subgraph filtering to preprocess your data: gds.beta.graph.create.subgraph
Use cases:
• Remove dense nodes that slow calculations
• Remove orphan nodes that are uninformative
• Isolate communities and execute algorithms on multiple subgraphs
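A subgraph filter that drops both orphan nodes and very dense nodes might be sketched like this in Python. It assumes the in-memory graph already carries a floating-point node property named `degree` (for example, computed beforehand with a degree centrality algorithm in mutate mode); that property, the thresholds, the graph names, and the helper names are all hypothetical.

```python
# Sketch only: the `degree` property, thresholds and graph names are assumptions.

def degree_filter(min_degree=1.0, max_degree=10_000.0):
    """Node filter expression keeping nodes whose degree lies within the bounds."""
    return f"n.degree >= {min_degree} AND n.degree <= {max_degree}"


def subgraph_query():
    """Cypher that creates a filtered copy of an existing named graph."""
    return (
        "CALL gds.beta.graph.create.subgraph("
        "$newGraph, $fromGraph, $nodeFilter, $relFilter) "
        "YIELD graphName, nodeCount, relationshipCount"
    )


def create_filtered_graph(session, from_graph="my-graph", new_graph="my-graph-filtered"):
    """Run the filter; `session` is assumed to expose run(query, **params)."""
    return session.run(
        subgraph_query(),
        newGraph=new_graph,
        fromGraph=from_graph,
        nodeFilter=degree_filter(),
        relFilter="*",  # keep all relationships between the remaining nodes
    )
```

Algorithms can then be run against the filtered graph by name, leaving the original projection untouched for comparison.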
Running Algorithms: Concurrency
Don’t forget about concurrency!
Every algorithm supports the concurrency parameter
Case Studies
Identity Disambiguation
Client: Top Media Conglomerate
Graph: Cookie graph with tens of billions of nodes and hundreds of billions of relationships
● High limit store format
● Data model: simple
● Import: daily data refresh from data warehouse
● Workflow: only one algorithm (WCC) that runs daily
Search Relevance and Product Recommendations
Client: Top Retailer
Graph: Insights graph with hundreds of millions of nodes and more than a billion relationships
● Data model: complex, heterogeneous nodes and relationship types
● Import: periodic data load from data warehouse
● Workflow:
○ Offline analysis
○ Generating graph embeddings using heterogeneous nodes and more than one relationship type, which requires a pipeline with multiple algorithms chained together
Customer Journey
Client: Top Video Streaming Platform
Graph: Customer event tracking graph with billions of nodes and tens of billions of relationships
● Data model: complex, heterogeneous nodes and relationship types
● Import: monthly data refresh
● Workflow:
○ Offline analysis
○ Requires a pipeline with multiple algorithms chained together
Unmatched Power: Continually adding more graph algorithms, embeddings, & in-graph ML
Extensible: Integrate with other data sources and ML platforms
Streamlined: In-platform transformations and reshaping for fast iteration
Scalable Data Science: Customers in production with tens of billions of nodes
Strongest Community: 220K+ practitioners, 72K+ meetups
Flexible Deployment: On-prem or in the Cloud
Questions?
