Improve ML Predictions using Graph
Algorithms
Mark Needham, Neo4j
Amy Hodler, Neo4j
May 2019
#Neo4j
#GraphAnalytics
• Graphs for Predictions
• Connected Features
• Link Prediction
• Neo4j + Spark Workflow
Amy E. Hodler
Graph Analytics & AI Program
Manager, Neo4j
Amy.Hodler@neo4j.com
@amyhodler
neo4j.com/
graph-algorithms-book
Chapter 8: Graph + ML
Spark & Neo4j
Mark Needham
Developer Relations Engineer,
Neo4j
Mark.needham@neo4j.com
@markHneedham
What in Common is Predictive?
Relationships:
Strongest Predictors of Behavior!
“Increasingly we're learning that you can make
better predictions about people by getting all the
information from their friends and their friends’
friends than you can from the information you
have about the person themselves”
James Fowler David Burkus
James Fowler
Albert-Laszlo
Barabasi
Native Graph Platforms are Designed for Connected Data
TRADITIONAL
PLATFORMS
BIG DATA
TECHNOLOGY
Store and retrieve data Aggregate and filter data Connections in data
Real time storage & retrieval Real-Time Connected Insights
Long running queries
aggregation & filtering
“Our Neo4j solution is literally thousands of times faster
than the prior MySQL solution, with queries that require
10-100 times less code”
Volker Pacher, Senior Developer
Max # of hops ~3
Millions
Graph Database Surging in Popularity
Trends since Jan 2013
DB-Engines.com
Graph Data Science Applications
• Current data science models ignore network structure & complex relationships
• Graphs add highly predictive features to existing ML models
• Otherwise unattainable predictions based on relationships
Novel & More Accurate Predictions
with the Data You Already Have
Machine Learning Pipeline
Connected Features
Connection-related metrics about our graph, such
as the number of relationships going into or out of
nodes, a count of potential triangles, or neighbors in
common.
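As a minimal sketch of such metrics, the Python below counts the relationships going into and out of each node in a small directed graph. The edge list and node names are made up for illustration; they are not from the talk's dataset.

```python
from collections import Counter

# Hypothetical directed relationships (source, target); illustration only.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

out_degree = Counter(src for src, _ in edges)  # relationships going out of a node
in_degree = Counter(dst for _, dst in edges)   # relationships coming into a node

for node in sorted({n for edge in edges for n in edge}):
    print(node, "in:", in_degree[node], "out:", out_degree[node])
```

Each count becomes one column in the feature table that is later fed to a model.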
What Are Connected Features?
Query (e.g. Cypher)
Real-time, local decisioning
and pattern matching
Graph Algorithms Libraries
Global analysis
and iterations
You know what you’re looking
for and making a decision
You’re learning the overall structure of a
network, updating data, and predicting
Local
Patterns
Global
Computation
Deriving Connected Features
Connected Feature Engineering
Feature Engineering is how we combine and process the data to create new,
more meaningful features, such as clustering or connectivity metrics.
Add More Descriptive Features:
- Influence
- Relationships
- Communities
Extraction
Graph Feature Categories & Algorithms
Pathfinding
& Search
Finds the optimal paths or evaluates
route availability and quality
Centrality /
Importance
Determines the importance of
distinct nodes in the network
Community
Detection
Detects group clustering or
partition options
Heuristic
Link Prediction
Estimates the likelihood of nodes
forming a relationship
Similarity
Evaluates how alike nodes are
Embeddings
Learned representations
of connectivity or topology
Link Prediction
Can we infer new interactions in the future?
What unobserved facts are we missing?
+ 50 years of biomedical data
integrated in a knowledge
graph
Predicting new uses for drugs
by using the graph structure to
create features for link
prediction
Example: het.io
Methods for Link Prediction
Algorithm Measures
Run targeted algorithms and score
outcomes
Set a threshold value used to predict a
link between nodes
Machine Learning
Use the measures as features to train an
ML model
Community
Detection
Link
Prediction
Similarity
1st Node  2nd Node  Common Neighbors  Preferential Attachment  label
1         2         4                 15                       1
3         4         7                 12                       1
5         6         1                 1                        0
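The threshold method can be sketched in a few lines of Python. The rows are copied from the example table; the cut-off value of 2 is an assumption chosen only for illustration, not a recommended setting.

```python
# Rows from the example table:
# (node1, node2, common_neighbors, preferential_attachment, label)
rows = [
    (1, 2, 4, 15, 1),
    (3, 4, 7, 12, 1),
    (5, 6, 1, 1, 0),
]

COMMON_NEIGHBORS_THRESHOLD = 2  # assumed cut-off, for illustration only


def predict_link(common_neighbors, threshold=COMMON_NEIGHBORS_THRESHOLD):
    """Predict a link (1) when the score reaches the threshold, else 0."""
    return 1 if common_neighbors >= threshold else 0


predictions = [predict_link(cn) for _, _, cn, _, _ in rows]
labels = [label for *_, label in rows]
print(predictions, labels)
```

The machine learning method replaces the hand-picked threshold with a model that learns from several such measures at once.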
Centrality
Example:
Predicting Collaboration
• Citation Network Dataset - Research Dataset
– “ArnetMiner: Extraction and Mining of Academic Social Networks”, by
J. Tang et al
– Used a subset with 52K papers, 80K authors, 140K author
relationships and 29K citation relationships
• Neo4j
– Create a co-authorship graph and connected feature engineering
• Spark and MLlib
– Train and test our model using a random forest classifier
Predicting Collaboration
with a Graph Enhanced ML Model
Our Link Prediction Workflow
Import Data
Create Co-Author
Graph
Extract Data &
Store as Graph
Explore, Clean,
Modify
Prepare for
Machine Learning
Train
Models
Evaluate Results
Productionize
Our Link Prediction Workflow
Identified sparse
feature areas
Feature
Engineering:
New graphy
features
Graph Algorithms Used for
Feature Engineering (few examples)
Common Neighbors measures the closeness of
nodes based on shared neighbors (triadic closure)
Preferential Attachment scores a pair of nodes by the product of
their neighbor counts: well-connected nodes attract more links
Illustration be.amazd.com/link-prediction/
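A minimal Python sketch of the two measures, using their standard definitions: common neighbors counts the neighbors two nodes share, and preferential attachment multiplies the two nodes' neighbor counts. The co-author names and edges are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected co-authorship edges; illustration only.
edges = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"), ("cat", "dan"),
]

neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)


def common_neighbors(u, v):
    """Number of neighbors that u and v share."""
    return len(neighbors[u] & neighbors[v])


def preferential_attachment(u, v):
    """Product of the two nodes' neighbor counts."""
    return len(neighbors[u]) * len(neighbors[v])


# ann and dan are not linked, but both have worked with bob and cat.
print(common_neighbors("ann", "dan"), preferential_attachment("ann", "dan"))
```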
Graph Algorithms Used for
Feature Engineering (few examples)
Triangle counting and clustering coefficients measure the
density of connections around nodes
Louvain Modularity identifies interacting communities and
hierarchies
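Triangle counts and the local clustering coefficient can be sketched as follows. The graph is hypothetical, and this is a plain-Python illustration rather than the library implementation used in the talk.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical undirected edges; illustration only.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)


def triangle_count(node):
    """Number of triangles that pass through `node`."""
    pairs = combinations(sorted(neighbors[node]), 2)
    return sum(1 for x, y in pairs if y in neighbors[x])


def clustering_coefficient(node):
    """Fraction of the node's neighbor pairs that are themselves connected."""
    degree = len(neighbors[node])
    if degree < 2:
        return 0.0
    return 2 * triangle_count(node) / (degree * (degree - 1))


for node in ["a", "b", "c", "d"]:
    print(node, triangle_count(node), round(clustering_coefficient(node), 3))
```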
Our Link Prediction Workflow
Train / Test Split
Resample:
Downsampled for
proportional
representation
Test/Train Split
1st Node  2nd Node  Common Neighbors  Preferential Attachment  label
1         2         4                 15                       1
3         4         7                 12                       1
5         6         1                 1                        0
2         12        3                 3                        0
4         9         4                 8                        1
7         10        12                36                       1
8         11        2                 3                        0
Train
Test
OMG I’m Good!
Data Leakage!
Graph metric computation for the train set
touches data from the test set.
Did you get really high accuracy on your first
run without tuning?
Train and Test Graphs: Time Based Split
Train (< 2006)
1st Node  2nd Node  Common Neighbors  Preferential Attachment  label
1         2         4                 15                       1
3         4         7                 12                       1
5         6         1                 1                        0
Test (>= 2006)
1st Node  2nd Node  Common Neighbors  Preferential Attachment  label
2         12        3                 3                        0
4         9         4                 8                        1
7         10        12                36                       1
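A minimal sketch of a time-based split in Python. The co-authorship records are hypothetical; only the 2006 cut-off comes from the slide. The point is that graph metrics for training are computed from the earlier graph alone, so they never touch test-period relationships.

```python
from collections import defaultdict

# Hypothetical records: (author1, author2, year of first joint paper).
coauthorships = [
    ("a", "b", 2001),
    ("b", "c", 2004),
    ("a", "c", 2006),
    ("c", "d", 2009),
]

SPLIT_YEAR = 2006  # cut-off shown on the slide

train_edges = [(u, v) for u, v, year in coauthorships if year < SPLIT_YEAR]
test_edges = [(u, v) for u, v, year in coauthorships if year >= SPLIT_YEAR]


def build_neighbors(edges):
    """Adjacency sets built from one edge list only."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


train_graph = build_neighbors(train_edges)

# Common neighbors of a and c, seen through the pre-2006 graph only.
train_common = len(train_graph["a"] & train_graph["c"])
print(train_edges, test_edges, train_common)
```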
Class Imbalance
Negative
Examples
Positive
Examples
There are significantly more negative examples than positive ones:
# negative examples = (# nodes)² - (# relationships) - (# nodes)
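One reading of this formula, assumed here, is that it counts ordered pairs of distinct nodes and treats each relationship as directed. Under that assumption, the sketch below checks the formula against a brute-force count on a tiny hypothetical graph.

```python
from itertools import permutations

# Hypothetical graph; relationships are treated as directed (assumption).
nodes = ["a", "b", "c", "d"]
relationships = {("a", "b"), ("b", "c"), ("c", "d")}

by_formula = len(nodes) ** 2 - len(relationships) - len(nodes)

# Every ordered pair of distinct nodes that is not already linked.
by_enumeration = sum(
    1 for pair in permutations(nodes, 2) if pair not in relationships
)

print(by_formula, by_enumeration)
```

Because the count grows with the square of the number of nodes, negatives quickly outnumber positives in any sparse graph.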
Class Imbalance
A model could reach very high accuracy simply by predicting that every pair of nodes is not linked.
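The sketch below shows both the problem and one remedy, downsampling the negative examples. The label counts and the random seed are made up for illustration.

```python
import random

# Hypothetical labels: 1 = linked, 0 = not linked (heavily imbalanced).
labels = [1] * 10 + [0] * 990

# Always predicting "not linked" is correct for every negative example.
baseline_accuracy = labels.count(0) / len(labels)

positives = [i for i, y in enumerate(labels) if y == 1]
negatives = [i for i, y in enumerate(labels) if y == 0]

rng = random.Random(42)  # fixed seed so the sample is repeatable
sampled_negatives = rng.sample(negatives, k=len(positives))

# Balanced set of row indices: all positives plus an equal number of negatives.
balanced = positives + sampled_negatives
print(baseline_accuracy, len(balanced))
```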
Our Link Prediction Workflow
Model Selection:
Random Forest
Ensemble
method
Picking a Classifier
Training Our Model
This is one decision tree in our
Random Forest used as a binary
classifier to learn how to classify a
pair: predicting either linked or not
linked.
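To illustrate how several trees combine into one binary decision, here is a toy sketch using a simple majority vote. The three "trees" are hand-written rules with made-up thresholds; a real random forest learns its trees from the training data, and its exact vote aggregation may differ.

```python
from collections import Counter


# Hand-written stand-ins for learned trees; thresholds are hypothetical.
def tree_1(row):
    return 1 if row["common_neighbors"] >= 2 else 0


def tree_2(row):
    return 1 if row["preferential_attachment"] >= 10 else 0


def tree_3(row):
    if row["common_neighbors"] >= 5:
        return 1
    return 1 if row["preferential_attachment"] >= 14 else 0


FOREST = [tree_1, tree_2, tree_3]


def predict(row):
    """Return the class (1 = linked, 0 = not linked) most trees voted for."""
    votes = Counter(tree(row) for tree in FOREST)
    return votes.most_common(1)[0][0]


print(predict({"common_neighbors": 4, "preferential_attachment": 15}))
print(predict({"common_neighbors": 1, "preferential_attachment": 1}))
```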
4 Models Trained
with Multiple Graph Features
Common Authors Model
Graph Features:
• Common Authors
“Graphy” Model
Graph Features:
• Preferential Attachment
• Total Neighbors
Triangles Model
Graph Features:
• Min & Max Triangles
• Min & Max Clustering Coefficient
Community Model
Graph Features:
• Label Propagation
• Louvain Modularity
Our Link Prediction Workflow
Precision,
Accuracy, Recall
ROC Curve &
AUC
Measures
Accuracy: Proportion of all predictions that are correct.
Beware of skewed data!
Precision: Proportion of positive predictions that
are correct.
Low score = more false positives
Recall / True Positive Rate: Proportion of actual positives that are
predicted as positive.
Low score = more false negatives
False Positive Rate: Proportion of actual negatives that are
incorrectly predicted as positive
ROC Curve & AUC: X-Y chart plotting the 2 metrics above
(TPR against FPR), with the area under the curve
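These measures can be computed directly from labels and predictions. The sketch below uses hypothetical labels, predictions and scores; the AUC function uses the rank interpretation (the chance that a random positive scores above a random negative), which is equivalent to the area under the ROC curve.

```python
def measures(labels, predictions):
    """Accuracy, precision, recall (TPR) and false positive rate."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical evaluation data; illustration only.
labels = [1, 1, 0, 0, 1, 0, 0, 0]
predictions = [1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.8, 0.2, 0.1, 0.5]

result = measures(labels, predictions)
area = auc(labels, scores)
print(result, area)
```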
Result: First Model ROC & AUC
Problematic False Positives!
Common Authors
Model 1
Result: All Models Common Authors
Model 1
Community
Model 4
Iteration & Tuning: Feature Influence
For feature importance, the Spark
random forest averages the
reduction in impurity across all
trees in the forest
Feature rankings are in comparison
to the group of features evaluated
Also try PageRank!
Try removing different features
(LabelPropagation)
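The averaging described above can be sketched as follows. The per-tree impurity reductions are made-up numbers, and normalizing each tree before averaging is an assumption about the details rather than something stated on the slide.

```python
# Hypothetical impurity reduction credited to each feature, per tree.
per_tree_gain = [
    {"commonAuthors": 6.0, "prefAttachment": 3.0, "totalNeighbors": 1.0},
    {"commonAuthors": 4.0, "prefAttachment": 4.0, "totalNeighbors": 2.0},
]


def normalize(values):
    """Scale the values so that they sum to 1."""
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


def forest_importance(trees):
    """Average the normalized per-tree importances, then renormalize."""
    per_tree = [normalize(tree) for tree in trees]
    features = per_tree[0].keys()
    averaged = {f: sum(t[f] for t in per_tree) / len(per_tree) for f in features}
    return normalize(averaged)


importance = forest_importance(per_tree_gain)
print(importance)
```

Because the scores sum to 1, each one only has meaning relative to the other features in the same model, which is why removing or adding a feature changes the rankings.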
Graph Machine Learning Workflow
Data aggregation
Create and store
graphs
Extract Data &
Store as Graph
Explore, Clean,
Modify
Prepare for
Machine Learning
Train
Models
Evaluate Results
Productionize
Identify
uninteresting
features
Cleanse (outliers+)
Feature
engineering/
extraction
Train / Test split
Resample for
meaningful
representation
(proportional, etc.)
Precision, accuracy,
recall
(ROC curve & AUC)
SME Review
Cross-validation
Model & variable
selection
Hyperparameter
tuning
Ensemble methods
Resources
• neo4j.com/sandbox
• neo4j.com/developer/
graph-algorithms/
• community.neo4j.com
Data & Code:
• This example from O’Reilly book
bit.ly/2FPgGVV (ML Folder)
Amy.Hodler@neo4j.com
@amyhodler
neo4j.com/
graph-algorithms-book
Q&A/Extra Stuff to delete
Connected Feature Extraction
Feature Extraction is how we change the shape or format of the data
to be usable in a machine learning pipeline. For example, from a graph, we
extract the relevant subset of the data into a tabular format for model
building.
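As a minimal sketch of this step, the Python below turns a small hypothetical graph into tabular CSV rows, one row per candidate pair of nodes. The column names are illustrative, not a required schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical undirected graph and candidate pairs; illustration only.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
candidate_pairs = [("a", "d"), ("b", "d")]

neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# One flat row per pair: graph structure becomes plain columns.
rows = [
    {
        "node1": u,
        "node2": v,
        "commonNeighbors": len(neighbors[u] & neighbors[v]),
        "prefAttachment": len(neighbors[u]) * len(neighbors[v]),
    }
    for u, v in candidate_pairs
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
table = buffer.getvalue()
print(table)
```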
Connected Feature Selection
Feature Selection is how we reduce the number of features used in a model
to a relevant subset. This can be done algorithmically or based on domain
expertise, but the objective is to maximize the predictive power of your
model while minimizing overfitting.
720+
7/10
12/2
5
8/10
53K+
100+
300+
450+
Adoption
Top Retail Firms
Top Financial Firms
Top Software Vendors
Customers Partners
• Creator of the Neo4j Graph Platform
• ~250 employees
• HQ in Silicon Valley, other offices include
London, Munich, Paris and Malmö Sweden
• $80M new funding led by Morgan Stanley &
One Peak. Total $160M from Fidelity,
Sunstone, Conor, Creandum, and
Greenbridge Capital
• 15M+ downloads & container pulls
• 325+ enterprise subscription customers
with over half with >$1B in revenue
Ecosystem
Startups in program
Enterprise customers
Partners
Meet up members
Events per year
Industry’s Largest Dedicated Investment in Graphs
Neo4j - The Graph Company
Helping The World To Make Sense of Data
ICIJ used Neo4j to uncover the
world’s largest journalistic leak to
date, The Panama Papers
NASA uses Neo4j for a “Lessons
Learned” database to improve
effectiveness in search missions in
space
Neo4j is used to graph the human
body, map correlations, identify cause
& effect and search for the cure for
cancer
SAVING DEMOCRACY
MISSION TO
MARS
CURING CANCER
Graph and ML Algorithms in Neo4j
• Parallel Breadth First Search & DFS
• Shortest Path
• Single-Source Shortest Path
• All Pairs Shortest Path
• Minimum Spanning Tree
• A* Shortest Path
• Yen’s K Shortest Path
• K-Spanning Tree (MST)
• Random Walk
• Degree Centrality
• Closeness Centrality
• CC Variations: Harmonic, Dangalchev,
Wasserman & Faust
• Betweenness Centrality
• Approximate Betweenness Centrality
• PageRank
• Personalized PageRank
• ArticleRank
• Eigenvector Centrality
• Triangle Count
• Clustering Coefficients
• Connected Components (Union Find)
• Strongly Connected Components
• Label Propagation
• Louvain Modularity – 1 Step & Multi-Step
• Balanced Triad (identification)
• Euclidean Distance
• Cosine Similarity
• Jaccard Similarity
• Overlap Similarity
• Pearson Similarity
Pathfinding
& Search
Centrality /
Importance
Community
Detection
Similarity
neo4j.com/docs/
graph-algorithms/current/
Updated April 2019
Link
Prediction
• Adamic Adar
• Common Neighbors
• Preferential Attachment
• Resource Allocations
• Same Community
• Total Neighbors
Conceive
Code
Compute
Store
Non-Native Graph DB
Native Graph DB
RDBMS
Optimized for graph workloads
Connectedness Differentiates Neo4j
Neo4j is an enterprise-grade native graph platform that enables you to:
• Store, reveal and query data relationships
• Traverse and analyze any levels of depth in real-time
• Add context and connect new data on the fly
Who We Are: Leader in Graph Innovations
• Performance
• ACID Transactions
• Schema-free Agility
• Graph Algorithms
Designed, built and tested natively
for graphs from the start for:
• Developer Productivity
• Hardware Efficiency
• Global Scale
• Graph Adoption
Graph
Transactions
Graph
Analytics
Data Integration
Development
& Admin
Analytics
Tooling
Drivers & APIs Discovery & Visualization
• Record “Cyber Monday” sales
• About 35M daily transactions
• Each transaction is 3-22 hops
• Queries executed in 4ms or less
• Replaced IBM Websphere commerce
• 300M pricing operations per day
• 10x transaction throughput on half the
hardware compared to Oracle
• Replaced Oracle database
• Large postal service with over 500k
employees
• Neo4j routes 7M+ packages daily at peak,
with peaks of 5,000+ routing operations per
second.
Handling Large Graph Work Loads for Enterprises
Real-time promotion
recommendations
Marriott’s Real-time
Pricing Engine
Handling Package
Routing in Real-Time
Recommendations Dynamic Pricing IoT-applications Fraud Detection
Real-Time Transaction Applications
Generate and
Protect Revenue
Customer
Engagement
Metadata and Advanced Analytics
Data Lake
Integration
Knowledge
Graphs for AI
Risk
Mitigation
Generate
Actionable Insights
Network
Management
Supply Chain
Efficiency
Identity and Access
Management
Internal Business Processes
Improve Efficiency
and Cut Costs
Graph Use Cases by Value Proposition
Software
Financial Services
Telecom
Retail & Consumer Goods
Media & Entertainment
Other Industries
Airbus
Copyright © 2017 Neo4j, Inc. Company Confidential
Graph
Transactions
Graph
Analytics
Data Integration
Development
& Admin
Analytics
Tooling
Drivers & APIs Discovery & Visualization
Developers
Admins
Applications Business Users
Data Analysts
Data Scientists
Enterprise Data Hub
Native Graph Platform: Tools for Many Users
Collections-Focused
Multi-Model, Documents, Columns
& Simple Tables, Joins
Neo4j is designed for data relationships
Different Paradigms
NoSQL
Relational
DBMS
Neo4j Graph
Platform
Connections-Focused
Focused on
Data Relationships
Development Benefits
Easy model maintenance
Easy query
Deployment Benefits
Ultra high performance
Minimal resource usage
How Neo4j Fits — Common Architecture Patterns
From Disparate Silos
To Cross-Silo Connections
From Tabular Data
To Connected Data
From Data Lake Analytics
to Real-Time Operations
Cypher: Powerful & Expressive Query Language
MATCH (:Person {name: "Dan"})-[:MARRIED_TO]->(spouse)
MARRIED_TO
Dan Ann
NODE RELATIONSHIP TYPE
LABEL PROPERTY VARIABLE
Neo4j Bloom
• High fidelity
• Scene navigation
• Property views
• Search suggestions
• Saved phrase history
• Property editor
• Schema perspectives
• Bloom chart type
• Visualize
• Communicate
• Discover
• Navigate
• Isolate
• Edit
• Share
Real-Time
Recommendations
Fraud
Detection
Network &
IT Operations
Master Data
Management
Knowledge
Graph
Identity & Access
Management
Common Graph Technology Use Cases
AirBnb
Graphs Drive Innovation
Context Paths
Auto-Graphs
Graph Layers
1st Graph
Cross-
Connect
Cross-tech applications
Internet of Things
operations
Transparent Neural
Networks
Blockchain-managed
systems
Adjacent graph layers
inspire new innovations
Metadata / Risk
Management
Knowledge Graphs
AI- Powered Customer
Experiences
Connect unlike objects
such as people to products,
locations
Mobile app explosion
Recommendation engines
Fraud detectors
Desire for more context to
follow connections
Connects like objects
People, computer
networks, telco, etc
Business Problem
• Find relationships between people, accounts, shell companies
and offshore accounts
• Journalists are non-technical
• Biggest “Snowden-Style” document leak ever; 11.5 million
documents, 2.6TB of data
Solution and Benefits
• Pulitzer Prize winning investigation resulted in robust
coverage of fraud and corruption
• PM of Iceland & Pakistan resigned, exposed Putin, Prime
Ministers, gangsters, celebrities (Messi)
• Led to assassination of journalist in Malta
Background
• International Consortium of Investigative Journalists (ICIJ),
small team of data journalists
• International investigative team specializing in cross-border
crime, corruption and accountability of power
• Works regularly with leaks and large datasets
ICIJ Panama Papers INVESTIGATIVE JOURNALISM
Fraud Detection / Knowledge Graph
Thomson Reuters Graph
• Data Fusion for Portfolio
Managers
• Graph layers
Background
• Personal shopping assistant
• Converses with buyer via text, picture and voice
to provide real-time recommendations
• Combines AI and natural language understanding
(NLU) in Neo4j Knowledge Graph
• First of many apps in eBay's AI Platform
Business Problem
• Improve personal context in online shopping
• Transform buyer-provided context into ideal
purchase recommendations over social platforms
• "Feels like talking to a friend"
Solution and Benefits
• 3 developers, 8M nodes, 20M relationships
• Needed high-performance traversals to respond
to live customer requests
• Easy to train new algorithms and grow model
• Generating revenue since launch
eBay for Google Assistant ONLINE RETAIL
Knowledge Graph powers Real-Time Recommendations
EE Customer since 2016 Q3
Background
• Over 7M citizens suffer from Diabetes
• Connecting over 400 researchers
• Incorporates over 50 databases, 100k’s of Excel
workbooks, 30 databases of biological samples
• Sought to examine disease from as many angles as
possible.
Business Problem
• Genes are connected by proteins or to metabolites,
and patients are connected with their diets, etc…
• Needed to improve the utilization of immensely
technical data
• Needed to cater to doctors and researchers with
simple navigation, communication and connections
of the graph.
Solution and Benefits
• Dr. Alexander Jarasch, Head of Bioinformatics and
Data Management
• Scientists can conduct parallel research without
asking the same questions or repeating tests
• Built views like a liver sample knowledge graph
DZD - German Center for Diabetes Research
Medical Genomic Research
EE Customer since 2016
Q4

Improving Machine Learning using Graph Algorithms

  • 1.
    Improve ML Predictionsusing Graph Algorithms Mark Needham, Neo4j Amy Hodler, Neo4j May 2019 #Neo4j #GraphAnalytics
  • 2.
    • Graphs forPredictions • Connected Features • Link Prediction • Neo4j + Spark Workflow Amy E. Hodler Graph Analytics & AI Program Manager, Neo4j Amy.Hodler@neo4j.com @amyhodler neo4j.com/ graph-algorithms-book Chapter 8: Graph + ML Spark & Neo4j Mark Needham Developer Relations Engineer, Neo4j Mark.needham@neo4j.com @markHneedham
  • 3.
    What in Commonis Predictive?
  • 4.
    Relationships: Strongest Predictors ofBehavior! “Increasingly we're learning that you can make better predictions about people by getting all the information from their friends and their friends’ friends than you can from the information you have about the person themselves” James Fowler David Burkus James Fowler Albert-Laszlo Barabasi
  • 5.
    Native Graph Platformsare Designed for Connected Data TRADITIONAL PLATFORMS BIG DATA TECHNOLOGY Store and retrieve data Aggregate and filter data Connections in data Real time storage & retrieval Real-Time Connected Insights Long running queries aggregation & filtering “Our Neo4j solution is literally thousands of times faster than the prior MySQL solution, with queries that require 10-100 times less code” Volker Pacher, Senior Developer Max # of hops ~3 Millions
  • 6.
    Graph Database Surgingin Popularity Trends since Jan 2013 DB-Engines.com
  • 8.
    Graph Data ScienceApplications
  • 9.
    • Current datascience models ignore network structure & complex relationships • Graphs add highly predictive features to existing ML models • Otherwise unattainable predictions based on relationships Novel & More Accurate Predictions with the Data You Already Have Machine Learning Pipeline
  • 11.
  • 12.
    Connection-related metrics aboutour graph, such as the number of relationships going into or out of nodes, a count of potential triangles, or neighbors in common. 14c What Are Connected Features?
  • 13.
    Query (e.g. Cypher) Real-time,local decisioning and pattern matching Graph Algorithms Libraries Global analysis and iterations You know what you’re looking for and making a decision You’re learning the overall structure of a network, updating data, and predicting Local Patterns Global Computation Deriving Connected Features
  • 14.
    Connected Feature Engineering FeatureEngineering is how we combine and process the data to create new, more meaningful features, such as clustering or connectivity metrics. Add More Descriptive Features: - Influence - Relationships - Communities Extraction
  • 15.
    17 Graph Feature Categories& Algorithms Pathfinding & Search Finds the optimal paths or evaluates route availability and quality Centrality / Importance Determines the importance of distinct nodes in the network Community Detection Detects group clustering or partition options Heuristic Link Prediction Estimates the likelihood of nodes forming a relationship Evaluates how alike nodes are Similarity Embeddings Learned representations of connectivity or topology
  • 16.
  • 17.
    19 Can we infernew interactions in the future? What unobserved facts we’re missing?
  • 18.
    + 50 yearsof biomedical data integrated in a knowledge graph Predicting new uses for drugs by using the graph structure to create features for link prediction Example: het.io
  • 19.
  • 20.
    Methods for LinkPrediction Algorithm Measures Run targeted algorithms and score outcomes Set a threshold value used to predict a link between nodes Machine Learning Use the measures as features to train an ML model Community Detection Link Prediction Similarity 1st Node 2nd Node Common Neighbors Preferential Attachment label 1 2 4 15 1 3 4 7 12 1 5 6 1 1 0 Centrality
  • 21.
  • 22.
    • Citation NetworkDataset - Research Dataset – “ArnetMiner: Extraction and Mining of Academic Social Networks”, by J. Tang et al – Used a subset with 52K papers, 80K authors, 140K author relationships and 29K citation relationships • Neo4j – Create a co-authorship graph and connected feature engineering • Spark and MLlib – Train and test our model using a random forest classifier 24 Predicting Collaboration with a Graph Enhanced ML Model
  • 23.
    Our Link PredictionWorkflow Import Data Create Co-Author Graph Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize
  • 25.
    Our Link PredictionWorkflow Import Data Create Co-Author Graph Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize Identified sparse feature areas Feature Engineering: New graphy features
  • 26.
    Graph Algorithms Usedfor Feature Engineering (few examples) Preferential Attachment measure the closeness of nodes based on shared neighbors Common Neighbors measures the number of possible neighbors (triadic closure) Illustration be.amazd.com/link-prediction/
  • 27.
    Graph Algorithms Usedfor Feature Engineering (few examples) Triangle counting and clustering coefficients measure the density of connections around nodes Louvain Modularity identifies interacting communities and hierarchies
  • 28.
    Our Link PredictionWorkflow Import Data Create Co-Author Graph Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize Identified sparse feature areas Feature Engineering: New graphy features Train / Test Split Resample: Downsampled for proportional representation
  • 29.
  • 30.
    32 Test/Train Split 1st Node 2nd Node Common Neighbors Preferential Attachment label 1 24 15 1 3 4 7 12 1 5 6 1 1 0 2 12 3 3 0 4 9 4 8 1 7 10 12 36 1 8 11 2 3 0
  • 31.
    33 Test/Train Split 1st Node 2nd Node Common Neighbors Preferential Attachment label 1 24 15 1 3 4 7 12 1 5 6 1 1 0 2 12 3 3 0 4 9 4 8 1 7 10 12 36 1 8 11 2 3 0 Train Test
  • 32.
    OMG I’m Good! DataLeakage! Graph metric computation for the train set touches data from the test set. Did you get really high accuracy on your first run without tuning?
  • 33.
    Train and TestGraphs: Time Based Split 1st Node 2nd Node Common Neighbors Preferential Attachment label 1 2 4 15 1 3 4 7 12 1 5 6 1 1 0 Train Test 1st Node 2nd Node Common Neighbors Preferential Attachment label 2 12 3 3 0 4 9 4 8 1 7 10 12 36 1 < 2006 >= 2006
  • 34.
    Train and TestGraphs: Time Based Split 1st Node 2nd Node Common Neighbors Preferential Attachment label 1 2 4 15 1 3 4 7 12 1 5 6 1 1 0 Train Test 1st Node 2nd Node Common Neighbors Preferential Attachment label 2 12 3 3 0 4 9 4 8 1 7 10 12 36 1
  • 35.
  • 36.
    There are significantlymore negative examples than positive ones: # negative examples = (# nodes)² - (# relationships) - (# nodes) 38 Class Imbalance
  • 37.
    A very highaccuracy model could predict that a pair of nodes are not linked. 39 Class Imbalance
  • 38.
  • 39.
    Our Link PredictionWorkflow Import Data Create Co-Author Graph Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize Identified sparse feature areas Feature Engineering: New graphy features Train / Test Split Resample: Downsampled for proportional representation Model Selection: Random Forest Ensemble method
  • 40.
  • 41.
    Training Our Model Thisis one decision tree in our Random Forest used as a binary classifier to learn how to classify a pair: predicting either linked or not linked.
  • 42.
    4 Models Trained withMultiple Graph Features Graph Features: • Common Authors “Graphy” Model Common Authors Model Triangles Model Community Model Graph Features: • Preferential Attachment • Total Neighbors Graph Features: • Min & Max Triangles • Min & Max Clustering Coefficient Graph Features: • Label Propagation • Louvain Modularity
  • 43.
    Our Link PredictionWorkflow Import Data Create Co-Author Graph Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize Identified sparse feature areas Feature Engineering: New graphy features Train / Test Split Resample: Downsampled for proportional representation Precision, Accuracy, Recall ROC Curve & AUC Model Selection: Random Forest Ensemble method
  • 44.
    Measures Accuracy Proportion oftotal correct predictions. Beware of skewed data! Precision Proportion of positive predictions that are correct. Low score = more false positives Recall / True Positive Rate Proportion of actual positives that are correct. Low score = more false negatives False Positive Rate Proportion of incorrect positives ROC Curve & AUC X-Y Chart mapping above 2 metrics (TPR and FPR) with area under curve
  • 45.
    Result: First ModelROC & AUC Problematic False Positives! Common Authors Model 1
  • 46.
    Result: All ModelsCommon Authors Model 1 Community Model 4
  • 47.
    Iteration & Tuning:Feature Influence For feature importance, the Spark random forest averages the reduction in impurity across all trees in the forest Feature rankings are in comparison to the group of features evaluated Also try PageRank! Try removing different features (LabelPropagation)
  • 48.
    Graph Machine LearningWorkflow Data aggregation Create and store graphs Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize Identify uninteresting features Cleanse (outliers+) Feature engineering/ extraction Train / Test split Resample for meaningful representation (proportional, etc.) Precision, accuracy, recall (ROC curve & AUC) SME Review Cross-validation Model & variable selection Hyperparameter tuning Ensemble methods
  • 49.
    Resources • neo4j.com/sandbox • neo4j.com/developer/ graph-algorithms/ •community.neo4j.com Data & Code: • This example from O’Reilly book bit.ly/2FPgGVV (ML Folder) Amy.Hodler@neo4j.com @amyhodler neo4j.com/ graph-algorithms-book
  • 50.
  • 51.
    53 Connected Feature Extraction FeatureExtraction is how when we change the shape or format of the data to be usable in a machine learning pipeline. For example, from a graph, we extract the relevant subset of the data into a tabular format for model building.
  • 52.
    Connected Feature Selection FeatureSelection is how we reduce the number of features used in a model to a relevant subset. This can be done algorithmically or based on domain expertise, but the objective is to maximize the predictive power of your model while minimizing overfitting.
  • 53.
    720+ 7/10 12/2 5 8/10 53K+ 100+ 300+ 450+ Adoption Top Retail Firms TopFinancial Firms Top Software Vendors Customers Partners • Creator of the Neo4j Graph Platform • ~250 employees • HQ in Silicon Valley, other offices include London, Munich, Paris and Malmö Sweden • $80M new funding led by Morgan Stanley & One Peak. Total $160M from Fidelity, Sunstone, Conor, Creandum, and Greenbridge Capital • Over 15M+ downloads & container pulls • 325+ enterprise subscription customers with over half with >$1B in revenue Ecosystem Startups in program Enterprise customers Partners Meet up members Events per year Industry’s Largest Dedicated Investment in Graphs Neo4j - The Graph Company
  • 54.
    Strictly ConfidentialStrictly Confidential 56 HelpingThe World To Make Sense of Data ICIJ used Neo4j to uncover the world’s largest journalistic leak to date, The Panama Papers NASA uses Neo4j for a “Lessons Learned” database to improve effectiveness in search missions in space Neo4j is used to graph the human body, map correlations, identify cause & effect and search for the cure for cancer SAVING DEMOCRACY MISSION TO MARS CURING CANCER
  • 55.
    Graph and MLAlgorithms in Neo4j • Parallel Breadth First Search & DFS • Shortest Path • Single-Source Shortest Path • All Pairs Shortest Path • Minimum Spanning Tree • A* Shortest Path • Yen’s K Shortest Path • K-Spanning Tree (MST) • Random Walk • Degree Centrality • Closeness Centrality • CC Variations: Harmonic, Dangalchev, Wasserman & Faust • Betweenness Centrality • Approximate Betweenness Centrality • PageRank • Personalized PageRank • ArticleRank • Eigenvector Centrality • Triangle Count • Clustering Coefficients • Connected Components (Union Find) • Strongly Connected Components • Label Propagation • Louvain Modularity – 1 Step & Multi-Step • Balanced Triad (identification) • Euclidean Distance • Cosine Similarity • Jaccard Similarity • Overlap Similarity • Pearson Similarity Pathfinding & Search Centrality / Importance Community Detection Similarity neo4j.com/docs/ graph-algorithms/current/ Updated April 2019 Link Prediction • Adamic Adar • Common Neighbors • Preferential Attachment • Resource Allocations • Same Community • Total Neighbors
  • 56.
    Conceive Code Compute Store Non-Native Graph DBNativeGraph DB RDBM S Optimized for graph workloads Connectedness Differentiates Neo4j
  • 57.
    Neo4j is anenterprise-grade native graph platform that enables you to: • Store, reveal and query data relationships • Traverse and analyze any levels of depth in real-time • Add context and connect new data on the fly 59 Who We Are: Leader in Graph Innovations • Performance • ACID Transactions • Schema-free Agility • Graph Algorithms Designed, built and tested natively for graphs from the start for: • Developer Productivity • Hardware Efficiency • Global Scale • Graph Adoption Graph Transactions Graph Analytics Data Integration Development & Admin Analytics Tooling Drivers & APIs Discovery & Visualization
  • 58.
    Handling Large Graph Workloads for Enterprises
    Real-time promotion recommendations: • Record “Cyber Monday” sales • About 35M daily transactions • Each transaction is 3-22 hops • Queries executed in 4ms or less • Replaced IBM WebSphere Commerce
    Marriott’s Real-time Pricing Engine: • 300M pricing operations per day • 10x transaction throughput on half the hardware compared to Oracle • Replaced Oracle database
    Handling Package Routing in Real-Time: • Large postal service with over 500k employees • Neo4j routes 7M+ packages daily at peak, with peaks of 5,000+ routing operations per second
  • 59.
    Graph Use Cases by Value Proposition: • Recommendations • Dynamic Pricing • IoT applications • Fraud Detection • Real-Time Transaction Applications • Generate and Protect Revenue • Customer Engagement • Metadata and Advanced Analytics • Data Lake Integration • Knowledge Graphs for AI • Risk Mitigation • Generate Actionable Insights • Network Management • Supply Chain Efficiency • Identity and Access Management • Internal Business Processes • Improve Efficiency and Cut Costs
  • 60.
    Customers by industry: • Software • Financial Services • Telecom • Retail & Consumer Goods • Media & Entertainment • Other Industries (Airbus). Copyright © 2017 Neo4j, Inc. Company Confidential
  • 61.
    Native Graph Platform: Tools for Many Users. Components: Graph Transactions, Graph Analytics, Data Integration, Development & Admin, Analytics Tooling, Drivers & APIs, Discovery & Visualization. Users: Developers, Admins, Applications, Business Users, Data Analysts, Data Scientists. Connects to: Enterprise Data Hub
  • 62.
    Different Paradigms: Neo4j is designed for data relationships. NoSQL, Relational DBMS, Neo4j Graph Platform. Collections-Focused: Multi-Model, Documents, Columns & Simple Tables, Joins. Connections-Focused: Focused on Data Relationships. Development Benefits: easy model maintenance, easy query. Deployment Benefits: ultra high performance, minimal resource usage
  • 63.
    How Neo4j Fits: Common Architecture Patterns • From Disparate Silos to Cross-Silo Connections • From Tabular Data to Connected Data • From Data Lake Analytics to Real-Time Operations
  • 64.
    Cypher: Powerful & Expressive Query Language. MATCH (:Person { name: "Dan" })-[:MARRIED_TO]->(spouse). Diagram: Dan, MARRIED_TO, Ann. Annotated parts of the pattern: NODE, RELATIONSHIP TYPE, LABEL, PROPERTY, VARIABLE
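    To make the parts of that pattern concrete, here is a rough Python sketch of what the MATCH clause asks for: start from nodes with the label Person and the property name = "Dan", follow outgoing MARRIED_TO relationships, and bind the node at the other end to the variable spouse. The in-memory data structure and the function name are hypothetical (including the assumption that Ann is also labeled Person), and a real graph database would use indexes rather than a linear scan.

```python
# Hypothetical in-memory property graph mirroring the slide's Dan/Ann diagram.
nodes = {
    1: {"labels": {"Person"}, "props": {"name": "Dan"}},
    2: {"labels": {"Person"}, "props": {"name": "Ann"}},
}
relationships = [
    (1, "MARRIED_TO", 2),  # (start node id, relationship type, end node id)
]

def match_spouses(name):
    """Mimic (:Person {name: name})-[:MARRIED_TO]->(spouse); return spouse nodes."""
    results = []
    for start, rel_type, end in relationships:
        start_node = nodes[start]
        if (rel_type == "MARRIED_TO"                      # relationship type
                and "Person" in start_node["labels"]      # label
                and start_node["props"].get("name") == name):  # property
            results.append(nodes[end])                    # bound to `spouse`
    return results

spouses = match_spouses("Dan")
print([s["props"]["name"] for s in spouses])
```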
  • 65.
    Neo4j Bloom • High fidelity • Scene navigation • Property views • Search suggestions • Saved phrase history • Property editor • Schema perspectives • Bloom chart type • Visualize • Communicate • Discover • Navigate • Isolate • Edit • Share
  • 66.
    Common Graph Technology Use Cases • Real-Time Recommendations • Fraud Detection • Network & IT Operations • Master Data Management • Knowledge Graph • Identity & Access Management. Example customer: Airbnb
  • 67.
    Graphs Drive Innovation. Stages: Context Paths, Auto-Graphs, Graph Layers, 1st Graph, Cross-Connect. • Cross-tech applications • Internet of Things operations • Transparent Neural Networks • Blockchain-managed systems • Adjacent graph layers inspire new innovations • Metadata / Risk Management • Knowledge Graphs • AI-Powered Customer Experiences • Connect unlike objects such as people to products, locations • Mobile app explosion • Recommendation engines • Fraud detectors • Desire for more context to follow connections • Connects like objects: people, computer networks, telco, etc.
  • 68.
    ICIJ Panama Papers: INVESTIGATIVE JOURNALISM, Fraud Detection / Knowledge Graph
    Background • International Consortium of Investigative Journalists (ICIJ), small team of data journalists • International investigative team specializing in cross-border crime, corruption and accountability of power • Works regularly with leaks and large datasets
    Business Problem • Find relationships between people, accounts, shell companies and offshore accounts • Journalists are non-technical • Biggest “Snowden-Style” document leak ever; 11.5 million documents, 2.6TB of data
    Solution and Benefits • Pulitzer Prize winning investigation resulted in robust coverage of fraud and corruption • PM of Iceland & Pakistan resigned, exposed Putin, Prime Ministers, gangsters, celebrities (Messi) • Led to assassination of journalist in Malta
  • 69.
    Thomson Reuters Graph • Data Fusion for Portfolio Managers • Graph layers
  • 70.
    eBay for Google Assistant: ONLINE RETAIL, Knowledge Graph powers Real-Time Recommendations. EE Customer since 2016 Q3
    Background • Personal shopping assistant • Converses with buyer via text, picture and voice to provide real-time recommendations • Combines AI and natural language understanding (NLU) in Neo4j Knowledge Graph • First of many apps in eBay's AI Platform
    Business Problem • Improve personal context in online shopping • Transform buyer-provided context into ideal purchase recommendations over social platforms • "Feels like talking to a friend"
    Solution and Benefits • 3 developers, 8M nodes, 20M relationships • Needed high-performance traversals to respond to live customer requests • Easy to train new algorithms and grow model • Generating revenue since launch
  • 71.
    DZD - German Center for Diabetes Research: Medical Genomic Research. EE Customer since 2016 Q4
    Background • Over 7M citizens suffer from diabetes • Connecting over 400 researchers • Incorporates over 50 databases, 100k’s of Excel workbooks, 30 databases of biological samples • Sought to examine disease from as many angles as possible
    Business Problem • Genes are connected by proteins or to metabolites, and patients are connected with their diets, etc. • Needed to improve the utilization of immensely technical data • Needed to cater to doctors and researchers with simple navigation, communication and connections of the graph
    Solution and Benefits • Dr. Alexander Jarasch, Head of Bioinformatics and Data Management • Scientists can conduct parallel research without asking the same questions or repeating tests • Built views like a liver sample knowledge graph