Amy Hodler, Neo4j
Improve ML Predictions using
Connected Features
#Neo4j
#GraphAnalytics
#UnifiedAnalytics #SparkAISummit
The Next 20 Minutes
• Graphs for Predictions
• Link Prediction
• Neo4j + Spark Workflow
Amy E. Hodler
Graph Analytics & AI Program Manager, Neo4j
Amy.Hodler@neo4j.com @amyhodler
neo4j.com/
graph-algorithms-book
Chapter 8: Link Prediction
Spark & Neo4j
What in Common is Predictive?
Relationships Are often
the Strongest Predictors of Behavior
“Increasingly we're learning that you can make
better predictions about people by getting all the
information from their friends and their friends’
friends than you can from the information you
have about the person themselves”
Graph Data Science Use Cases
ML
How Graph Technology is Changing AI
4:30 PM Room 2002
Connected Features
Features for ML:
Feature Extraction
Feature Extraction is how we change the shape or format of
the data to be usable in a machine learning pipeline. For example,
from a graph, we extract the relevant subset of the data into a
tabular format for model building.
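A minimal sketch of this extraction step, assuming a toy co-authorship graph held as an edge list. The author names and the row layout are made up for illustration and are not from the talk's dataset. Each pair of nodes becomes one row of a table.

```python
from itertools import combinations

# Hypothetical toy co-authorship edges (illustration only).
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]

# Build an undirected adjacency map from the edge list.
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# Flatten the graph into rows: one row per node pair, with a 0/1 label
# saying whether the pair is currently linked.
rows = [
    {"node1": a, "node2": b, "label": int(b in neighbors[a])}
    for a, b in combinations(sorted(neighbors), 2)
]

for row in rows:
    print(row)
```

The resulting list of rows is the tabular shape a model-building library expects; in practice the same idea would be applied with a graph query rather than in-memory Python.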
Features for ML:
Feature Engineering
Feature Engineering is how we combine and process the data to
create new, more meaningful features, such as clustering or
connectivity metrics.
Influence
Connectivity
Communities
Relationships
Features for ML:
Feature Selection
Feature Selection is how we reduce the number of features used
in a model to a relevant subset. This can be done algorithmically or
based on domain expertise, but the objective is to maximize the
predictive power of your model while minimizing overfitting.
Stop Throwing Away Data You Already Have
Decisions
$
Better Decisions
Machine Learning Pipeline Machine Learning Pipeline
Link Prediction
Can we infer which new interactions are likely to occur
in the future?
+ 50 years of biomedical
data integrated in a
knowledge graph
Predicting new uses for
drugs by using the graph
structure to create features
for link prediction
het.io
Link Prediction Methods
Algorithm Measures
Run targeted algorithms and score
outcomes
Set a threshold value used to
predict a link between nodes
Machine Learning
Use the measures as features to
train an ML model
Community
Detection
Link
Prediction
Similarity
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0
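The two methods can be sketched side by side using the three example rows from the slide. The threshold value of 2 is a hypothetical choice made only for illustration.

```python
# Rows copied from the slide's table:
# (1st node, 2nd node, common neighbors, preferential attachment, label)
rows = [
    (1, 2, 4, 15, 1),
    (3, 4, 7, 12, 1),
    (5, 6, 1, 1, 0),
]

THRESHOLD = 2  # hypothetical cut-off, chosen only for illustration


def predict_link(common_neighbors, threshold=THRESHOLD):
    """Algorithm-measure approach: predict a link when the score meets the threshold."""
    return int(common_neighbors >= threshold)


predictions = [predict_link(cn) for _, _, cn, _, _ in rows]

# Machine-learning approach: the same measures become a feature matrix X
# and the label column becomes the target y for a classifier.
X = [[cn, pa] for _, _, cn, pa, _ in rows]
y = [label for *_, label in rows]

print(predictions, X, y)
```

With a single measure, the threshold has to be picked by hand; with the ML approach, the classifier learns how to weigh several measures together.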
Example:
Predicting Collaboration
Predicting Collaboration with a
Graph Enhanced ML Model
• Citation Network Dataset - Research Dataset
– Used a subset with 52K papers, 80K authors, 140K author
relationships and 29K citation relationships
– “ArnetMiner: Extraction and Mining of Academic Social
Networks”, by J. Tang et al.
• Neo4j
– Create a co-authorship graph and connected feature engineering
• Spark and MLlib
– Train and test our model using a random forest classifier
Our Link Prediction Workflow
• Import Data
• Create Co-Author Graph
• Extract Data & Store as Graph
• Explore, Clean, Modify
• Prepare for Machine Learning
• Train Models
• Evaluate Results
• Productionize
• Identified sparse feature areas
• Feature Engineering: New graphy features
• Train / Test Split
• Resample: Downsampled for proportional representation
• Precision, Accuracy, Recall
• ROC Curve & AUC
• Model Selection: Random Forest Ensemble method
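The resampling step can be sketched as follows. This is one simple way to downsample, assuming we randomly drop rows from the larger class until both classes are the same size; the `downsample` helper, the row layout, and the seed are hypothetical and not taken from the talk.

```python
import random


def downsample(rows, label_key="label", seed=42):
    """Randomly drop majority-class rows until both classes are the same size."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return minority + rng.sample(majority, len(minority))


# Hypothetical pairs: far more unlinked pairs (label 0) than linked ones.
pairs = [{"pair": i, "label": 1} for i in range(3)] + [
    {"pair": i, "label": 0} for i in range(3, 20)
]

balanced = downsample(pairs)
print(len(balanced), sum(r["label"] for r in balanced))
```

In a link prediction setting the unlinked pairs usually far outnumber the linked ones, which is why the negative class is the one being reduced here.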
Graph Algorithms Used for
Feature Engineering (few examples)
Preferential Attachment multiplies the number of
neighbors each node in a pair has
Common Neighbors measures the number of
neighbors two nodes share (triadic closure)
Illustration from be.amazd.com/link-prediction/
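A minimal sketch of both measures in their standard form: common neighbors counts the neighbors two nodes share, and preferential attachment multiplies the two nodes' neighbor counts. The toy graph and the function names are assumptions for illustration, not the Neo4j library's API.

```python
# Hypothetical toy graph as an undirected adjacency map.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
    "e": set(),
}


def common_neighbors(g, u, v):
    """Number of neighbors that u and v share."""
    return len(g[u] & g[v])


def preferential_attachment(g, u, v):
    """Product of the two nodes' neighbor counts (degrees)."""
    return len(g[u]) * len(g[v])


print(common_neighbors(graph, "b", "d"))         # b and d share a and c
print(preferential_attachment(graph, "b", "d"))  # 2 neighbors * 2 neighbors
```

Both scores are computed per pair of nodes, so each one adds a column to the pair table used for training.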
Graph Algorithms Used for
Feature Engineering (few examples)
Triangle counting and clustering coefficients
measure the density of connections around nodes
Louvain Modularity identifies interacting
communities and hierarchies
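Triangle counting and the local clustering coefficient can be sketched on a toy graph as follows. The graph and function names are hypothetical; Louvain is left out because it does not fit in a few lines.

```python
from itertools import combinations

# Hypothetical toy graph as an undirected adjacency map.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}


def triangle_count(g, node):
    """Number of triangles that pass through `node`."""
    return sum(1 for u, v in combinations(g[node], 2) if v in g[u])


def clustering_coefficient(g, node):
    """Share of a node's neighbor pairs that are themselves connected."""
    degree = len(g[node])
    if degree < 2:
        return 0.0  # fewer than two neighbors: no pairs to connect
    possible = degree * (degree - 1) / 2
    return triangle_count(g, node) / possible


print(triangle_count(graph, "a"), clustering_coefficient(graph, "a"))
```

Unlike the pair-based measures, these are per-node scores, so a pair's features would typically combine the values of its two nodes (for example, the minimum and maximum).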
Training Our Model
This is one decision tree in
our Random Forest used as a
binary classifier to learn how
to classify a pair: predicting
either linked or not linked.
Did you get really high accuracy
on your first run without tuning?
OMG I’m Good!
Data Leakage!
We had to go back and use time-based
splits for train/test datasets
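A time-based split can be sketched like this, assuming each collaboration carries the year it first occurred. The pairs, the years, and the `SPLIT_YEAR` cut-off are hypothetical values for illustration.

```python
# Hypothetical co-authorship pairs with the year of first collaboration.
collaborations = [
    {"pair": ("ann", "bob"), "year": 2003},
    {"pair": ("bob", "cat"), "year": 2005},
    {"pair": ("ann", "cat"), "year": 2006},
    {"pair": ("cat", "dan"), "year": 2008},
]

SPLIT_YEAR = 2006  # hypothetical cut-off

# Train only on what was known before the cut-off; test on what came after.
train = [c for c in collaborations if c["year"] < SPLIT_YEAR]
test = [c for c in collaborations if c["year"] >= SPLIT_YEAR]

print(len(train), len(test))
```

The point of the split is that graph features for the training set are computed only from the earlier graph, so nothing about later collaborations leaks into training.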
Results: First Model
Results: First Model vs. Last Model
Feature Influence for Tuning
To compute feature
importance, the random forest
algorithm in Spark averages
the reduction in impurity
across all trees in the forest
Feature rankings are in
comparison to the group of
features evaluated
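The averaging idea can be sketched in plain Python. This is a simplified illustration, not Spark's implementation; the per-tree numbers and feature names are made up.

```python
# Hypothetical per-tree impurity reductions for three features (made-up numbers).
per_tree = [
    {"commonNeighbors": 0.6, "prefAttachment": 0.3, "triangles": 0.1},
    {"commonNeighbors": 0.4, "prefAttachment": 0.4, "triangles": 0.2},
]


def average_importance(trees):
    """Average each feature's impurity reduction over the trees, then normalize to sum to 1."""
    features = trees[0].keys()
    avg = {f: sum(t[f] for t in trees) / len(trees) for f in features}
    total = sum(avg.values())
    return {f: v / total for f, v in avg.items()}


importance = average_importance(per_tree)
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

Because the scores are normalized across the features in the model, adding or removing a feature changes every other feature's share, which is why the rankings only make sense relative to the group evaluated.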
Resources
Code/Repositories:
This example from O’Reilly book
bit.ly/2FPgGVV (ML Folder)
Python notebook:
github.com/AliciaFrame/
Public-Python-Notebooks
neo4j.com/
graph-algorithms-book
Chapter 8: Link Prediction
Extra for Q&A
Resources
Spark Community
• spark.apache.org/community.html
• users@spark.apache.org
Neo4j Community
• neo4j.com/developer/
• neo4j.com/developer/graph-algorithms/
• community.neo4j.com
Neo4j Invented the Labeled Property Graph Model
Nodes
• Can have Labels to classify nodes
• Labels have native indexes
Relationships
• Relate nodes by type and direction
Properties
• Attributes of Nodes & Relationships
• Stored as Name/Value pairs
• Can have indexes and composite indexes
• Visibility security by user/role
[Diagram: example labeled property graph]
• Node labels: PERSON, PERSON, CAR
• Relationship types: MARRIED TO, LIVES WITH, OWNS, DRIVES
• Example properties: name: “Dan”, born: May 29, 1970, twitter: “@dan”; name: “Ann”, born: Dec 5, 1975; since: Jan 10, 2011; brand: “Volvo”, model: “V70”; Latitude: 37.5629900°, Longitude: -122.3255300°
ML Model - Random Forest
neo4j.com/graph-algorithms-book
Free O’Reilly Book
Spark and Neo4j Examples
Chapter 8: Machine Learning
Visit the Neo4j Booth
