Improve ML Predictions using Connected Feature Extraction

Amy Hodler, Neo4j
Improve ML Predictions using
Connected Features
#Neo4j
#GraphAnalytics
#UnifiedAnalytics #SparkAISummit

The Next 20 Minutes
• Graphs for Predictions
• Link Prediction
• Neo4j + Spark Workflow
Amy E. Hodler
Graph Analytics & AI Program Manager, Neo4j
Amy.Hodler@neo4j.com @amyhodler
neo4j.com/graph-algorithms-book
Chapter 8: Link Prediction
Spark & Neo4j

Relationships Are Often
the Strongest Predictors of Behavior
“Increasingly we're learning that you can make
better predictions about people by getting all the
information from their friends and their friends’
friends than you can from the information you
have about the person themselves”

ML
How Graph Technology is Changing AI
4:30 PM Room 2002

Features for ML:
Feature Extraction
Feature Extraction is how we change the shape or format of
the data to be usable in a machine learning pipeline. For example,
from a graph, we extract the relevant subset of the data into a
tabular format for model building.
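
As a minimal sketch of this step, the Python below turns a small graph into a table with one row per candidate pair of nodes. The edge list, the author names, and the helper `extract_pair_table` are hypothetical and only illustrate the idea; they are not from the talk's dataset.

```python
from itertools import combinations

# Hypothetical co-authorship edges (illustration only).
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]

def extract_pair_table(edges):
    """Return one row per node pair: (node1, node2, label).

    label is 1 if the pair is linked in the graph, otherwise 0.
    """
    linked = {frozenset(edge) for edge in edges}
    nodes = sorted({node for edge in edges for node in edge})
    return [
        (a, b, int(frozenset((a, b)) in linked))
        for a, b in combinations(nodes, 2)
    ]

rows = extract_pair_table(edges)
for row in rows:
    print(row)
```

Each row can then carry additional feature columns before it is handed to a model.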

Features for ML:
Feature Engineering
Feature Engineering is how we combine and process the data to
create new, more meaningful features, such as clustering or
connectivity metrics.
Influence
Connectivity
Communities
Relationships

Features for ML:
Feature Selection
Feature Selection is how we reduce the number of features used
in a model to a relevant subset. This can be done algorithmically or
based on domain expertise, but the objective is to maximize the
predictive power of your model while minimizing overfitting.

Stop Throwing Away Data You Already Have
[Figure labels: Machine Learning Pipeline, Decisions, Better Decisions]

Can we infer which new interactions are likely to occur
in the future?

+ 50 years of biomedical
data integrated in a
knowledge graph
Predicting new uses for
drugs by using the graph
structure to create features
for link prediction
het.io


Link Prediction Methods
Algorithm Measures
• Run targeted algorithms and score outcomes
• Set a threshold value used to predict a link between nodes
Machine Learning
• Use the measures as features to train an ML model
[Figure labels: Community Detection, Link Prediction, Similarity]

1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0
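
The threshold approach can be sketched in a few lines of Python using the three example rows above. The threshold value and the helper `predict_link` are assumptions chosen for illustration; the talk does not state which threshold, if any, was used.

```python
# Rows from the example above:
# (1st node, 2nd node, common neighbors, preferential attachment, label)
pairs = [
    (1, 2, 4, 15, 1),
    (3, 4, 7, 12, 1),
    (5, 6, 1, 1, 0),
]

# Assumed threshold, picked only to illustrate the idea.
COMMON_NEIGHBORS_THRESHOLD = 2

def predict_link(common_neighbors, threshold=COMMON_NEIGHBORS_THRESHOLD):
    """Predict a link (1) when the score reaches the threshold, else 0."""
    return int(common_neighbors >= threshold)

predictions = [predict_link(row[2]) for row in pairs]
labels = [row[4] for row in pairs]
print(predictions, labels)
```

In the machine learning approach, the same score columns would instead be passed to a classifier as features, with the label column as the target.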

Example:
Predicting Collaboration

Predicting Collaboration with a
Graph Enhanced ML Model
• Citation Network Dataset - Research Dataset
– Used a subset with 52K papers, 80K authors, 140K author
relationships and 29K citation relationships
– “ArnetMiner: Extraction and Mining of Academic Social
Networks”, by J. Tang et al
• Neo4j
– Create a co-authorship graph and engineer connected features
• Spark and MLlib
– Train and test our model using a random forest classifier

Our Link Prediction Workflow
• Extract Data & Store as Graph
– Import Data
– Create Co-Author Graph
• Explore, Clean, Modify
– Identified sparse feature areas
– Feature Engineering: New graphy features
• Prepare for Machine Learning
– Train / Test Split
– Resample: Downsampled for proportional representation
• Train Models
– Model Selection: Random Forest
– Ensemble method
• Evaluate Results
– Precision, Accuracy, Recall
– ROC Curve & AUC
• Productionize
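
The resampling step can be sketched as follows, assuming that "downsampled" means randomly dropping rows from the larger class until both classes are the same size. The helper `downsample` and the example rows are hypothetical; the talk does not show its resampling code.

```python
import random

def downsample(rows, label_index=-1, seed=42):
    """Randomly shrink the larger class to the size of the smaller one."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_index] == 1]
    negatives = [r for r in rows if r[label_index] == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Hypothetical rows: (pair id, label). Unlinked pairs vastly outnumber
# linked pairs in most graphs, so negatives dominate here.
rows = [(i, 1) for i in range(3)] + [(i, 0) for i in range(3, 23)]
balanced = downsample(rows)
print(len(balanced), sum(label for _, label in balanced))
```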

Graph Algorithms Used for
Feature Engineering (few examples)
Preferential Attachment scores a pair of nodes by
multiplying their numbers of connections (degrees)
Common Neighbors measures the closeness of two nodes by
the number of neighbors they share (triadic closure)
Illustration from be.amazd.com/link-prediction/
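
A small pure-Python sketch of these two measures, using their standard definitions: common neighbors counts the neighbors two nodes share, and preferential attachment multiplies the two nodes' degrees. The example graph and function names are hypothetical; in the talk these scores come from Neo4j's graph algorithms, not from code like this.

```python
from collections import defaultdict

# Hypothetical undirected graph (illustration only).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "e")]

def build_neighbors(edges):
    """Map each node to the set of nodes it is connected to."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors

def common_neighbors(neighbors, u, v):
    """Number of neighbors shared by u and v."""
    return len(neighbors[u] & neighbors[v])

def preferential_attachment(neighbors, u, v):
    """Product of the degrees of u and v."""
    return len(neighbors[u]) * len(neighbors[v])

nbrs = build_neighbors(edges)
print(common_neighbors(nbrs, "a", "d"), preferential_attachment(nbrs, "a", "d"))
```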

Triangle counting and clustering coefficients
measure the density of connections around nodes
Louvain Modularity identifies interacting
communities and hierarchies
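
To make the density idea concrete, here is a minimal sketch of triangle counting and the local clustering coefficient for a single node: the number of links among a node's neighbors divided by the number of neighbor pairs. The graph and the helper name are hypothetical and stand in for the library procedures a real pipeline would call.

```python
from itertools import combinations

# Hypothetical undirected graph as an adjacency map (illustration only).
neighbors = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "e"},
    "d": {"b"},
    "e": {"c"},
}

def triangles_and_clustering(neighbors, node):
    """Return (triangle count, local clustering coefficient) for node."""
    around = neighbors[node]
    degree = len(around)
    # A triangle exists for every pair of neighbors that are themselves linked.
    triangles = sum(
        1 for u, v in combinations(sorted(around), 2) if v in neighbors[u]
    )
    possible = degree * (degree - 1) / 2
    coefficient = triangles / possible if possible else 0.0
    return triangles, coefficient

for node in sorted(neighbors):
    print(node, triangles_and_clustering(neighbors, node))
```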

Training Our Model
This is one decision tree in
our Random Forest used as a
binary classifier to learn how
to classify a pair: predicting
either linked or not linked.

Did you get really high accuracy on your first run without tuning?
OMG I’m Good!
Data Leakage!
We had to go back and use time-based splits for train/test datasets
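
A time-based split can be sketched as below: relationships formed before a cutoff go to training and later ones go to testing, so the model never sees information from the period it is asked to predict. The records, the `year` field, and the cutoff value are hypothetical and chosen only for illustration.

```python
def time_based_split(records, cutoff_year):
    """Split records into (train, test) by year instead of at random."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test

# Hypothetical co-authorship records with the year of first collaboration.
records = [
    {"pair": ("ann", "bob"), "year": 2003},
    {"pair": ("bob", "cat"), "year": 2005},
    {"pair": ("ann", "cat"), "year": 2007},
    {"pair": ("cat", "dan"), "year": 2009},
]

train, test = time_based_split(records, cutoff_year=2006)
print(len(train), len(test))
```

Graph features for the training rows would likewise be computed only from the pre-cutoff graph; a random split lets later links leak into those features.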

Feature Influence for Tuning
To compute feature
importance, the random forest
algorithm in Spark averages
the reduction in impurity
across all trees in the forest
Feature rankings are relative to the
group of features evaluated
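
The averaging idea can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's implementation: each tree is assumed to report a total impurity reduction per feature, each tree's values are normalized to sum to 1, and the results are averaged and normalized again. The numbers and feature names are made up.

```python
# Hypothetical feature names and per-tree impurity reductions.
feature_names = ["commonNeighbors", "prefAttachment", "totalNeighbors"]
per_tree_reduction = [
    [3.0, 1.0, 0.0],  # tree 1
    [1.0, 1.0, 2.0],  # tree 2
]

def average_importances(per_tree):
    """Average normalized per-tree importances, then renormalize to sum to 1."""
    totals = [0.0] * len(per_tree[0])
    for tree in per_tree:
        tree_sum = sum(tree)
        if tree_sum == 0:
            continue  # a tree with no splits contributes nothing
        for j, reduction in enumerate(tree):
            totals[j] += reduction / tree_sum
    grand_total = sum(totals)
    return [t / grand_total for t in totals] if grand_total else totals

importances = average_importances(per_tree_reduction)
for name, value in zip(feature_names, importances):
    print(name, round(value, 3))
```

Because the result sums to 1, adding or removing a feature changes every other feature's score, which is why rankings only make sense within the evaluated group.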

Resources
Code/Repositories:
This example from O’Reilly book
bit.ly/2FPgGVV (ML Folder)
Python notebook:
github.com/AliciaFrame/Public-Python-Notebooks


Extra for Q&A

Resources
Spark Community
• spark.apache.org/community.html
• users@spark.apache.org
Neo4j Community
• neo4j.com/developer/
• neo4j.com/developer/graph-algorithms/
• community.neo4j.com

Neo4j Invented the Labeled Property Graph Model
Nodes
• Can have Labels to classify nodes
• Labels have native indexes
Relationships
• Relate nodes by type and direction
Properties
• Attributes of Nodes & Relationships
• Stored as Name/Value pairs
• Can have indexes and composite indexes
• Visibility security by user/role
[Figure: example property graph. Node labels: PERSON, PERSON, CAR. Relationship types: MARRIED TO, LIVES WITH, DRIVES, OWNS. Example properties: name: “Dan”, born: May 29, 1970, twitter: “@dan”; name: “Ann”, born: Dec 5, 1975; since: Jan 10, 2011; brand: “Volvo”, model: “V70”; Latitude: 37.5629900°, Longitude: -122.3255300°]

neo4j.com/graph-algorithms-book
Free O’Reilly Book
Spark and Neo4j Examples
Chapter 8: Machine Learning
Visit the Neo4j Booth
