Knowledge Graph for Machine Learning and Data Science

Thomas Cook
Sales Director, AnzoGraph DB
e: thomas.cook@cambridgesemantics.com
w: www.anzograph.com
Knowledge Graphs for Machine Learning
and Data Science
#DCAF 2020
Feb 6, 2020

Data Continues to Grow
AI and ML Adoption Grows for Better & Faster Insights
Need for:
– Automated Data Preparation & Better Understanding
– Explainable AI & ML with Provenance
– Improved Algorithms & Analytics
– Cost Efficient Operations
Context
Knowledge Graphs & Graph Analytics

Knowledge Graphs to Automate
Data Preparation & Improve Common Understanding

©2019 Cambridge Semantics Inc. All rights reserved.
The Data Preparation Problem
Data Access
● Manual ETL coding
● Practicalities limit the # of sources and types of data
Data Processing
● Laborious discovery, profiling and selection
● Use of rules and coding for harmonization & cleansing
Feature Engineering
● Manual coding to transform data
● Manual feature engineering & selection
70-80% of time is spent in Data Preparation & Feature Engineering
Viewed as the “least enjoyable” part of work by 76% of data scientists [1]
[1] Cleaning Big Data, Forbes Magazine
Structured Data

Anzo - The Modern Data Discovery and Integration Layer for the Enterprise Data Fabric
Enterprise Data Sources
Ingest
ON-BOARD: Ingest & Map
• Automated ETL
• Collaborative Mapping
• Metadata Capture
MODEL: Graph Data Model
• Lift Data into Data Fabric
• Design Ontologies
• Connect Data Models
BLEND: GraphMarts
• Combine and Align Related Data Sets
• In-memory MPP OLAP Query Engine
• Data Layers
ACCESS: Hi-Res Analytics
• Analyze All Data Together
• Fast, Iterative Queries (Ad Hoc, What if)
• Code Free or API
Machine Learning and AI
Enterprise Search
“Last Mile” Analytics Tools
Graphical Application Interface
Metadata Catalog: Semantic-based Metadata Management, Governance and Lineage
Data Storage Layer: Cloud or On-Prem Data Storage Infrastructure
Automated Deployment and Operations
Storage and Compute Integration

Automated Data Ingestion & Cataloging
Unstructured Data: Notes, Docs, Emails, Articles
Structured Data: Relational, CSV, HDFS, External Data Feeds
Ingest
Catalog
NLP, Text Analytics, Sentiment Analysis
Data Catalog
Semantic Layer
Data Harmonization – Structured or Unstructured
• Harmonize Many Data Sources
• Automated Unstructured Data Extraction & Categorization
Data Wrangling Capability
• Profile, Rules… Manage & Clean incoming data
• Set up Re-usable Data Wrangling Jobs
• Provenance to Manage Data
Data Catalog
• Explore, Secure & Manage Dataset Assets
Result: Cleaner, quality data, faster & from many more sources

A big web of data understandable at the data level

Allow for Easy Understanding & Handling of Data
Raw Data
Rules to Link & Conform Data
Create Data Layers
Build Graph Marts
Business Ready Datasets

Expedite & Optimize Feature Engineering
Use visual interface with no coding for feature selection
• Query Knowledge Graph and generate features
• Conduct data transformation using a library of functions
• Compute new derived features
• De-normalize data
• Aggregate ranges
• Convert numeric values to alpha values
• Pivot values
and much more!
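The transformations listed above can be illustrated outside the visual interface. The following is a minimal sketch in plain Python, not Anzo's implementation: the rows, the `spend_band` thresholds, and the function names are all hypothetical, and the rows stand in for the result of a knowledge graph query.

```python
from collections import defaultdict

# Hypothetical long-format rows, e.g. the result of a knowledge graph query.
rows = [
    {"customer": "c1", "month": "Jan", "spend": 120.0},
    {"customer": "c1", "month": "Feb", "spend": 80.0},
    {"customer": "c2", "month": "Jan", "spend": 15.0},
]


def spend_band(amount):
    """Convert a numeric value to an alpha label (illustrative thresholds)."""
    if amount < 50:
        return "low"
    if amount < 100:
        return "medium"
    return "high"


def pivot(records, index, column, value):
    """Pivot long-format records into one dict of column values per index."""
    table = defaultdict(dict)
    for record in records:
        table[record[index]][record[column]] = record[value]
    return dict(table)


def derive_features(records):
    """Compute derived per-customer features: total spend and its band."""
    totals = defaultdict(float)
    for record in records:
        totals[record["customer"]] += record["spend"]
    return {
        customer: {"total_spend": total, "band": spend_band(total)}
        for customer, total in totals.items()
    }


pivoted = pivot(rows, "customer", "month", "spend")
features = derive_features(rows)
```

Here `pivoted` holds one entry per customer with a value per month, and `features` holds the derived total and its range label, which is the shape a downstream ML tool typically expects.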

Operationalizing of Machine Learning Models
Explainable Insights
Manage Data Sets with Provenance & Data Lineage
• Anzo retains end-to-end data lineage
• Track transformations
Easy to Export Data to ML & Data Science Tools
• Use OData/REST APIs, SQL ODBC/JDBC
• Export to R, Python, downstream systems, …
Deploy & Continuously Improve Model Performance
• Set up deployment pipelines with learnings to help in feature selection
• Horizontally scale runtime environment
• Can be auto-deployed behind the firewall or on the Cloud

Using Knowledge Graphs with
Graph Analytics Database as
Scalable Infrastructure for ML & Data Science

“Graph analytics will grow in the next few years
due to the need to ask complex questions across
complex data, which is not always practical or
even possible at scale using SQL queries”
…Gartner – Top 10 Data and Analytics Technology Trends for 2019

AnzoGraph™ DB
What it is:
● Fast, Scalable Graph Database
○ In-Memory Massively Parallel Processing (MPP) ACID-Compliant Graph Database
○ Supports RDF & Labelled Property Graphs
What it does:
○ Fast Data Loading
○ Fast Query
○ Rich Analytics
■ Graph Algorithms
■ BI/DW Analytics
■ Inferencing
■ Data Science/Feature Engineering Algorithms
■ Define-Your-Own Analytics
○ Linear Database Scaling
○ Persist data on cheap storage
Based on Open Standards
• Built on RDF & SPARQL 1.1 standards
• LPG with RDF*/SPARQL*
• LPG with Cypher (in 2020)
Deploy on-prem or cloud
• Kubernetes/Helm on-demand cloud deployment
• AWS, Google and Azure
Awards
Select Customers

Benchmarks
217X: AnzoGraph DB compared to Neo4j on industry-standard TPC-H & Graph 500 benchmarks
113X: AnzoGraph’s LUBM benchmark performance over previous fastest result
30X: AnzoGraph’s performance on graph algorithms over Spark SQL and Spark with GraphFrames

Graph OLAP Built for Analytics at Scale and Speed
SQL OLAP vs Graph OLAP
SQL OLAP
On-line Analytics at Massively Parallel Processing (MPP) Scale with SQL Database
Example: Netezza, Amazon Redshift
Analytics
• Warehouse-Style BI Analytics
Graph OLAP
On-line Analytics at Massively Parallel Processing (MPP) Scale with Native Graph Database
Example: AnzoGraph DB
Analytics
• Warehouse-Style BI Analytics
• Graph Algorithms
• Inferencing
• Data Science Functions

Graph OLAP Built for Analytics at Scale and Speed
Graph OLTP vs Graph OLAP
Graph OLTP
Transactional databases
• Built for building transactional applications & individual transactions
• Scales vertically
Example: Neo4j, AWS Neptune
Graph OLAP
Analytical databases
• Built for analytics and to deal with scale & performance
• Deep Link analysis
• Analytics on the population
• Scales horizontally
• Can complement Graph OLTP systems
Example: AnzoGraph DB

Labelled Property Graphs Facilitate Analytics
[Figure: example labelled property graph. Person: Jack (isA: <Man>, birthday: 09/17/1975) and Person: Jill (isA: <Woman>, birthday: 4/23/1979) are linked by friendOf (metAt=<TheHill>, metDate=07/04/2018). Each is linked by WentUp (Date=07/04/2018) to Place: The Hill (isA: <Place>, has: Water, has: Trees, partOf: <TheMountain>).]
Today with RDF* and SPARQL*
• Relationships can be described as
clearly as any LPG database
RDF*/SPARQL* extensions to the
standard make W3C open standards
databases even more capable

Algorithms and Analytical Capabilities
SPARQL 1.1 Standards: Graph Patterns, Negation, Property Paths, BIND, Aggregates, Basic Federated Query, ORDER BY and offsets, Functions on Strings, Functions on Numerics, Functions on Dates and Times, Hash Functions, Basic Graph Patterns, Count/Avg, Min/Max, GroupConcat, Sample
Graph Algorithms and Inferencing: Page Rank, Shortest Path, All Path, Label Propagation, Weakly Connected Components, K neighborhood, Counting Triangles, Inferences (RDFS+), Labeled Property Graphs (RDF*)
AnzoGraph® DB Extras: Window Aggregates, Advanced Grouping Sets, Named Views, Named Queries, Conditional Expressions, User-Defined Extensions
Data Science Extensions (UDX)
Distributions
● Bernoulli
● Binomial
● Chi-squared
● Exponential
● Hypergeometric
● Laplace
● Log Normal
● Logarithmic Series
● Negative Binomial
● Normal
Correlations
● Pearson
Entropy
● Cross Entropy
● Differential Entropy

User-defined Extensions (UDXs):
Allows users to extend AnzoGraph DB functionality for custom usage
User-Defined Functions (UDF): Create and register custom analytic functions, such as functions that concatenate values or convert integers to alternate currencies.
User-Defined Aggregates (UDA): Create and register aggregate functions, such as functions that compute the arithmetic mean or calculate the average number from a list of maximum and minimum values.
User-Defined Services (UDS): Create and register services that create local SPARQL endpoints.
User-Defined Tables (UDT): Create and register a function that is repeatedly invoked within a query to generate the rows of a table on-the-fly.
Data Science Functions
User-defined Functions (UDX)
Functions you can build in Java or C++

Execute Supervised & Unsupervised ML with Graph Algorithms
Graph Algorithm
• PageRank
• Shortest Path
• K-neighbors
• All Paths
• Counting Triangles
• Weakly Connected
Components
• Label Propagation
• Triangle Enumeration
• Triangle Counting
• Clustering Coefficient
and more!
Who is the most influential person in your customer list?
What’s the most important item relating to a search of your knowledge graph?
What is the shortest path to your destination across a route?
What’s the optimal path for packets to travel across your network?
source: Wikipedia
Try functions via SPARQL in Zeppelin or Python Jupyter Notebooks

Graph Algorithms produce additional Features to train ML Models
Graph Algorithms
source: Wikipedia
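To make the idea of a graph algorithm producing a feature concrete, here is a minimal sketch of PageRank by power iteration in plain Python. It is not AnzoGraph's implementation. The edge list is made up for illustration, and the damping factor and iteration count are conventional defaults chosen here as assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A node without outgoing edges spreads its rank over all nodes.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Made-up route network: three airports all fly into ORD.
edges = [("BOS", "ORD"), ("JFK", "ORD"), ("AUS", "ORD"), ("ORD", "BOS")]
scores = pagerank(edges)
top_airport = max(scores, key=scores.get)
```

Each node's score in `scores` can then be joined back to the node as one more numeric column in a training set.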

Execute Inferencing using RDFS+ and OWL 2 RL
[Diagram: Person: Jack "Is Married" Person: Jill, with an inferred reverse "Is Married" edge; Person: Jack "Knows" Person: Sam, with an inferred reverse "Knows" edge]
AnzoGraph allows you to insert inferred triples into the specified target graph.
If Jack is married to Jill, then you can definitely infer that Jill is married to Jack.
Jack knows Sam, but Sam may not know Jack. Here, the inference is less clear.
Both cases are supported.
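The married/knows distinction comes down to which predicates are declared symmetric. The sketch below shows that rule in plain Python over tuples. It is an illustration only, not how AnzoGraph performs RDFS+ or OWL 2 RL inferencing, and the predicate names `isMarriedTo` and `knows` are hypothetical.

```python
def infer_symmetric(triples, symmetric_predicates):
    """Return triples implied by symmetric predicates but not yet stated."""
    existing = set(triples)
    inferred = set()
    for subject, predicate, obj in existing:
        if predicate in symmetric_predicates:
            mirrored = (obj, predicate, subject)
            if mirrored not in existing:
                inferred.add(mirrored)
    return inferred


source_graph = {
    ("Jack", "isMarriedTo", "Jill"),
    ("Jack", "knows", "Sam"),
}

# Only "isMarriedTo" is declared symmetric, so only that edge is mirrored.
target_graph = infer_symmetric(source_graph, {"isMarriedTo"})

# Declaring "knows" symmetric as well is a modelling choice, not a given.
permissive_graph = infer_symmetric(source_graph, {"isMarriedTo", "knows"})
```

Keeping the inferred triples in a separate target set mirrors the slide's point that inferred triples are inserted into a target graph of your choosing.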

ELT for Data Engineering
• Wrangling, Blending, Munging, Transformations, Enrichment, Views
• Use statistical functions, transformations or enrichment to get the data into the form needed for the downstream ML pipeline
# Build a <fullname> property in a new graph from first and last names
INSERT {
  GRAPH <myNewGraph> {
    ?s a <Person> ;
       <fullname> ?fullname .
  }
}
USING <myOldGraph>
WHERE {
  ?s a <Person> ;
     <firstname> ?fname ;
     <lastname> ?lname .
  BIND(CONCAT(?fname, " ", ?lname) AS ?fullname)
}

Materialized Views – good for heavy calculations - perform once - use many times
CREATE MATERIALIZED VIEW <ages> AS
CONSTRUCT { ?person <age> ?age . }
WHERE {
  GRAPH <tickit> {
    { SELECT ?person ((YEAR(?date))-(YEAR(xsd:dateTime(?birthdate))) AS ?age)
      WHERE {
        ?person <birthday> ?birthdate .
        BIND(xsd:dateTime(NOW()) AS ?date)
      }
    }
  }
}

• Enrichment
Add new features from a federated call, using a SERVICE call to the Linked Open Data Cloud or other internal SPARQL endpoints
Example:
Look up address and geocodes for a company, census population data, crime rate, demographics, etc. All of these can be new features to feed into the ML pipeline

ML Step #1: Data Prep: Data Discovery and Feature Engineering
2.1 Bernoulli Distribution: Determines the probability of success or failure (yes or no) in a single trial.
2.2 Binomial Distribution: Determines the probability of a given number of successes in a fixed number of independent trials.
2.3 Chi-squared Distribution: Used to test the relationship between two categorical variables.
2.4 Exponential Distribution: Determines the probability of an event occurring within a time interval when the number of past events is unknown.
2.5 Hypergeometric Distribution: Determines the probability of a given number of successes when sampling without replacement.
2.6 Laplace Distribution: Determines the probability of intervals.
2.7 Log Normal Distribution: Models quantities such as the change in price of a stock or commodity position.
2.8 Logarithmic Series Distribution: Determines the probability of occurrence of events such as claim frequencies in insurance companies.
2.9 Negative Binomial Distribution: Determines the probability of a given number of failures before a specified number of successes.
2.10 Normal Distribution: Models and determines probabilities for many kinds of natural and social data.
2.11 Poisson Distribution: Determines the probability that a certain number of events will occur in a specific time period.
2.12 Skellam Distribution: Determines the probability of the difference between two independent Poisson-distributed variables.
2.13 Beta-binomial Distribution: Models the number of successes in n binomial trials when the probability of success p is a Beta random variable.
2.14 Continuous Uniform Distribution: Assigns equal probability to all values between its minimum and maximum.
2.15 Discrete Uniform Distribution: Determines the probability of a finite number of outcomes that are equally likely to happen.
2.16 Student’s t-Distribution: Determines probabilities when the sample size is small.
2.17 Weibull Distribution: Used to assess product reliability, analyse life data and model failure times.
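As a worked illustration of three of the distributions above, the sketch below computes Bernoulli, binomial, and Poisson probabilities from their textbook formulas in plain Python. The function names are hypothetical and are not the names of the AnzoGraph data science functions.

```python
from math import comb, exp, factorial


def bernoulli_pmf(outcome, p):
    """P(outcome) for a single trial with success probability p."""
    return p if outcome == 1 else 1.0 - p


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(exactly k events in a period with the given average rate)."""
    return exp(-rate) * rate**k / factorial(k)


# Chance of exactly 2 heads in 4 fair coin flips.
two_heads = binomial_pmf(2, 4, 0.5)

# Chance of no events in a period that averages 2 events.
no_events = poisson_pmf(0, 2.0)
```

A binomial with a single trial reduces to the Bernoulli case, which is a quick sanity check on both functions.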

ML Step #1: Data Discovery and Feature Engineering
Correlations
3.1 Pearson Correlation Coefficient: Determines whether two variables have a positive, negative or no linear relationship.
3.2 Matthews Correlation Coefficient: Determines whether two binary variables (0 & 1) have a positive, negative or no relationship.
3.3 Spearman’s Rank Correlation Coefficient: Measures the strength of a monotonic relationship between paired data.
Feature Exploration
5.1 Principal Component Analysis: Reduces the dimensionality of large data sets, which helps when building predictive models.
Profiling Metrics
6.1 Geometric Mean: Determines average growth rates.
6.2 Skewness Metric: Calculates Pearson’s coefficient of skewness on numeric values.
6.3 T-Digest Metric: Estimates percentile and quantile values accurately.
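The difference between the Pearson and Spearman coefficients is easiest to see on data that rises steadily but not in a straight line. The sketch below implements both from their standard definitions in plain Python. It is illustrative only, the sample data is made up, and the ranking helper assumes there are no tied values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


def ranks(values):
    """Assign 1-based ranks; assumes no tied values for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data: y grows with x, but quadratically rather than linearly.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
linear_strength = pearson(xs, ys)
monotonic_strength = spearman(xs, ys)
```

On this data the Spearman value is a perfect 1 because the ordering is preserved, while the Pearson value is slightly lower because the points do not lie on a line.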

RDF*/SPARQL*
Real World Example: Airline Delay Data Analysis

https://www.transtats.bts.gov/ot_delay/OT_DelayCause1.asp?pn=1
Public Flight Delay Data Analysis

Input CSV – 32 Columns - 5,819,080 records
YEAR
MONTH
DAY
DAY_OF_WEEK
AIRLINE
FLIGHT_NUMBER
TAIL_NUMBER
ORIGIN_AIRPORT
DESTINATION_AIRPORT
SCHEDULED_DEPARTURE
DEPARTURE_TIME
DEPARTURE_DELAY
TAXI_OUT
WHEELS_OFF
SCHEDULED_TIME
ELAPSED_TIME
ELAPSED_TIME
AIR_TIME
DISTANCE
WHEELS_ON
TAXI_IN
SCHEDULED_ARRIVAL
ARRIVAL_TIME
ARRIVAL_DELAY
DIVERTED
CANCELLED
CANCELLATION_REASON
AIR_SYSTEM_DELAY
SECURITY_DELAY
AIRLINE_DELAY
LATE_AIRCRAFT_DELAY
WEATHER_DELAY

Conversion from CSV to Graph – Defining Triples
[Diagram: Flight and Airport nodes connected by FlightDeparture, FlightArrival and DESTINATION edges]

Conversion from CSV to Graph
[Diagram: a Flight node linked to two Airport nodes by FlightDeparture and FlightArrival; the two Airport nodes are linked by DESTINATION]

Nodes have types and properties
Flight
YEAR
MONTH
DAY
DAY_OF_WEEK
AIRLINE
FLIGHT_NUMBER
TAIL_NUMBER
ORIGIN_AIRPORT
DESTINATION_AIRPORT
….
Node Type: Flight
Node Properties: Airline, Flight Number, Tail Number, etc.
*Note: Types can also be called Labels, as in Labeled Property Graphs or LPG

With RDF* edges can also have properties
Airport (AIRPORT_CODE = 'BOS') -[DESTINATION, DISTANCE = 187]- Airport (AIRPORT_CODE = 'JFK')
Edge Property: DISTANCE

Loading CSV with TABLE expression
...
TABLE <s3://csi-notebook-datasets/Flight_Dataset/flights10k.csv>
('ContentType'='text/CSV','Schema'=',H,YEAR:int,MONTH:int,DAY:int,DAY_OF_WEEK:int,AIRLINE:char,FLIGHT_NUMBER:char,TAIL_NUMBER:char,ORIGIN_AIRPORT:char,DESTINATION_AIRPORT:char,SCHEDULED_DEPARTURE:char,DEPARTURE_TIME:char,DEPARTURE_DELAY:int,TAXI_OUT:int,WHEELS_OFF:char,SCHEDULED_TIME:int,ELAPSED_TIME:int,AIR_TIME:int,DISTANCE:int,WHEELS_ON:char,TAXI_IN:int,SCHEDULED_ARRIVAL:char,ARRIVAL_TIME:char,ARRIVAL_DELAY:int,DIVERTED:int,CANCELLED:int,CANCELLATION_REASON:char,AIR_SYSTEM_DELAY:int,SECURITY_DELAY:int,AIRLINE_DELAY:int,LATE_AIRCRAFT_DELAY:int,WEATHER_DELAY:int')
The TABLE IRI: Specify CSV file name and location
The 'Schema' option: CSV Column names and Data Types

Conversion from CSV to RDF* Triples via SPARQL
INSERT { GRAPH <airline_flight_network> {
  # Node Type: Airport
  ?OriginIRI a <Airport> ;
    <AIRPORT_CODE> ?ORIGIN_AIRPORT .
  # Edge Property: Distance
  << ?OriginIRI <DESTINATION> ?DestinationIRI >> <DISTANCE> ?DISTANCE .
  # Node Type: Airport
  ?DestinationIRI a <Airport> ;
    <AIRPORT_CODE> ?DESTINATION_AIRPORT .
  << ?DestinationIRI <DESTINATION> ?OriginIRI >> <DISTANCE> ?DISTANCE .
  # Node Type: Flight; Node Properties: Flight Number, Tail Number, etc.
  ?FlightIRI a <Flight> ;
    <YEAR> ?YEAR ;
    <MONTH> ?MONTH ;
    <DAY> ?DAY ;
    <DAY_OF_WEEK> ?DAY_OF_WEEK ;
    <AIRLINE> ?AIRLINE ;
    <FLIGHT_NUMBER> ?FLIGHT_NUMBER ;
    <TAIL_NUMBER> ?TAIL_NUMBER ;
    <ORIGIN_AIRPORT> ?ORIGIN_AIRPORT ;
    <DESTINATION_AIRPORT> ?DESTINATION_AIRPORT ;
  ...
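The same row-to-graph mapping can be sketched in plain Python to show what one CSV record turns into. This is an illustration under stated assumptions, not the AnzoGraph loader: the sample row (including flight number 98) is made up, the `airport/...` and `flight/...` identifiers are a hypothetical naming scheme, and the edge property is modelled as a statement whose subject is the edge itself, in the spirit of RDF*.

```python
import csv
import io

# Made-up one-row sample using four of the CSV's column names.
CSV_TEXT = """FLIGHT_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,DISTANCE
98,BOS,JFK,187
"""


def row_to_statements(row):
    """Turn one CSV row into node, edge and edge-property statements."""
    origin = f"airport/{row['ORIGIN_AIRPORT']}"
    destination = f"airport/{row['DESTINATION_AIRPORT']}"
    flight = f"flight/{row['FLIGHT_NUMBER']}"
    edge = (origin, "DESTINATION", destination)
    return [
        (origin, "a", "Airport"),                       # node type
        (origin, "AIRPORT_CODE", row["ORIGIN_AIRPORT"]),
        (destination, "a", "Airport"),
        (destination, "AIRPORT_CODE", row["DESTINATION_AIRPORT"]),
        edge,                                           # the edge itself
        (edge, "DISTANCE", int(row["DISTANCE"])),       # property on the edge
        (flight, "a", "Flight"),
        (flight, "FLIGHT_NUMBER", row["FLIGHT_NUMBER"]),
    ]


statements = [
    statement
    for row in csv.DictReader(io.StringIO(CSV_TEXT))
    for statement in row_to_statements(row)
]
```

Using the edge tuple as the subject of the `DISTANCE` statement is what distinguishes an edge property from a node property in this sketch.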

Flight Delay Data
Airport Info
Census Data
FAA Aircraft Registrations
Integration and ELT

Combining additional data sets
[Diagram: the flight graph (Flight and Airport nodes; FlightDeparture, FlightArrival and DESTINATION edges) extended with CityState, Country, Aircraft and Airline nodes. Source data sets: Flight Delay, FAA, Airline, Census Data]

Now we are ready to ask questions like:
BI-Style Analytics
#1 Longest flight segments by distance from Boston (BOS)
#2 Airports less than 400 mi from Boston (BOS) - Network Viewer output
#3 Longest distances between two airports
#4 Longest flights by elapsed time
#5 Airlines with the longest average delays
#6 Airlines with the most flights
#7 Longest 2 segments reachable from Boston and the distances of each segment
#8 Which segments have the longest average departure delays
Graph Algorithms
#9 Page Rank - Graph Algorithm - Show most well-connected airports based on page rank algorithm
#10 Shortest Path Graph Algorithm - show shortest paths and # of segments (hops) from AUS

select * from <airline_flight_network> where
{ SERVICE <csi:shortest_path> {
    []
      <csi:binding-source-vertex> ?source_vertex_variable_name ;
      <csi:binding-vertex> ?node ;
      <csi:binding-predecessor> ?predecessor_variable_name ;
      <csi:binding-distance> ?distance ;
      <csi:graph> <airline_flight_network> ;
      <csi:source-vertex> <BOS> ;
      <csi:destination-vertex> <HNL> ;
      <csi:edge-label> <DESTINATION> ;
      <csi:weighted> true .
  }
}
Shortest Path graph algorithm leverages RDF*/SPARQL*
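For readers who want to see what a weighted shortest-path search does with the DISTANCE edge property, here is a minimal sketch of Dijkstra's algorithm in plain Python. Whether AnzoGraph's service uses this particular algorithm is not stated in the slides, so treat it as an assumption. The edge list and its distances are illustrative values, not figures from the flight dataset.

```python
import heapq


def shortest_path(edges, source, destination):
    """Dijkstra over directed (origin, target, distance) edges.

    Returns (total_distance, path) or None if no route exists.
    """
    graph = {}
    for origin, target, distance in edges:
        graph.setdefault(origin, []).append((target, distance))
        graph.setdefault(target, [])

    best = {source: 0}       # best known distance per vertex
    predecessor = {}         # previous vertex on the best known route
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == destination:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                predecessor[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))

    if destination not in best:
        return None
    path = [destination]
    while path[-1] != source:
        path.append(predecessor[path[-1]])
    return best[destination], path[::-1]


# Illustrative segments and distances in miles.
edges = [
    ("BOS", "JFK", 187),
    ("BOS", "LAX", 2611),
    ("LAX", "HNL", 2556),
    ("BOS", "SFO", 2704),
    ("SFO", "HNL", 2398),
]
route = shortest_path(edges, "BOS", "HNL")
```

The `best` and `predecessor` dictionaries play the same role as the distance and predecessor bindings requested in the query above: one gives the cost to reach a vertex, the other lets the route be walked back to the source.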

Scalability
Graph OLAP – Horizontally Scalable
Have more data. Need better performance. Add more servers.
Deploy on VMs or bare metal with a TAR file that is compatible with CentOS
Automated deployment in the Cloud.
Available in the AWS Marketplace & others soon
60 Day Full Feature Free Trial. Download or Cloud Deployment. Visit booth or AnzoGraph.com

Download AnzoGraph DB Free Edition Today! http://AnzoGraph.com
