Knowledge Graph for Machine Learning and Data Science
Mar. 6, 2020
Data & Analytics
At the Data-Centric Architecture Forum 2020, Thomas Cook, our Sales Director for AnzoGraph DB, presented "Knowledge Graph for Machine Learning and Data Science". These are his slides.
Thomas Cook
Sales Director, AnzoGraph DB
e: thomas.cook@cambridgesemantics.com
w: www.anzograph.com
Knowledge Graphs for Machine Learning
and Data Science
#DCAF 2020
Feb 6, 2020
Data Continues to Grow
AI and ML Adoption Grows for Better & Faster Insights
Need for:
– Automated Data Preparation & Better Understanding
– Explainable AI & ML with Provenance
– Improved Algorithms & Analytics
– Cost Efficient Operations
Context
Knowledge Graphs & Graph Analytics
Automated Deployment and Operations
Storage and Compute Integration
MODEL: Graph Data Model
• Lift Data into Data Fabric
• Design Ontologies
• Connect Data Models
ON-BOARD: Ingest & Map
• Automated ETL
• Collaborative Mapping
• Metadata Capture
BLEND: GraphMarts
• Combine and Align Related Data Sets
• In-memory MPP OLAP Query Engine
• Data Layers
ACCESS: Hi-Res Analytics
• Analyze All Data Together
• Fast, Iterative Queries: Ad Hoc, What If
• Code Free or API
Graphical Application Interface
Enterprise Data Sources
Machine Learning and AI
Enterprise Search
“Last Mile” Analytics Tools
Metadata Catalog: Semantic-based Metadata Management, Governance and Lineage
Cloud or On-Prem Data Storage Infrastructure
Data Storage Layer
Ingest
Anzo - The Modern Data Discovery and Integration Layer for the Enterprise Data Fabric
Using Knowledge Graphs with a Graph Analytics Database as Scalable Infrastructure for ML & Data Science
“Graph analytics will grow in the next few years due to the need to ask complex questions across complex data, which is not always practical or even possible at scale using SQL queries”
Gartner, Top 10 Data and Analytics Technology Trends for 2019
What it is:
● Fast, Scalable Graph Database
○ In-Memory Massively Parallel Processing (MPP) ACID-Compliant Graph Database
○ Supports RDF & Labelled Property Graphs
What it does:
○ Fast Data Loading
○ Fast Query
○ Rich Analytics
■ Graph Algorithms
■ BI/DW Analytics
■ Inferencing
■ Data Science/Feature Engineering Algorithms
■ Define-Your-Own Analytics
○ Linear Database Scaling
○ Persist data on cheap storage
Based on Open Standards
• Built on RDF & SPARQL 1.1 standards
• LPG with RDF*/SPARQL*
• LPG with Cypher (in 2020)
Deploy on-prem or cloud
• Kubernetes/Helm on-demand cloud deployment
• AWS, Google and Azure
AnzoGraph™ DB
Awards
Select Customers
Benchmarks
217X: AnzoGraph DB compared to Neo4j on the industry-standard TPC-H & Graph 500 benchmarks
113X: AnzoGraph's LUBM benchmark performance over the previous fastest result
30X: AnzoGraph's performance on graph algorithms over Spark SQL and Spark with GraphFrames
Labelled Property Graphs Facilitate Analytics
Person: Jack
isA: <Man>
birthday: 09/17/1975
Person: Jill
isA: <Woman>
birthday: 4/23/1979
Place: The Hill
isA: <Place>
has: Water
has: Trees
partOf: <TheMountain>
friendOf (between Jack and Jill): metAt=<TheHill>, metDate=07/04/2018
WentUp (Jack to The Hill): Date=07/04/2018
WentUp (Jill to The Hill): Date=07/04/2018
Today with RDF* and SPARQL*
• Relationships can be described as clearly as in any LPG database
RDF*/SPARQL* extensions to the standard make W3C open-standards databases even more capable
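The idea of attaching properties to a relationship can be sketched in a few lines of Python. This is only an illustration of the data model on the slide, not AnzoGraph DB's API: the `Triple` and `AnnotatedGraph` names are hypothetical, and the direction chosen for `friendOf` (Jack to Jill) is an assumption, since the slide does not show it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    """A subject-predicate-object statement; frozen so it can be a dict key."""
    subject: str
    predicate: str
    obj: str


@dataclass
class AnnotatedGraph:
    """Toy store: each triple maps to key/value properties (the RDF* idea)."""
    annotations: dict = field(default_factory=dict)

    def add(self, s, p, o, **props):
        triple = Triple(s, p, o)
        self.annotations.setdefault(triple, {}).update(props)
        return triple

    def edges_with(self, predicate):
        """Return all triples using the given predicate, with their properties."""
        return {t: props for t, props in self.annotations.items()
                if t.predicate == predicate}


g = AnnotatedGraph()
g.add("Jack", "isA", "Man")
g.add("Jill", "isA", "Woman")
# Edge properties from the slide; the Jack -> Jill direction is assumed.
g.add("Jack", "friendOf", "Jill", metAt="TheHill", metDate="07/04/2018")
g.add("Jack", "WentUp", "TheHill", Date="07/04/2018")
g.add("Jill", "WentUp", "TheHill", Date="07/04/2018")

for triple, props in g.edges_with("WentUp").items():
    print(triple.subject, "went up", triple.obj, "on", props["Date"])
```

Keying the properties on the whole triple, rather than on a node, is what lets the edge itself carry data such as the date.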
Algorithms and Analytical Capabilities
SPARQL 1.1 Standards
● Graph Patterns
● Negation
● Property Paths
● BIND
● Aggregates
● Basic Federated Query
● ORDER BY and offsets
● Functions on Strings
● Functions on Numerics
● Functions on Dates and Times
● Hash Functions
● Basic Graph Patterns
● Count/Avg
● Min/Max
● GroupConcat
● Sample
Graph Algorithms and Inferencing
● Page Rank
● Shortest Path
● All Path
● Label Propagation
● Weakly Connected Components
● K neighborhood
● Counting Triangles
● Inferences (RDFS+)
● Labeled Property Graphs (RDF*)
AnzoGraph® DB Extras
● Window Aggregates
● Advanced Grouping Sets
● Named Views
● Named Queries
● Conditional Expressions
● User-Defined Extensions
Data Science Extensions (UDX)
Distributions
● Bernoulli
● Binomial
● Chi-squared
● Exponential
● Hypergeometric
● Laplace
● Log Normal
● Logarithmic Series
● Negative Binomial
● Normal
Correlations
● Pearson
Entropy
● Cross Entropy
● Differential Entropy
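To make two of the listed measures concrete, here is a minimal pure-Python sketch of Pearson correlation and cross entropy. It illustrates the textbook definitions only and says nothing about how AnzoGraph DB implements or names these functions; the sample numbers are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def cross_entropy(p, q):
    """H(p, q) = -sum(p_i * ln q_i); terms with p_i == 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


# Invented sample: flight distance (miles) against elapsed time (minutes).
distances = [187, 400, 950, 2600]
minutes = [70, 95, 160, 380]
print("correlation:", pearson(distances, minutes))
print("cross entropy:", cross_entropy([0.5, 0.5], [0.9, 0.1]))
```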
User-Defined Extensions (UDXs):
Allow users to extend AnzoGraph DB functionality for custom usage
User-Defined Functions (UDF): Create and register custom analytic functions, such as functions that concatenate values or convert integers to alternate currencies.
User-Defined Aggregates (UDA): Create and register aggregate functions, such as functions that compute the arithmetic mean or calculate the average number from a list of maximum and minimum values.
User-Defined Services (UDS): Create and register services that create local SPARQL endpoints.
User-Defined Tables (UDT): Create and register a function that is repeatedly invoked within a query to generate the rows of a table on-the-fly.
Data Science Functions
User-Defined Functions (UDX)
Functions you can build in Java or C++
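The difference between a scalar function, an aggregate and a table function can be shown with a conceptual sketch. The slide states that real UDXs are built in Java or C++; the Python below only mirrors the three shapes, and every name in it (`to_currency`, `MeanAggregate`, `generate_rows`) is hypothetical rather than part of any AnzoGraph DB interface.

```python
def to_currency(amount, rate):
    """UDF shape: one input row in, one value out (currency conversion)."""
    return round(amount * rate, 2)


class MeanAggregate:
    """UDA shape: accumulate many values, then produce one result."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def accumulate(self, value):
        self.total += value
        self.count += 1

    def result(self):
        # An empty group has no mean, so return None instead of dividing by zero.
        return self.total / self.count if self.count else None


def generate_rows(n):
    """UDT shape: invoked repeatedly to yield the rows of a table on the fly."""
    for i in range(n):
        yield (i, i * i)


agg = MeanAggregate()
for delay in [12, 30, 5, 15]:
    agg.accumulate(delay)
print("mean delay:", agg.result())
print("rows:", list(generate_rows(3)))
```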
Conversion from CSV to Graph: Defining Triples
[Diagram: triples linking Flight to Airport via FlightDeparture and FlightArrival, and Airport to Airport via DESTINATION]
Conversion from CSV to Graph
[Diagram: a Flight node connected to two Airport nodes by FlightDeparture and FlightArrival, with the two Airports connected by DESTINATION]
Nodes have types and properties
Flight
YEAR
MONTH
DAY
DAY_OF_WEEK
AIRLINE
FLIGHT_NUMBER
TAIL_NUMBER
ORIGIN_AIRPORT
DESTINATION_AIRPORT
….
Node Type: Flight
Node Properties: Airline, Flight Number, Tail Number, etc.
*Note: Types can also be called Labels, as in Labeled Property Graphs or LPG
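The CSV-to-graph step can be sketched as a function that turns one flight row into triples. This is a minimal illustration under stated assumptions: the sample row is invented, the identifier scheme (`flight/...`, `airport/...`) is hypothetical, and the edge directions (Flight to Airport, origin Airport to destination Airport) are assumed because the slides do not spell them out.

```python
import csv
import io

# Invented one-row sample using a subset of the column names from the slide.
SAMPLE_CSV = """YEAR,MONTH,DAY,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT
2015,1,1,AA,100,N123AA,BOS,JFK
"""


def row_to_triples(row):
    """Map one CSV row to typed nodes, node properties and connecting edges."""
    date = f"{row['YEAR']}-{row['MONTH']}-{row['DAY']}"
    flight = f"flight/{row['AIRLINE']}{row['FLIGHT_NUMBER']}/{date}"
    origin = f"airport/{row['ORIGIN_AIRPORT']}"
    dest = f"airport/{row['DESTINATION_AIRPORT']}"
    triples = [
        # Node types (labels).
        (flight, "type", "Flight"),
        (origin, "type", "Airport"),
        (dest, "type", "Airport"),
        # Edges between nodes (directions are an assumption).
        (flight, "FlightDeparture", origin),
        (flight, "FlightArrival", dest),
        (origin, "DESTINATION", dest),
    ]
    # Node properties copied straight from the CSV columns.
    for column in ("AIRLINE", "FLIGHT_NUMBER", "TAIL_NUMBER"):
        triples.append((flight, column, row[column]))
    return triples


triples = [t for row in csv.DictReader(io.StringIO(SAMPLE_CSV))
           for t in row_to_triples(row)]
for t in triples:
    print(t)
```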
With RDF*, edges can also have properties
Airport (AIRPORT_CODE = 'BOS')
DESTINATION (DISTANCE = 187)
Airport (AIRPORT_CODE = 'JFK')
Edge Property: DISTANCE
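As a rough sketch of what the edge property looks like in text form, the helper below formats a statement in the `<< subject predicate object >>` quoted-triple style used by RDF*. The function name is hypothetical, the `:` prefix is assumed to be declared elsewhere, and the exact syntax a given database accepts should be checked against its documentation.

```python
def rdf_star_annotation(subject, predicate, obj, prop, value):
    """Format one property on an edge as a Turtle*-style statement."""
    # Strings become quoted literals; numbers are written bare.
    literal = f'"{value}"' if isinstance(value, str) else str(value)
    return f"<< :{subject} :{predicate} :{obj} >> :{prop} {literal} ."


print(rdf_star_annotation("BOS", "DESTINATION", "JFK", "DISTANCE", 187))
# prints: << :BOS :DESTINATION :JFK >> :DISTANCE 187 .
```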
Combining additional data sets
[Diagram: the Flight/Airport graph (FlightDeparture, FlightArrival, DESTINATION) extended with CityState, Aircraft, Airline and Country nodes]
FAA Airline Census Data
Flight Delay
Now we are ready to ask questions like:
BI-Style Analytics
#1 Longest flight segments by distance from Boston (BOS)
#2 Airports less than 400 mi from Boston (BOS) - Network Viewer output
#3 Longest distances between two airports
#4 Longest flights by elapsed time
#5 Airlines with the longest average delays
#6 Airlines with the most flights
#7 Longest 2 segments reachable from Boston and the distances of each segment
#8 Which segments have the longest average departure delays
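Two of the BI-style questions above (#1 and #5) can be sketched over a handful of in-memory records. In the talk these would be queries against the graph; the Python below only illustrates the logic of each question. All flight records are invented for the example, apart from the 187-mile BOS to JFK distance that appears on an earlier slide.

```python
from collections import defaultdict

# Invented sample records; only the BOS-JFK distance comes from the slides.
FLIGHTS = [
    {"airline": "AA", "origin": "BOS", "dest": "JFK", "distance": 187, "delay": 12},
    {"airline": "AA", "origin": "BOS", "dest": "SFO", "distance": 2704, "delay": 30},
    {"airline": "DL", "origin": "BOS", "dest": "ATL", "distance": 946, "delay": 5},
    {"airline": "DL", "origin": "ATL", "dest": "AUS", "distance": 813, "delay": 15},
]


def longest_segments_from(flights, origin, top=3):
    """Question #1: distinct segments leaving `origin`, longest first."""
    segments = {}
    for f in flights:
        if f["origin"] == origin:
            segments[f["dest"]] = f["distance"]
    return sorted(segments.items(), key=lambda kv: kv[1], reverse=True)[:top]


def average_delay_by_airline(flights):
    """Question #5: airlines ordered by average delay, longest first."""
    totals = defaultdict(lambda: [0, 0])  # airline -> [delay sum, flight count]
    for f in flights:
        totals[f["airline"]][0] += f["delay"]
        totals[f["airline"]][1] += 1
    averages = {a: s / c for a, (s, c) in totals.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


print(longest_segments_from(FLIGHTS, "BOS"))
print(average_delay_by_airline(FLIGHTS))
```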
Graph Algorithms
#9 PageRank Graph Algorithm - show the most well-connected airports by PageRank score
#10 Shortest Path Graph Algorithm - show shortest paths and # of segments (hops) from AUS
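The two graph-algorithm questions can be illustrated on a toy route map. This is a textbook sketch, not the database's implementation: PageRank is computed by simple power iteration with an assumed damping factor of 0.85, shortest paths are counted in hops with breadth-first search, and the `ROUTES` adjacency list is invented for the example.

```python
from collections import deque

# Invented directed routes; every destination also appears as a key.
ROUTES = {
    "AUS": ["DFW", "ATL"],
    "DFW": ["AUS", "BOS", "JFK"],
    "ATL": ["AUS", "BOS"],
    "BOS": ["JFK", "ATL"],
    "JFK": ["BOS", "DFW"],
}


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; assumes all targets are keys of `graph`."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A node with no outgoing routes spreads its rank evenly.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank


def hops_from(graph, source):
    """Breadth-first search: fewest segments (hops) from source to each node."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, []):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    return hops


ranks = pagerank(ROUTES)
for airport, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(airport, round(score, 3))
print(hops_from(ROUTES, "AUS"))
```

Breadth-first search counts segments rather than miles, which matches the "# of segments (hops)" wording of question #10; a distance-weighted variant would need an algorithm such as Dijkstra's instead.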