At Data-centric Architecture Forum 2020 Thomas Cook, our Sales Director of AnzoGraph DB, gave his presentation "Knowledge Graph for Machine Learning and Data Science". These are his slides.

- Knowledge Graphs for Machine Learning and Data Science
  - Thomas Cook, Sales Director, AnzoGraph DB
  - e: thomas.cook@cambridgesemantics.com
  - w: www.anzograph.com
  - #DCAF 2020, Feb 6, 2020
- Data Continues to Grow; AI and ML Adoption Grows for Better & Faster Insights
  - Need for:
    - Automated Data Preparation & Better Understanding
    - Explainable AI & ML with Provenance
    - Improved Algorithms & Analytics
    - Cost Efficient Operations
  - Context: Knowledge Graphs & Graph Analytics
- Knowledge Graphs to Automate Data Preparation & Improve Common Understanding
- The Data Preparation Problem
  - Data Access
    - Manual ETL coding
    - Practicalities limit the number of sources and types of data
  - Data Processing
    - Laborious discovery, profiling and selection
    - Use of rules and coding for harmonization & cleansing
  - Feature Engineering
    - Manual coding to transform data
    - Manual feature engineering & selection
  - 70-80% of time is spent in Data Preparation & Feature Engineering
  - Viewed as the "least enjoyable" part of work by 76% of data scientists (source: "Cleaning Big Data", Forbes Magazine)
  - ©2019 Cambridge Semantics Inc. All rights reserved.
- Anzo - The Modern Data Discovery and Integration Layer for the Enterprise Data Fabric
  - ON-BOARD: Ingest & Map
    - Automated ETL
    - Collaborative Mapping
    - Metadata Capture
  - MODEL: Graph Data Model
    - Lift Data into Data Fabric
    - Design Ontologies
    - Connect Data Models
  - BLEND: GraphMarts
    - Combine and Align Related Data Sets
    - In-memory MPP OLAP Query Engine
    - Data Layers
  - ACCESS: Hi-Res Analytics
    - Analyze All Data Together
    - Fast, Iterative Queries (Ad Hoc, What if)
    - Code Free or API
  - Inputs: Enterprise Data Sources (Ingest)
  - Consumers: Machine Learning and AI, Enterprise Search, "Last Mile" Analytics Tools, Graphical Application Interface
  - Metadata Catalog: Semantic-based Metadata Management, Governance and Lineage
  - Data Storage Layer: Cloud or On-Prem Data Storage Infrastructure
  - Platform: Automated Deployment and Operations; Storage and Compute Integration
- Automated Data Ingestion & Cataloging
  - Sources
    - Unstructured Data: Notes, Docs, Emails, Articles
    - Structured Data: Relational, CSV, HDFS, External Data Feeds
  - Diagram labels: Ingest, Catalog, NLP / Text Analytics / Sentiment Analysis, Data Catalog, Semantic Layer
  - Data Harmonization (Structured or Unstructured)
    - Harmonize Many Data Sources
    - Automated Unstructured Data Extraction & Categorization
  - Data Wrangling Capability
    - Profile, Rules… Manage & Clean incoming data
    - Set up Re-usable Data Wrangling Jobs
    - Provenance to Manage Data
  - Data Catalog
    - Explore, Secure & Manage Dataset Assets
  - Result: Cleaner, quality data, faster & from many more sources
- A big web of data understandable at the data level
- Allow for Easy Understanding & Handling of Data
  - Rules to Link & Conform Data
  - From Raw Data to Business Ready Datasets
  - Create Data Layers
  - Build Graph Marts
- Expedite & Optimize Feature Engineering
  - Use visual interface with no coding for feature selection
    - Query Knowledge Graph and generate features
    - Conduct data transformation using a library of functions
    - Compute new derived features
    - De-normalize data
    - Aggregate ranges
    - Convert numeric values to alpha values
    - Pivot values and much more!
- Operationalizing of Machine Learning Models
  - Explainable Insights: Manage Data Sets with Provenance & Data Lineage
    - Anzo retains end-to-end data lineage
    - Track transformations
  - Easy to Export Data to ML & Data Science Tools
    - Use OData/REST APIs, SQL ODBC/JDBC
    - Export to R, Python, downstream systems, …
  - Deploy & Continuously Improve Model Performance
    - Set up deployment pipelines with learnings to help in feature selection
    - Horizontally scale runtime environment
    - Can be auto-deployed behind the firewall or on the Cloud
- Using Knowledge Graphs with Graph Analytics Database as Scalable Infrastructure for ML & Data Science
- "Graph analytics will grow in the next few years due to the need to ask complex questions across complex data, which is not always practical or even possible at scale using SQL queries" (Gartner, Top 10 Data and Analytics Technology Trends for 2019)
- AnzoGraph™ DB
  - What it is: Fast, Scalable Graph Database
    - In-Memory Massively Parallel Processing (MPP) ACID-Compliant Graph Database
    - Supports RDF & Labelled Property Graphs
  - What it does:
    - Fast Data Loading
    - Fast Query
    - Rich Analytics
      - Graph Algorithms
      - BI/DW Analytics
      - Inferencing
      - Data Science/Feature Engineering Algorithms
      - Define-Your-Own Analytics
    - Linear Database Scaling
    - Persist data on cheap storage
  - Based on Open Standards
    - Built on RDF & SPARQL 1.1 standards
    - LPG with RDF*/SPARQL*
    - LPG with Cypher (in 2020)
  - Deploy on-prem or cloud
    - Kubernetes/Helm on-demand cloud deployment
    - AWS, Google and Azure
- Benchmarks
  - 217x: AnzoGraph DB compared to Neo4j on the industry-standard TPC-H & Graph 500 benchmarks
  - 113x: AnzoGraph's LUBM benchmark performance over the previous fastest result
  - 30x: AnzoGraph's performance on graph algorithms over Spark SQL and Spark with GraphFrames
- Graph OLAP Built for Analytics at Scale and Speed: SQL OLAP vs Graph OLAP
  - SQL OLAP: On-line Analytics at Massive Parallel Processing (MPP) Scale with SQL
    - Database examples: Netezza, Amazon Redshift
    - Analytics: Warehouse-Style BI Analytics
  - Graph OLAP: On-line Analytics at Massive Parallel Processing (MPP) Scale with Native Graph
    - Database example: AnzoGraph DB
    - Analytics: Warehouse-Style BI Analytics, Graph Algorithms, Inferencing, Data Science Functions
- Graph OLAP Built for Analytics at Scale and Speed: Graph OLTP vs Graph OLAP
  - Graph OLTP: Transactional databases
    - Built for building transactional applications & individual transactions
    - Scales vertically
    - Examples: Neo4j, AWS Neptune
  - Graph OLAP: Analytical databases
    - Built for analytics and to deal with scale & performance
    - Deep Link analysis
    - Analytics on the population
    - Scales horizontally
    - Can complement Graph OLTP systems
    - Example: AnzoGraph DB
- Labelled Property Graphs Facilitate Analytics
  - Nodes
    - Person: Jack (isA: <Man>, birthday: 09/17/1975)
    - Person: Jill (isA: <Woman>, birthday: 4/23/1979)
    - Place: The Hill (isA: <Place>, has: Water, has: Trees, partOf: <TheMountain>)
  - Edges with properties
    - friendOf (metAt=<TheHill>, metDate=07/04/2018)
    - WentUp (Date=07/04/2018), one edge from each person to The Hill
  - Today with RDF* and SPARQL*
    - Relationships can be described as clearly as in any LPG database
    - RDF*/SPARQL* extensions to the standard make W3C open standards databases even more capable
- Algorithms and Analytical Capabilities
  - SPARQL 1.1 Standards: Graph Patterns, Negation, Property Paths, BIND, Aggregates, Basic Federated Query, ORDER BY and offsets, Functions on Strings, Functions on Numerics, Functions on Dates and Times, Hash Functions, Basic Graph Patterns, Count/Avg, Min/Max, GroupConcat, Sample
  - Graph Algorithms and Inferencing: Page Rank, Shortest Path, All Path, Label Propagation, Weakly Connected Components, K neighborhood, Counting Triangles, Inferences (RDFS+), Labeled Property Graphs (RDF*)
  - AnzoGraph® DB Extras: Window Aggregates, Advanced Grouping Sets, Named Views, Named Queries, Conditional Expressions, User-Defined Extensions
  - Data Science Extensions (UDX)
    - Distributions: Bernoulli, Binomial, Chi-squared, Exponential, Hypergeometric, Laplace, Log Normal, Logarithmic Series, Negative Binomial, Normal
    - Correlations: Pearson
    - Entropy: Cross Entropy, Differential Entropy
- User-Defined Extensions (UDXs): allow users to extend AnzoGraph DB functionality for custom usage
  - User-Defined Functions (UDF): Create and register custom analytic functions, such as functions that concatenate values or convert integers to alternate currencies.
  - User-Defined Aggregates (UDA): Create and register aggregate functions, such as functions that compute the arithmetic mean or calculate the average number from a list of maximum and minimum values.
  - User-Defined Services (UDS): Create and register services that create local SPARQL endpoints.
  - User-Defined Tables (UDT): Create and register a function that is repeatedly invoked within a query to generate the rows of a table on-the-fly.
  - Data Science Functions are delivered as UDXs; you can build your own functions in Java or C++.
- Execute Supervised & Unsupervised ML with Graph Algorithms
  - Graph Algorithms
    - PageRank
    - Shortest Path
    - K-neighbors
    - All Paths
    - Counting Triangles
    - Weakly Connected Components
    - Label Propagation
    - Triangle Enumeration
    - Triangle Counting
    - Clustering Coefficient and more!
  - Example questions
    - Who is the most influential person in your customer list?
    - What's the most important item relating to a search of your knowledge graph?
    - What is the shortest path to your destination across a route?
    - What's the optimal path for packets to travel across your network?
  - Try functions via SPARQL in Zeppelin or Python Jupyter notebooks (image source: Wikipedia)
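To make the "most influential person" question concrete, here is a minimal PageRank sketch in plain Python. It is only an illustration of the algorithm the slide names, not AnzoGraph DB's implementation or API; the `pagerank` function, the damping factor of 0.85, and the "follows" edges are all assumptions made for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return a {node: score} dict for a list of directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical "follows" relationships between customers.
edges = [("ann", "cat"), ("bob", "cat"), ("cat", "ann"), ("dan", "cat")]
scores = pagerank(edges)
most_influential = max(scores, key=scores.get)
print(most_influential, round(scores[most_influential], 3))
```

The resulting score per node is the kind of value that can be attached back to each customer as an extra feature for a downstream model.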
- Graph Algorithms Produce Additional Features to Train ML Models
  - Try functions via SPARQL in Zeppelin or Python Jupyter notebooks (image source: Wikipedia)
- Execute Inferencing using RDFS+ and OWL 2 RL
  - AnzoGraph allows you to insert inferred triples into the specified target graph.
  - "Is Married" (Jack, Jill): if Jack is married to Jill, then you can definitely infer that Jill is married to Jack.
  - "Knows" (Jack, Sam): Jack knows Sam, but Sam may not know Jack. Here, the inference is less clear.
  - Both cases are supported.
  - Try functions via SPARQL in Zeppelin or Python Jupyter notebooks.
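The married/knows distinction corresponds to treating one predicate as symmetric and the other as not (OWL expresses this with `owl:SymmetricProperty`). The sketch below shows the idea in plain Python under stated assumptions: triples are simple tuples, and `infer_symmetric` and the predicate names are hypothetical, not part of AnzoGraph DB.

```python
def infer_symmetric(triples, symmetric_predicates):
    """Return the reverse triples implied by predicates declared symmetric."""
    inferred = set()
    for subject, predicate, obj in triples:
        reverse = (obj, predicate, subject)
        if predicate in symmetric_predicates and reverse not in triples:
            inferred.add(reverse)
    return inferred


asserted = {
    ("Jack", "isMarriedTo", "Jill"),
    ("Jack", "knows", "Sam"),
}

# Only "isMarriedTo" is declared symmetric; "knows" is left alone.
inferred = infer_symmetric(asserted, {"isMarriedTo"})

# Inferred triples are added to a target graph alongside the asserted ones.
target_graph = asserted | inferred
print(sorted(inferred))
```

Whether "knows" should also be declared symmetric is a modelling decision; the code simply follows whatever set of predicates it is given.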
- ELT for Data Engineering
  - Wrangling, Blending, Munging, Transformations, Enrichment, Views
  - Use statistical functions, transformations or enrichment to get the data into the form needed for the downstream ML pipeline
  - Example (derive a full name from first and last name):

        INSERT {
          GRAPH <myNewGraph> { ?s a <Person> ; <fullname> ?fullname }
        }
        USING <myOldGraph>
        WHERE {
          ?s a <Person> ;
             <firstname> ?fname ;
             <lastname> ?lname .
          BIND(CONCAT(?fname, " ", ?lname) AS ?fullname)
        }
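For readers more at home in Python than SPARQL, the same full-name transformation can be sketched over an in-memory set of triples. This is an illustrative analogue only: the tuple representation, the `add_fullnames` helper, and the sample people are assumptions, and it does not talk to a database.

```python
def add_fullnames(old_graph):
    """Build a new graph holding each Person and a derived fullname."""
    people = {s for s, p, o in old_graph if p == "a" and o == "Person"}
    first = {s: o for s, p, o in old_graph if p == "firstname"}
    last = {s: o for s, p, o in old_graph if p == "lastname"}

    new_graph = set()
    for person in people:
        # Like the WHERE clause, skip people missing either name part.
        if person in first and person in last:
            new_graph.add((person, "a", "Person"))
            new_graph.add((person, "fullname", f"{first[person]} {last[person]}"))
    return new_graph


# Hypothetical source graph.
my_old_graph = {
    ("p1", "a", "Person"), ("p1", "firstname", "Jack"), ("p1", "lastname", "Hill"),
    ("p2", "a", "Person"), ("p2", "firstname", "Jill"),
}
my_new_graph = add_fullnames(my_old_graph)
print(sorted(my_new_graph))
```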
- ELT for Data Engineering: Materialized Views
  - Good for heavy calculations: perform once, use many times
  - Example (compute each person's age once and store it as a view):

        CREATE MATERIALIZED VIEW <ages> AS
        CONSTRUCT { ?person <age> ?age . }
        WHERE {
          GRAPH <tickit> {
            { SELECT ?person ((YEAR(?date)) - (YEAR(xsd:dateTime(?birthdate))) AS ?age)
              WHERE {
                ?person <birthday> ?birthdate .
                BIND(xsd:dateTime(NOW()) AS ?date)
              }
            }
          }
        }
- ELT for Data Engineering: Enrichment
  - Add new features from a federated query, using a SERVICE call to the Linked Open Data Cloud or other internal SPARQL endpoints
  - Example: look up address and geocodes for a company, census population data, crime rate, demographics, etc.
  - All of these can be new features to feed into the ML pipeline
- ML Step #1: Data Prep: Data Discovery and Feature Engineering (Distributions)
  - 2.1 Bernoulli Distribution: Determines the probability of Success or Failure (or Yes or No) in a single trial.
  - 2.2 Binomial Distribution: Determines the probability of a given number of successes in a fixed number of independent trials.
  - 2.3 Chi-squared Distribution: Used to test the relationship between two categorical variables.
  - 2.4 Exponential Distribution: Determines the probability of event occurrence in a time interval when the past event number is unknown.
  - 2.5 Hypergeometric Distribution: Determines the probability of a given number of successes when drawing without replacement.
  - 2.6 Laplace Distribution: Determines the probability of intervals.
  - 2.7 Log Normal Distribution: Models certain instances, such as the change in price distribution of a stock or commodity positions.
  - 2.8 Logarithmic Series Distribution: Determines the probability of occurrence of events like claim frequencies in insurance companies.
  - 2.9 Negative Binomial Distribution: Determines the probability of a given number of failures before a target number of successes.
  - 2.10 Normal Distribution: Models and determines probabilities for many kinds of natural and social data.
  - 2.11 Poisson Distribution: Determines the probability that a certain number of events will occur in a specific time period.
  - 2.12 Skellam Distribution: Determines the probability of the difference between two independent Poisson-distributed variables.
  - 2.13 Beta-binomial Distribution: Models the number of successes in n binomial trials when the probability of success p is a Beta random variable.
  - 2.14 Continuous Uniform Distribution: Assigns equal probability to all values between its minimum and maximum.
  - 2.15 Discrete Uniform Distribution: Determines the probability of a finite number of outcomes equally likely to happen.
  - 2.16 Student's t-Distribution: Determines the probability when the sample size is small.
  - 2.17 Weibull Distribution: Used to assess product reliability, analyse life data and model failure times.
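The slide lists the distributions without formulas, so two of them are sketched here for orientation. These are the textbook probability mass functions written with the Python standard library; the function names are hypothetical and are not the names of AnzoGraph DB's data science functions.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(exactly k events in an interval with expected event count `rate`)."""
    return rate**k * math.exp(-rate) / math.factorial(k)


# Probability of exactly 2 delayed flights out of 4, if each is delayed
# independently with probability 0.5 (an illustrative assumption).
two_of_four = binomial_pmf(2, 4, 0.5)

# Probability of no cancellations in a day when 2 are expected on average.
no_events = poisson_pmf(0, 2.0)
print(two_of_four, no_events)
```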
- ML Step #1: Data Discovery and Feature Engineering
  - Correlations
    - 3.1 Pearson Correlation Coefficient: Determines the positive, negative or no linear relationship between two variables.
    - 3.2 Matthews Correlation Coefficient: Determines the positive, negative or no relationship between two binary variables (0 & 1).
    - 3.3 Spearman's Rank Correlation Coefficient: Measures the strength of a monotonic relationship between paired data, based on ranks.
  - Feature Exploration
    - 5.1 Principal Component Analysis: Reduces the dimensionality of large data sets, which helps when building predictive models.
  - Profiling Metrics
    - 6.1 Geometric Mean: Determines average growth rates.
    - 6.2 Skewness Metric: Calculates Pearson's coefficient of skewness on numeric values.
    - 6.3 T-Digest Metric: Estimates percentile and quantile values accurately.
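As a reference point for the correlation functions, the Pearson coefficient can be computed in a few lines. This is a minimal standard-library sketch of the textbook formula; the `pearson` helper and the distance/air-time values are made up for illustration and do not come from the flight dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical feature columns: route distance (miles) and air time (minutes).
distance = [187, 400, 950, 2611]
air_time = [45, 70, 140, 360]
r = pearson(distance, air_time)
print(round(r, 3))
```

A value near +1 or -1 suggests two candidate features carry largely the same linear information, which is useful when deciding which ones to keep.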
- RDF*/SPARQL* Real World Example: Airline Delay Data Analysis
- Public Flight Delay Data Analysis: https://www.transtats.bts.gov/ot_delay/OT_DelayCause1.asp?pn=1
- Input CSV: 32 Columns, 5,819,080 records
  - Columns shown: YEAR, MONTH, DAY, DAY_OF_WEEK, AIRLINE, FLIGHT_NUMBER, TAIL_NUMBER, ORIGIN_AIRPORT, DESTINATION_AIRPORT, SCHEDULED_DEPARTURE, DEPARTURE_TIME, DEPARTURE_DELAY, TAXI_OUT, WHEELS_OFF, SCHEDULED_TIME, ELAPSED_TIME, AIR_TIME, DISTANCE, WHEELS_ON, TAXI_IN, SCHEDULED_ARRIVAL, ARRIVAL_TIME, ARRIVAL_DELAY, DIVERTED, CANCELLED, CANCELLATION_REASON, AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, WEATHER_DELAY
- Conversion from CSV to Graph: Defining Triples
  - Flight, FlightDeparture, Airport
  - Flight, FlightArrival, Airport
  - Airport, DESTINATION, Airport
- Conversion from CSV to Graph
  - Nodes: Flight, Airport, Airport
  - Edges: FlightDeparture, FlightArrival, DESTINATION
- Nodes have types and properties
  - Node Type: Flight
  - Node Properties: YEAR, MONTH, DAY, DAY_OF_WEEK, AIRLINE, FLIGHT_NUMBER, TAIL_NUMBER, ORIGIN_AIRPORT, DESTINATION_AIRPORT, etc.
  - Note: Types can also be called Labels, as in Labeled Property Graphs or LPG
- With RDF*, edges can also have properties
  - Airport (AIRPORT_CODE = 'BOS'), DESTINATION, Airport (AIRPORT_CODE = 'JFK')
  - Edge Property: DISTANCE = 187
- Loading CSV with TABLE expression
  - Specify the CSV file name and location, followed by the CSV column names and data types:

        ... TABLE <s3://csi-notebook-datasets/Flight_Dataset/flights10k.csv>
        ('ContentType'='text/CSV','Schema'=',H,YEAR:int,MONTH:int,DAY:int,DAY_OF_WEEK:int,AIRLINE:char,FLIGHT_NUMBER:char,TAIL_NUMBER:char,ORIGIN_AIRPORT:char,DESTINATION_AIRPORT:char,SCHEDULED_DEPARTURE:char,DEPARTURE_TIME:char,DEPARTURE_DELAY:int,TAXI_OUT:int,WHEELS_OFF:char,SCHEDULED_TIME:int,ELAPSED_TIME:int,AIR_TIME:int,DISTANCE:int,WHEELS_ON:char,TAXI_IN:int,SCHEDULED_ARRIVAL:char,ARRIVAL_TIME:char,ARRIVAL_DELAY:int,DIVERTED:int,CANCELLED:int,CANCELLATION_REASON:char,AIR_SYSTEM_DELAY:int,SECURITY_DELAY:int,AIRLINE_DELAY:int,LATE_AIRCRAFT_DELAY:int,WEATHER_DELAY:int')
- Conversion from CSV to RDF* Triples via SPARQL
  - Node Type: Airport (origin and destination), with the edge property DISTANCE on each DESTINATION edge
  - Node Type: Flight, with node properties Flight Number, Tail Number, etc.
  - Query as shown on the slide (truncated there):

        INSERT {
          GRAPH <airline_flight_network> {
            ?OriginIRI a <Airport> ;
              <AIRPORT_CODE> ?ORIGIN_AIRPORT .
            << ?OriginIRI <DESTINATION> ?DestinationIRI >> <DISTANCE> ?DISTANCE .
            ?DestinationIRI a <Airport> ;
              <AIRPORT_CODE> ?DESTINATION_AIRPORT .
            << ?DestinationIRI <DESTINATION> ?OriginIRI >> <DISTANCE> ?DISTANCE .
            ?FlightIRI a <Flight> ;
              <YEAR> ?YEAR ;
              <MONTH> ?MONTH ;
              <DAY> ?DAY ;
              <DAY_OF_WEEK> ?DAY_OF_WEEK ;
              <AIRLINE> ?AIRLINE ;
              <FLIGHT_NUMBER> ?FLIGHT_NUMBER ;
              <TAIL_NUMBER> ?TAIL_NUMBER ;
              <ORIGIN_AIRPORT> ?ORIGIN_AIRPORT ;
              <DESTINATION_AIRPORT> ?DESTINATION_AIRPORT ;
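The shape of that CSV-to-graph conversion can be sketched in plain Python: each row yields airport nodes, a flight node with properties, and DESTINATION edges in both directions that carry a DISTANCE property. The sample rows, the `csv_to_graph` helper, the `flight/<n>` identifier scheme, and the airline code `XX` are illustrative assumptions; only the column names and the BOS/JFK distance of 187 come from the slides.

```python
import csv
import io

# Hypothetical two-row sample using column names from the flight-delay CSV.
SAMPLE_CSV = """AIRLINE,FLIGHT_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,DISTANCE
XX,100,BOS,JFK,187
XX,101,JFK,BOS,187
"""


def csv_to_graph(csv_text):
    """Return (triples, edge_properties) built from flight rows."""
    triples = set()
    edge_properties = {}  # edge triple -> {property: value}, mimicking RDF*
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        origin = row["ORIGIN_AIRPORT"]
        destination = row["DESTINATION_AIRPORT"]
        flight = f"flight/{index}"  # hypothetical identifier scheme

        for airport in (origin, destination):
            triples.add((airport, "a", "Airport"))
            triples.add((airport, "AIRPORT_CODE", airport))

        # DESTINATION edges in both directions, each annotated with DISTANCE.
        for edge in ((origin, "DESTINATION", destination),
                     (destination, "DESTINATION", origin)):
            triples.add(edge)
            edge_properties[edge] = {"DISTANCE": int(row["DISTANCE"])}

        triples.add((flight, "a", "Flight"))
        for column in ("AIRLINE", "FLIGHT_NUMBER", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT"):
            triples.add((flight, column, row[column]))
    return triples, edge_properties


triples, edge_properties = csv_to_graph(SAMPLE_CSV)
print(edge_properties[("BOS", "DESTINATION", "JFK")])
```

Keeping edge properties in a separate dictionary keyed by the edge triple is one simple way to mimic what RDF* expresses with `<< s p o >>` annotations.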
- Integration and ELT: Flight Delay Data, Airport Info, Census Data, FAA Aircraft Registrations
- Combining additional data sets
  - Flight Delay: Flight, Airport, FlightDeparture, FlightArrival, DESTINATION
  - Added entities: CityState, Country, Aircraft, Airline
  - Additional sources: FAA, Airline, Census Data
- Now we are ready to ask questions like:
  - BI-Style Analytics
    - #1 Longest flight segments by distance from Boston (BOS)
    - #2 Airports less than 400 mi from Boston (BOS), shown in the Network Viewer output
    - #3 Longest distances between two airports
    - #4 Longest flights by elapsed time
    - #5 Airlines with the longest average delays
    - #6 Airlines with the most flights
    - #7 Longest 2 segments reachable from Boston and the distances of each segment
    - #8 Which segments have the longest average departure delays
  - Graph Algorithms
    - #9 Page Rank: show the most well-connected airports based on the page rank algorithm
    - #10 Shortest Path: show shortest paths and number of segments (hops) from AUS
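A BI-style question such as #5 boils down to a group-by and an average. The sketch below shows that aggregation in plain Python over hypothetical `(airline, delay)` pairs; the airline codes, the delay values, and the `average_delay_by_airline` helper are assumptions for illustration, and in the demo the equivalent work would be done by a query against the graph.

```python
from collections import defaultdict


def average_delay_by_airline(flights):
    """Return [(airline, mean_delay)] sorted from longest to shortest delay."""
    totals = defaultdict(lambda: [0, 0])  # airline -> [sum of delays, count]
    for airline, delay in flights:
        if delay is None:  # e.g. cancelled flights with no recorded delay
            continue
        totals[airline][0] += delay
        totals[airline][1] += 1
    averages = {airline: total / count for airline, (total, count) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (AIRLINE, ARRIVAL_DELAY in minutes) pairs.
flights = [("AA", 10), ("AA", 30), ("DL", 5), ("DL", None), ("UA", 50)]
ranking = average_delay_by_airline(flights)
print(ranking)
```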
- Shortest Path graph algorithm leverages RDF*/SPARQL*

        SELECT *
        FROM <airline_flight_network>
        WHERE {
          SERVICE <csi:shortest_path> {
            [] <csi:binding-source-vertex> ?source_vertex_variable_name ;
               <csi:binding-vertex> ?node ;
               <csi:binding-predecessor> ?predecessor_variable_name ;
               <csi:binding-distance> ?distance ;
               <csi:graph> <airline_flight_network> ;
               <csi:source-vertex> <BOS> ;
               <csi:destination-vertex> <HNL> ;
               <csi:edge-label> <DESTINATION> ;
               <csi:weighted> true .
          }
        }
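To show what a weighted shortest-path computation like BOS to HNL involves, here is a small Dijkstra sketch in plain Python. It is a generic textbook algorithm, not the implementation behind the `csi:shortest_path` service; the `shortest_path` helper is hypothetical, and apart from the BOS to JFK distance of 187 from the slides, the segment distances are illustrative values rather than data from the flight dataset.

```python
import heapq


def shortest_path(edges, source, target):
    """Dijkstra over {(src, dst): weight}; returns (distance, path) or (None, [])."""
    graph = {}
    for (src, dst), weight in edges.items():
        graph.setdefault(src, []).append((dst, weight))

    distance = {source: 0}
    predecessor = {}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            break
        if dist > distance.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < distance.get(neighbour, float("inf")):
                distance[neighbour] = candidate
                predecessor[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))

    if target not in distance:
        return None, []
    path = [target]
    while path[-1] != source:
        path.append(predecessor[path[-1]])
    return distance[target], path[::-1]


# DESTINATION edges weighted by DISTANCE (illustrative values).
edges = {
    ("BOS", "JFK"): 187,
    ("JFK", "LAX"): 2475,
    ("BOS", "LAX"): 2611,
    ("LAX", "HNL"): 2556,
}
total, path = shortest_path(edges, "BOS", "HNL")
print(total, path, len(path) - 1)
```

The predecessor map plays the same role as the predecessor binding in the query: it lets the path, and therefore the number of hops, be reconstructed from the result.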
- AIRLINE DEMO
- Scalability: Graph OLAP is Horizontally Scalable
  - Have more data? Need better performance? Add more servers.
  - Deploy on VMs or bare metal with a TAR file that is compatible with CentOS
  - Automated deployment in the Cloud; available in the AWS Marketplace, with others coming soon
  - 60 Day Full Feature Free Trial: download or cloud deployment. Visit the booth or AnzoGraph.com
- Download AnzoGraph DB Free Edition Today! http://AnzoGraph.com
- THANK YOU