Release webinar: Sansa and Ontario

Distributed Knowledge Graph Processing in SANSA and Ontario
Prof. Dr. Jens Lehmann
Big Data Europe Platform Release, May 3rd 2017

SANSA: Motivation
◎ Abundant machine readable structured information is
available (e.g. in RDF)
o Across BDE Societal Challenges, e.g. for Life Science Data
o General: DBpedia, Google knowledge graph
o Social graphs: Facebook, Twitter
◎ Need for scalable querying, inference and machine
learning
o Link prediction
o Knowledge base completion
o Predictive analytics
o Forward chaining inference

SANSA Stack
◎ SANSA includes several libraries:
o Read / Write RDF / OWL library
o Querying library
o Inference library
o Machine Learning (ML) core library
http://sansa-stack.net/
◎ Framework for distributed RDF dataset processing
aiming at high scalability and fault tolerance

SANSA: Read Write Layer
◎ Ingest RDF and OWL data in different formats using
Jena / OWL API style interfaces
◎ Goal: Represent/distribute data in multiple formats
(e.g. RDD, Data Frames, GraphX, Tensors)
◎ Goal: Allow transformation among these formats
◎ Compute dataset statistics and apply functions to URIs,
literals, subjects, objects → Distributed LODStats
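
The statistics idea above can be illustrated with a minimal sketch in plain Python. It is not the SANSA API: the toy triples, the two-partition split and the function names are assumptions made for illustration. Each partition computes its own property-usage counts, and the partial counts are then merged, which is the pattern a distributed implementation would follow.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy triples (subject, predicate, object); not taken from the slides.
TRIPLES = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", '"Bob"'),
]


def is_literal(term):
    # In this toy encoding, literals are the terms wrapped in double quotes.
    return term.startswith('"')


def property_usage(partition):
    """Count how often each predicate occurs in one partition."""
    return Counter(p for _, p, _ in partition)


def merge(left, right):
    """Combine the partial counts of two partitions."""
    return left + right


# Assumed split of the data into two partitions, standing in for cluster nodes.
partitions = [TRIPLES[:2], TRIPLES[2:]]
usage = reduce(merge, (property_usage(part) for part in partitions), Counter())

distinct_subjects = len({s for s, _, _ in TRIPLES})
literal_objects = sum(1 for _, _, o in TRIPLES if is_literal(o))
```

Because counting is associative, the merge step can be applied in any order, which is what makes this kind of statistic easy to distribute.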

SANSA: Query Layer
◎ Improve performance of generic queries via:
o Intelligent indexing
o Splitting strategies
o Distributed Storage
◎ SPARQL-to-SQL approaches, Virtual Views
◎ Next Step: SPARQL query engine evaluation
◎ SANSA provides a W3C SPARQL compliant endpoint

SANSA: Inference Layer
◎ W3C Standards for Modelling: RDFS and OWL
◎ Parallel in-memory inference via rule-based forward
chaining
◎ Beyond state of the art: dynamically build a rule
dependency graph for a rule set
◎ → Adjustable performance levels
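
Rule-based forward chaining can be sketched as a fixpoint loop: apply every rule to the known triples and repeat until nothing new is derived. The example below is a single-machine illustration in plain Python, not SANSA's implementation. It uses two standard RDFS entailment rules (rdfs11 for subclass transitivity, rdfs9 for type inheritance); the facts and function names are assumptions.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def rdfs11(triples):
    # (A subClassOf B), (B subClassOf C) => (A subClassOf C)
    sub = [(s, o) for s, p, o in triples if p == SUBCLASS]
    return {(a, SUBCLASS, c) for a, b in sub for b2, c in sub if b == b2}


def rdfs9(triples):
    # (x type A), (A subClassOf B) => (x type B)
    sub = [(s, o) for s, p, o in triples if p == SUBCLASS]
    typed = [(s, o) for s, p, o in triples if p == TYPE]
    return {(x, TYPE, b) for x, a in typed for a2, b in sub if a == a2}


def forward_chain(triples, rules):
    """Apply all rules repeatedly until no new triple is derived."""
    closure = set(triples)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(closure)
        new = derived - closure
        if not new:
            return closure
        closure |= new


# Hypothetical input facts.
facts = {
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:alice", TYPE, "ex:Student"),
}
closure = forward_chain(facts, [rdfs11, rdfs9])
```

Here rdfs9 consumes the subclass triples that rdfs11 produces. A rule dependency graph records this kind of relationship between rules, so an engine can plan the order in which rules are applied instead of blindly re-running all of them.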

SANSA: ML Layer
◎ Distributed Machine Learning (ML) algorithms that work
on RDF data and make use of its structure / semantics
◎ Work in Progress:
o Tensor Factorization for e.g. KB completion (testing stage)
o Graph Clustering (testing stage)
o Association rule mining (evaluation stage)
o Semantic Decision trees (idea stage)
o Inference in Knowledge Graph Embeddings (idea stage)

SANSA Status and Releases
◎ A generic stack for (big) Linked Data
o Built on top of state-of-the-art distributed frameworks (Spark, Flink)
o Easy to use: just add the dependencies to your existing Spark or Flink
project
◎ Out-of-the-box framework for (1) querying, (2) inference
and (3) machine learning over RDF datasets
◎ 0.1 Release in December
◎ Subsequent releases every 6 months

Semantic Data Lake (Ontario)
◎ Data Lake or Swamp?
o Repository of data in its original formats
o Structured, semi-structured, unstructured
o Without unified schema
◎ Semantic Data Lake
o Add a Semantic Layer on top of the source datasets
❖ The data is semantically lifted using ontology
terms
❖ Provide a uniform view over nonuniform data

◎ A SPARQL query is decomposed into sub-queries
o Keeping links between sub-queries to collect data
returned later from each sub-query
◎ Data Lake is checked for candidate data sources that
can answer each sub-query (consulting metadata)
◎ Each SPARQL sub-query is translated into a query in the
query language of the selected candidate data source
o Examples: SPARQL to SQL, or SPARQL to CQL
◎ Sub-results are joined together to form the end result
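
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not Ontario's implementation: the metadata table, the source names, the third property (gho:Year), the column-naming convention and the sample bindings are all hypothetical. The sketch groups triple patterns by the data source that can answer them, translates one group to SQL, and joins sub-results on the shared variable.

```python
# Hypothetical metadata: property -> (data source, source type).
METADATA = {
    "gho:Country": ("Observations", "sql"),
    "gho:Disease": ("Observations", "sql"),
    "gho:Year": ("observations_xml", "xml"),
}


def decompose(patterns, metadata):
    """Group triple patterns into sub-queries, one per candidate data source."""
    sub_queries = {}
    for s, p, o in patterns:
        source = metadata[p]
        sub_queries.setdefault(source, []).append((s, p, o))
    return sub_queries


def to_sql(table, patterns):
    """Translate a sub-query to SQL.

    Assumes each property maps to a column named after the lower-cased
    local name of the property (an illustrative convention only).
    """
    columns = [p.split(":")[1].lower() for _, p, _ in patterns]
    return "SELECT " + ", ".join(columns) + " FROM " + table


def join(left, right, key):
    """Join two lists of variable bindings on a shared variable."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


query = [
    ("?item", "gho:Country", "?country"),
    ("?item", "gho:Disease", "?disease"),
    ("?item", "gho:Year", "?year"),
]
sub_queries = decompose(query, METADATA)
sql = to_sql("Observations", sub_queries[("Observations", "sql")])

# Hand-written sub-results standing in for what each source would return.
joined = join(
    [{"item": 1, "country": "DE"}],
    [{"item": 1, "year": 2015}, {"item": 2, "year": 2016}],
    "item",
)
```

The shared variable (?item here) is the link between sub-queries. A real translation would also have to select the column behind that variable so the sub-results can be joined afterwards.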

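[Figure: Ontario query processing. The user's SPARQL query is decomposed into sub-queries. Metadata mapping each property to a data source (and its type) is used to find the relevant data sources (database, XML file, MongoDB), and each sub-query is converted into the query language of its source (SQL, XPath, JSON Path).]
Example triple patterns of a sub-query:
?item gho:Country ?country .
?item gho:Disease ?disease .
...
Example SQL translation:
SELECT country, disease, ... FROM Observations

[Figure: Results reconciliation. An execution plan uses the links between sub-queries to combine the results returned by the database, the XML file and MongoDB into the final results.]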

Big Data Europe Integrator Platform Launch
Wednesday 3 May @ 15:00 CEST
Please type your questions at any time. Q&A will follow the presentations

SANSA: Motivation
◎ Over recent years, the size of the Semantic Web has
increased and several large-scale datasets have been published.
Source: LOD-Cloud (http://lod-cloud.net/)
◎ Nowadays, the Hadoop ecosystem has become a standard for
Big Data applications.
◎ We use this infrastructure for the Semantic Web as well.

Thank You
@Gezim_Sejdiu
http://sda.cs.uni-bonn.de/gezim-sejdiu/
SANSA
Semantic Analytics Stack
https://github.com/SANSA-Stack
● SANSA-RDF
● SANSA-OWL
● SANSA-Query
● SANSA-Inference
● SANSA-ML
● SANSA-Examples

SANSA Planning
◎ Current state:
o SANSA 0.1 released in December
o Can read RDF/OWL files, compute statistics, simple
queries, lightweight inference, graph clustering, rule
mining
◎ Subsequent releases every 6 months
o SANSA 0.2 release planned in June 2017.

Conclusions and Next steps
◎ A generic stack for (big) Linked Data
o Built on top of state-of-the-art distributed frameworks (Spark, Flink)
◎ Out-of-the-box framework for scalable and distributed
semantic data analysis combining semantic web and
distributed machine learning for (1) querying, (2)
inference and (3) analytics of RDF datasets.
◎ Next steps
o Refinement of data structures (RDF/OWL Layer)
o Add support for SPARQL 1.1 and other backend strategies (Query
Layer)
o Define ML pipelines for structured ML (ML Layer)

BDE & General Integration
◎ SANSA = Scala / Maven Repositories based on Spark /
Flink
◎ Easy to include both in BDE platform and any Spark /
Flink environment
