Distributed Knowledge Graph
Processing in SANSA and
Ontario
Prof. Dr. Jens Lehmann
Big Data Europe Platform ReleaseMay 3rd 2017
SANSA: Motivation
◎ Abundant machine readable structured information is
available (e.g. in RDF)
o Across BDE Societal Challenges, e.g. for Life Science Data
o General: DBpedia, Google knowledge graph
o Social graphs: Facebook, Twitter
◎ Need for scalable querying, inference and machine
learning
o Link prediction
o Knowledge base completion
o Predictive analytics
o Forward chaining inference
2
SANSA Stack
4
◎ SANSA includes several libraries:
o Read / Write RDF / OWL library
o Querying library
o Inference library
o ML- Machine Learning core library
http://sansa-stack.net/
◎ Framework for distributed RDF dataset processing
aiming at high scalability and fault tolerance
SANSA: Read Write Layer
5
SANSA: Read Write Layer
◎ Ingest RDF and OWL data in different formats using
Jena / OWL API style interfaces
◎ Goal: Represent/distribute data in multiple formats
(e.g. RDD, Data Frames, GraphX, Tensors)
◎ Goal: Allow transformation among these formats
◎ Compute dataset statistics and apply functions to URIs,
literals, subjects, objects → Distributed LODStats
6
SANSA: Query Layer
7
SANSA: Query Layer
◎ Improve performance of generic queries via:
o Intelligent indexing
o Splitting strategies
o Distributed Storage
◎ SPARQL-to-SQL approaches, Virtual Views
◎ Next Step: SPARQL query engine evaluation
◎ SANSA provides a W3C SPARQL compliant endpoint
8
SANSA: Inference Layer
9
SANSA: Inference Layer
◎ W3C Standards for Modelling: RDFS and OWL
◎ Parallel in-memory inference via rule-based forward
chaining
◎ Beyond state of the art: dynamically build a rule
dependency graph for a rule set
◎ → Adjustable performance levels
10
SANSA: ML Layer
11
SANSA: ML Layer
◎ Distributed Machine Learning (ML) algorithms that work
on RDF data and make use of its structure / semantics
◎ Work in Progress:
o Tensor Factorization for e.g. KB completion (testing stage)
o Graph Clustering (testing stage)
o Association rule mining (evaluation stage)
o Semantic Decision trees (idea stage)
o Inference in Knowledge Graph Embeddings (idea stage)
12
SANSA Status and Releases
◎ A generic stack for (big) Linked Data
o Build on top of a state-of-the-art distributed frameworks (Spark, Flink)
o Easy to use: just add the dependencies to your existing Spark or Flink
project
◎ Out-of-the-box framework for (1) querying, (2) inference
and (3) machine learning over RDF datasets
◎ 0.1 Release in December
◎ Subsequent releases every 6 months
13
Semantic Data Lake (Ontario)
◎ Data Lake or Swamp?
o Repository of data in its original formats
o Structured, semi-structured, unstructured
o Without unified schema
◎ Semantic Data Lake
o Add a Semantic Layer on top of the source datasets
❖ The data is semantically lifted using ontology
terms
❖ Provide a uniform view over nonuniform data
14
Semantic Data Lake (Ontario)
◎ A SPARQL query is decomposed into sub-queries
o Keeping links between sub-queries to collect data
returned later from each sub-query
◎ Data Lake is checked for candidate data sources that
can answer each sub-query (consulting metadata)
◎ Each SPARQL sub-query is translated to a query on the
query language of the selected candidate data source
o Examples: SPARQL to SQL, or SPARQL to CQL
◎ Sub-resultes are joined together to form the end result
15
Metadata
property -> data source (type)
Semantic Data Lake (Ontario)
16
Decomposing
User QuerySPARQL query
Database
XML
File
?item gho:Country ?country
.
?item gho:Disease ?disease
.
...
SELECT country, disease,
... FROM Observations
Finding Relevant Data Sources
+ Converting Queries
SQL XPathSQL
MongoDB
JSON
Path
SQL
XML
MongoDB
Semantic Data Lake (Ontario)
17
Database
XML
File
Results Reconciliation
Execution Plan
Links between sub-queries
MongoDB
Final Results
Big Data Europe Integrator Platform Launch
Wednesday 3 May @ 15:00 CEST
Please type your questions at any time. Q&A will follow the presentations
Backup Slides
SANSA: Motivation
20
◎ Over the last years, the size of the Semantic Web has
increased and several large-scale datasets were published.
Source: LOD-Cloud (http://lod-cloud.net/ )
◎ Now days hadoop ecosystem has become a standard for
BigData applications.
◎ We use this infrastructure for Semantic Web as well.
Thank You
Thank You
@Gezim_Sejdiu
http://sda.cs.uni-bonn.de/gezim-sejdiu/
SANSA
Semantic Analytics Stack
https://github.com/SANSA-Stack
● SANSA-RDF
● SANSA-OWL
● SANSA-Query
● SANSA-Inference
● SANSA-ML
● SANSA-Examples
SANSA Planning
◎ Current state:
o SANSA 0.1 released in December
o Can read RDF/OWL files, compute statistics, simple
queries, lightweight inference, graph clustering, rule
mining
◎ Subsequent releases every 6 months
o SANSA 0.2 release planned in June 2017.
23
Conclusions and Next steps
◎ A generic stack for (big) Linked Data
o Build on top of a state-of-the-art distributed frameworks (Spark, Flink)
◎ Out-of-the-box framework for scalable and distributed
semantic data analysis combining semantic web and
distributed machine learning for (1) querying, (2)
inference and (3) analytics of RDF datasets.
◎ Next steps
o Refinement of data structures (RDF/OWL Layer)
o Add support for SPARQL 1.1 and other backend strategies (Query
Layer)
o Define a ML pipelines for Structured ML (ML Layer).
24
BDE & General Integration
◎ SANSA = Scala / Maven Repositories based on Spark /
Flink
◎ Easy to include both in BDE platform and any Spark /
Flink environment
25

Release webinar: Sansa and Ontario

  • 1.
    Distributed Knowledge Graph Processingin SANSA and Ontario Prof. Dr. Jens Lehmann Big Data Europe Platform ReleaseMay 3rd 2017
  • 2.
    SANSA: Motivation ◎ Abundantmachine readable structured information is available (e.g. in RDF) o Across BDE Societal Challenges, e.g. for Life Science Data o General: DBpedia, Google knowledge graph o Social graphs: Facebook, Twitter ◎ Need for scalable querying, inference and machine learning o Link prediction o Knowledge base completion o Predictive analytics o Forward chaining inference 2
  • 4.
    SANSA Stack 4 ◎ SANSAincludes several libraries: o Read / Write RDF / OWL library o Querying library o Inference library o ML- Machine Learning core library http://sansa-stack.net/ ◎ Framework for distributed RDF dataset processing aiming at high scalability and fault tolerance
  • 5.
  • 6.
    SANSA: Read WriteLayer ◎ Ingest RDF and OWL data in different formats using Jena / OWL API style interfaces ◎ Goal: Represent/distribute data in multiple formats (e.g. RDD, Data Frames, GraphX, Tensors) ◎ Goal: Allow transformation among these formats ◎ Compute dataset statistics and apply functions to URIs, literals, subjects, objects → Distributed LODStats 6
  • 7.
  • 8.
    SANSA: Query Layer ◎Improve performance of generic queries via: o Intelligent indexing o Splitting strategies o Distributed Storage ◎ SPARQL-to-SQL approaches, Virtual Views ◎ Next Step: SPARQL query engine evaluation ◎ SANSA provides a W3C SPARQL compliant endpoint 8
  • 9.
  • 10.
    SANSA: Inference Layer ◎W3C Standards for Modelling: RDFS and OWL ◎ Parallel in-memory inference via rule-based forward chaining ◎ Beyond state of the art: dynamically build a rule dependency graph for a rule set ◎ → Adjustable performance levels 10
  • 11.
  • 12.
    SANSA: ML Layer ◎Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics ◎ Work in Progress: o Tensor Factorization for e.g. KB completion (testing stage) o Graph Clustering (testing stage) o Association rule mining (evaluation stage) o Semantic Decision trees (idea stage) o Inference in Knowledge Graph Embeddings (idea stage) 12
  • 13.
    SANSA Status andReleases ◎ A generic stack for (big) Linked Data o Build on top of a state-of-the-art distributed frameworks (Spark, Flink) o Easy to use: just add the dependencies to your existing Spark or Flink project ◎ Out-of-the-box framework for (1) querying, (2) inference and (3) machine learning over RDF datasets ◎ 0.1 Release in December ◎ Subsequent releases every 6 months 13
  • 14.
    Semantic Data Lake(Ontario) ◎ Data Lake or Swamp? o Repository of data in its original formats o Structured, semi-structured, unstructured o Without unified schema ◎ Semantic Data Lake o Add a Semantic Layer on top of the source datasets ❖ The data is semantically lifted using ontology terms ❖ Provide a uniform view over nonuniform data 14
  • 15.
    Semantic Data Lake(Ontario) ◎ A SPARQL query is decomposed into sub-queries o Keeping links between sub-queries to collect data returned later from each sub-query ◎ Data Lake is checked for candidate data sources that can answer each sub-query (consulting metadata) ◎ Each SPARQL sub-query is translated to a query on the query language of the selected candidate data source o Examples: SPARQL to SQL, or SPARQL to CQL ◎ Sub-resultes are joined together to form the end result 15
  • 16.
    Metadata property -> datasource (type) Semantic Data Lake (Ontario) 16 Decomposing User QuerySPARQL query Database XML File ?item gho:Country ?country . ?item gho:Disease ?disease . ... SELECT country, disease, ... FROM Observations Finding Relevant Data Sources + Converting Queries SQL XPathSQL MongoDB JSON Path SQL XML MongoDB
  • 17.
    Semantic Data Lake(Ontario) 17 Database XML File Results Reconciliation Execution Plan Links between sub-queries MongoDB Final Results
  • 18.
    Big Data EuropeIntegrator Platform Launch Wednesday 3 May @ 15:00 CEST Please type your questions at any time. Q&A will follow the presentations
  • 19.
  • 20.
    SANSA: Motivation 20 ◎ Overthe last years, the size of the Semantic Web has increased and several large-scale datasets were published. Source: LOD-Cloud (http://lod-cloud.net/ ) ◎ Now days hadoop ecosystem has become a standard for BigData applications. ◎ We use this infrastructure for Semantic Web as well.
  • 21.
  • 22.
    Thank You @Gezim_Sejdiu http://sda.cs.uni-bonn.de/gezim-sejdiu/ SANSA Semantic AnalyticsStack https://github.com/SANSA-Stack ● SANSA-RDF ● SANSA-OWL ● SANSA-Query ● SANSA-Inference ● SANSA-ML ● SANSA-Examples
  • 23.
    SANSA Planning ◎ Currentstate: o SANSA 0.1 released in December o Can read RDF/OWL files, compute statistics, simple queries, lightweight inference, graph clustering, rule mining ◎ Subsequent releases every 6 months o SANSA 0.2 release planned in June 2017. 23
  • 24.
    Conclusions and Nextsteps ◎ A generic stack for (big) Linked Data o Build on top of a state-of-the-art distributed frameworks (Spark, Flink) ◎ Out-of-the-box framework for scalable and distributed semantic data analysis combining semantic web and distributed machine learning for (1) querying, (2) inference and (3) analytics of RDF datasets. ◎ Next steps o Refinement of data structures (RDF/OWL Layer) o Add support for SPARQL 1.1 and other backend strategies (Query Layer) o Define a ML pipelines for Structured ML (ML Layer). 24
  • 25.
    BDE & GeneralIntegration ◎ SANSA = Scala / Maven Repositories based on Spark / Flink ◎ Easy to include both in BDE platform and any Spark / Flink environment 25

Editor's Notes

  • #3 The traditional Machine learning methods operate on simple feature vector based input and lack expressive and understandable outcomes. The methods like statistical relational learning and Inductive logic are expressive but lack scalability
  • #5 Goal: Build a semantic analytics stack which allows you to perform distributed: Querying Inference Analytics on RDF datasets Existing Open Source libraries either: Lack Community Lack Documentation and Examples Lack Scalability Or are research-oriented Working: A Library built on top of Spark and Flink Consists of several APIs Read / Write: RDF / OWL API for RDF/OWL operations, Querying API: supports a query language on top of distributed RDF library, Inference API : Implementation of a rule based reasoner using the querying API ML- Machine Learning Core Library