Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Release webinar: Sansa and Ontario


Published on

Jens Lehmann's overview of the use of semantics in the Big Data Europe Integrator Platform. Including the Semantic Data Lake (Ontario), and the SANSA Analytics Engine.

Published in: Data & Analytics
  • Be the first to comment

Release webinar: Sansa and Ontario

  1. 1. Distributed Knowledge Graph Processing in SANSA and Ontario Prof. Dr. Jens Lehmann Big Data Europe Platform ReleaseMay 3rd 2017
  2. 2. SANSA: Motivation ◎ Abundant machine readable structured information is available (e.g. in RDF) o Across BDE Societal Challenges, e.g. for Life Science Data o General: DBpedia, Google knowledge graph o Social graphs: Facebook, Twitter ◎ Need for scalable querying, inference and machine learning o Link prediction o Knowledge base completion o Predictive analytics o Forward chaining inference 2
  3. 3. SANSA Stack 4 ◎ SANSA includes several libraries: o Read / Write RDF / OWL library o Querying library o Inference library o ML- Machine Learning core library ◎ Framework for distributed RDF dataset processing aiming at high scalability and fault tolerance
  4. 4. SANSA: Read Write Layer 5
  5. 5. SANSA: Read Write Layer ◎ Ingest RDF and OWL data in different formats using Jena / OWL API style interfaces ◎ Goal: Represent/distribute data in multiple formats (e.g. RDD, Data Frames, GraphX, Tensors) ◎ Goal: Allow transformation among these formats ◎ Compute dataset statistics and apply functions to URIs, literals, subjects, objects → Distributed LODStats 6
  6. 6. SANSA: Query Layer 7
  7. 7. SANSA: Query Layer ◎ Improve performance of generic queries via: o Intelligent indexing o Splitting strategies o Distributed Storage ◎ SPARQL-to-SQL approaches, Virtual Views ◎ Next Step: SPARQL query engine evaluation ◎ SANSA provides a W3C SPARQL compliant endpoint 8
  8. 8. SANSA: Inference Layer 9
  9. 9. SANSA: Inference Layer ◎ W3C Standards for Modelling: RDFS and OWL ◎ Parallel in-memory inference via rule-based forward chaining ◎ Beyond state of the art: dynamically build a rule dependency graph for a rule set ◎ → Adjustable performance levels 10
  10. 10. SANSA: ML Layer 11
  11. 11. SANSA: ML Layer ◎ Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics ◎ Work in Progress: o Tensor Factorization for e.g. KB completion (testing stage) o Graph Clustering (testing stage) o Association rule mining (evaluation stage) o Semantic Decision trees (idea stage) o Inference in Knowledge Graph Embeddings (idea stage) 12
  12. 12. SANSA Status and Releases ◎ A generic stack for (big) Linked Data o Build on top of a state-of-the-art distributed frameworks (Spark, Flink) o Easy to use: just add the dependencies to your existing Spark or Flink project ◎ Out-of-the-box framework for (1) querying, (2) inference and (3) machine learning over RDF datasets ◎ 0.1 Release in December ◎ Subsequent releases every 6 months 13
  13. 13. Semantic Data Lake (Ontario) ◎ Data Lake or Swamp? o Repository of data in its original formats o Structured, semi-structured, unstructured o Without unified schema ◎ Semantic Data Lake o Add a Semantic Layer on top of the source datasets ❖ The data is semantically lifted using ontology terms ❖ Provide a uniform view over nonuniform data 14
  14. 14. Semantic Data Lake (Ontario) ◎ A SPARQL query is decomposed into sub-queries o Keeping links between sub-queries to collect data returned later from each sub-query ◎ Data Lake is checked for candidate data sources that can answer each sub-query (consulting metadata) ◎ Each SPARQL sub-query is translated to a query on the query language of the selected candidate data source o Examples: SPARQL to SQL, or SPARQL to CQL ◎ Sub-resultes are joined together to form the end result 15
  15. 15. Metadata property -> data source (type) Semantic Data Lake (Ontario) 16 Decomposing User QuerySPARQL query Database XML File ?item gho:Country ?country . ?item gho:Disease ?disease . ... SELECT country, disease, ... FROM Observations Finding Relevant Data Sources + Converting Queries SQL XPathSQL MongoDB JSON Path SQL XML MongoDB
  16. 16. Semantic Data Lake (Ontario) 17 Database XML File Results Reconciliation Execution Plan Links between sub-queries MongoDB Final Results
  17. 17. Big Data Europe Integrator Platform Launch Wednesday 3 May @ 15:00 CEST Please type your questions at any time. Q&A will follow the presentations
  18. 18. Backup Slides
  19. 19. SANSA: Motivation 20 ◎ Over the last years, the size of the Semantic Web has increased and several large-scale datasets were published. Source: LOD-Cloud ( ) ◎ Now days hadoop ecosystem has become a standard for BigData applications. ◎ We use this infrastructure for Semantic Web as well.
  20. 20. Thank You
  21. 21. Thank You @Gezim_Sejdiu SANSA Semantic Analytics Stack ● SANSA-RDF ● SANSA-OWL ● SANSA-Query ● SANSA-Inference ● SANSA-ML ● SANSA-Examples
  22. 22. SANSA Planning ◎ Current state: o SANSA 0.1 released in December o Can read RDF/OWL files, compute statistics, simple queries, lightweight inference, graph clustering, rule mining ◎ Subsequent releases every 6 months o SANSA 0.2 release planned in June 2017. 23
  23. 23. Conclusions and Next steps ◎ A generic stack for (big) Linked Data o Build on top of a state-of-the-art distributed frameworks (Spark, Flink) ◎ Out-of-the-box framework for scalable and distributed semantic data analysis combining semantic web and distributed machine learning for (1) querying, (2) inference and (3) analytics of RDF datasets. ◎ Next steps o Refinement of data structures (RDF/OWL Layer) o Add support for SPARQL 1.1 and other backend strategies (Query Layer) o Define a ML pipelines for Structured ML (ML Layer). 24
  24. 24. BDE & General Integration ◎ SANSA = Scala / Maven Repositories based on Spark / Flink ◎ Easy to include both in BDE platform and any Spark / Flink environment 25