Daniel Janke and Steffen Staab. Tutorial at the Reasoning Web Summer School
With the proliferation of semantic data, there is a need to cope with trillions of triples by horizontally scaling data management in the cloud. To this end, one needs to advance (i) strategies for data placement over compute and storage nodes, (ii) strategies for distributed query processing, and (iii) strategies for handling failures of compute and storage nodes. In this tutorial, we review these challenges and how research and development have addressed them over the last 15 years.
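As a hedged illustration of the first strategy, the sketch below hash-partitions triples by subject, one common placement scheme from the literature; the ex:/foaf: prefixes and the choice of MD5 are illustrative assumptions, not the tutorial's prescribed method.

```python
import hashlib

def node_for(triple, num_nodes):
    """Hash-partition a triple by its subject so that all triples about one
    resource land on the same compute node; star-shaped query patterns over
    a common subject can then be answered without shipping data between nodes."""
    subject = triple[0]
    digest = hashlib.md5(subject.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
]
for t in triples:
    print(node_for(t, num_nodes=4), t)
```

Subject hashing keeps subject-star queries local, at the price of network traffic for path-shaped queries; that trade-off is exactly what the placement strategies in the tutorial negotiate.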
This document describes Doc2Graph, an open source tool that transforms JSON documents into a graph database. It discusses how Doc2Graph works, including converting JSON trees into a graph and reusing existing nodes. It also provides examples of using Doc2Graph with CouchbaseDB, MongoDB, and the Spotify API to import music data into Neo4j. The document concludes with information on Doc2Graph's configuration options.
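The following is a minimal sketch of the general idea of converting a JSON tree into nodes and edges; it is not Doc2Graph's actual code, and the node labels and edge naming are illustrative assumptions.

```python
import itertools
import json

_ids = itertools.count()

def json_to_graph(value, nodes, edges, parent=None, key=None):
    """Recursively turn a JSON tree into (id, label) nodes and
    (parent_id, key, child_id) edges: objects and arrays become
    inner nodes, scalar values become leaf nodes."""
    node_id = next(_ids)
    label = type(value).__name__ if isinstance(value, (dict, list)) else repr(value)
    nodes.append((node_id, label))
    if parent is not None:
        edges.append((parent, key, node_id))
    if isinstance(value, dict):
        for k, v in value.items():
            json_to_graph(v, nodes, edges, parent=node_id, key=k)
    elif isinstance(value, list):
        for i, v in enumerate(value):
            json_to_graph(v, nodes, edges, parent=node_id, key=str(i))

nodes, edges = [], []
doc = json.loads('{"album": {"title": "Dummy", "tracks": ["Mysterons", "Roads"]}}')
json_to_graph(doc, nodes, edges)
print(edges)  # e.g. (0, 'album', 1), (1, 'title', 2), (1, 'tracks', 3), ...
```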
Integration of Data Ninja services with Oracle Spatial and Graph, by Data Ninja API
Data Ninja Services provides a set of cloud-based APIs that can extract entities and their relationships from document texts and produce RDF triples that can be loaded seamlessly into Oracle Spatial and Graph. The risk analysis case study, based on the Zika virus, combines actionable insights from Oracle with the semantic content produced by the Data Ninja services.
Linked data demystified: Practical efforts to transform CONTENTdm metadata int..., by Cory Lampert
This document outlines a presentation about transforming metadata from a CONTENTdm digital collection into linked data. It discusses the concepts of linked data, including defining linked data, linked data principles, technologies and standards. It then explains how these concepts can be applied to digital collection records, including anticipated challenges working with CONTENTdm. The document describes a linked data project at UNLV Libraries to transform collection records into linked data and publish it on the linked data cloud. It provides tips for creating metadata that is more suitable for linked data.
Semantic Technologies and Triplestores for Business Intelligence, by Marin Dimitrov
This document provides an introduction to semantic technologies and triplestores. It discusses the Semantic Web vision of making data on the web more accessible and linked. Key concepts covered include RDF, ontologies, OWL, SPARQL and Linked Data. It also introduces triplestores as RDF databases for storing and querying semantic data and compares their features to traditional databases.
The document discusses enabling live linked data by synchronizing semantic data stores with commutative replicated data types (CRDTs). CRDTs allow for massive optimistic replication while preserving convergence and intentions. The approach aims to complement the linked open data cloud by making linked data writable through a social network of data participants that follow each other's update streams. This would enable a "read/write" semantic web and transition linked data from version 1.0 to 2.0.
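For intuition, here is a minimal grow-only set CRDT holding RDF triples; it is a deliberately simple stand-in for the synchronization machinery the document describes (real systems also need removals and causality tracking).

```python
class GSet:
    """Grow-only set CRDT: merge is set union, which is commutative,
    associative, and idempotent, so replicas converge no matter in
    which order they exchange updates."""

    def __init__(self):
        self.items = set()

    def add(self, element):
        self.items.add(element)

    def merge(self, other):
        self.items |= other.items

# two data stores accept writes independently, then follow each other's streams
a, b = GSet(), GSet()
a.add(("ex:alice", "foaf:knows", "ex:bob"))
b.add(("ex:bob", "foaf:name", "Bob"))
a.merge(b)
b.merge(a)
assert a.items == b.items  # convergence regardless of exchange order
```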
The document discusses linked data, ontologies, and inference. It provides examples of using RDFS and OWL to infer new facts from schemas and ontologies. Key points include:
- Linked Data uses URIs and HTTP to identify things and provide useful information about them via standards like RDF and SPARQL.
- Projects like LOD aim to develop best practices for publishing interlinked open datasets. FactForge and LinkedLifeData are examples that contain billions of statements across life science and general knowledge datasets.
- RDFS and OWL allow defining schemas and ontologies that enable inferring new facts through reasoning. Constructs like rdfs:domain and rdfs:range allow inferring type information for the subjects and objects of triples.
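As a small worked example of the rdfs:domain and rdfs:range rules, the sketch below applies one forward-chaining pass with rdflib; the ex: vocabulary is made up for illustration.

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# schema: whoever writes is an Author, whatever is written is a Book
g.add((EX.wrote, RDFS.domain, EX.Author))
g.add((EX.wrote, RDFS.range, EX.Book))

# data: one plain fact, with no explicit types
g.add((EX.alice, EX.wrote, EX.theHobbit))

# one forward-chaining pass over the rdfs:domain / rdfs:range rules
for s, p, o in list(g):
    for cls in g.objects(p, RDFS.domain):
        g.add((s, RDF.type, cls))
    for cls in g.objects(p, RDFS.range):
        g.add((o, RDF.type, cls))

print(g.value(EX.alice, RDF.type))      # -> http://example.org/Author
print(g.value(EX.theHobbit, RDF.type))  # -> http://example.org/Book
```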
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop..., by Simplilearn
This presentation about Hadoop for beginners will help you understand what Hadoop is, why Hadoop, what Hadoop HDFS, Hadoop MapReduce, and Hadoop YARN are, a use case of Hadoop, and finally a demo on HDFS (Hadoop Distributed File System), MapReduce, and YARN. Big Data is a massive amount of data that cannot be stored, processed, and analyzed using traditional systems. To overcome this problem, we use Hadoop. Hadoop is a framework that stores and handles Big Data in a distributed and parallel fashion, and thereby overcomes the challenges of Big Data. Hadoop has three components: HDFS, MapReduce, and YARN. HDFS is the storage unit of Hadoop, MapReduce is its processing unit, and YARN is its resource management unit. In this video, we will look into these units individually and also see a demo of each; a minimal word-count sketch in the Hadoop Streaming style follows the topic list below.
Below topics are explained in this Hadoop presentation:
1. What is Hadoop
2. Why Hadoop
3. Big Data generation
4. Hadoop HDFS
5. Hadoop MapReduce
6. Hadoop YARN
7. Use of Hadoop
8. Demo on HDFS, MapReduce and YARN
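As a concrete taste of the MapReduce part of the demo, here is a hedged word-count sketch in the Hadoop Streaming style; the demo in the video itself may be organized differently. The mapper emits a count of 1 per word:

```python
#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

Hadoop sorts the mapper output by key between the two phases, so the reducer can sum runs of equal words:

```python
#!/usr/bin/env python3
# reducer.py - input arrives sorted by word; sum the counts per word
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{total}")
        total = 0
    current_word = word
    total += int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")
```

Both scripts run under the Hadoop Streaming jar, typically along the lines of hadoop jar hadoop-streaming.jar -input books -output counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py; the exact jar path and flags vary by Hadoop version and distribution.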
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem, such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop..., by Simplilearn
This presentation about Hadoop will help you understand what Big Data is, what Hadoop is, how Hadoop came into existence, and what the various components of Hadoop are, with an explanation of a Hadoop use case. A lot of data is generated every day, and this massive amount of data cannot be stored, processed, and analyzed in the traditional ways. That is why Hadoop came into existence as a solution for Big Data. Hadoop is a framework that manages Big Data storage in a distributed way and processes it in parallel. Now, let us get started and understand the importance of Hadoop and why we actually need it.
Below topics are explained in this Hadoop presentation:
1. The rise of Big Data
2. What is Big Data?
3. Big Data and its challenges
4. Hadoop as a solution
5. What is Hadoop?
6. Components of Hadoop
7. Use case of Hadoop
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem, such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
The nature.com ontologies portal: nature.com/ontologies, by Tony Hammond
Presentation by Tony Hammond and Michele Pasin to Linked Science workshop, co-located with International Semantic Web Conference (ISWC) 2015, on October 12, 2015
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo..., by Simplilearn
The document provides information about Hadoop training. It discusses the need for Hadoop in today's data-heavy world. It then describes what Hadoop is, its ecosystem including HDFS for storage and MapReduce for processing. It also discusses YARN and provides a bank use case. It further explains the architecture and working of HDFS and MapReduce in processing large datasets in parallel across clusters.
Have you been in the situation where you’re about to start a new project and ask yourself, what’s the right tool for the job here? I’ve been in that situation many times and thought it might be useful to share with you a recent project we did and why we selected Spark, Python, and Parquet. My plan is to take you through a use case that involves loading, transforming, aggregating, and persisting the dataset. We’ll use an open dataset consisting of full fund holdings graciously provided by Morningstar. My goals in presenting this use case are to have the audience learn how these technologies can be applied to a real-world problem and to inspire members of the audience to start learning these technologies and applying them to their own projects.
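In the same spirit, here is a hedged PySpark sketch of the load-transform-aggregate-persist flow; the file name and column names are invented for illustration and do not reflect the Morningstar schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fund-holdings").getOrCreate()

# load: raw CSV with a header row (illustrative path and columns)
holdings = spark.read.csv("holdings.csv", header=True, inferSchema=True)

# transform + aggregate: total market value per fund
totals = (holdings
          .withColumn("market_value", F.col("shares") * F.col("price"))
          .groupBy("fund_id")
          .agg(F.sum("market_value").alias("total_value"))
          .orderBy(F.desc("total_value")))

# persist: columnar Parquet is compact and fast to re-read
totals.write.mode("overwrite").parquet("fund_totals.parquet")
```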
This document provides an overview and objectives of a Python course for big data analytics. It discusses why Python is well-suited for big data tasks due to its libraries like PyDoop and SciPy. The course includes demonstrations of web scraping using Beautiful Soup, collecting tweets using APIs, and running word count on Hadoop using Pydoop. It also discusses how Python supports key aspects of data science like accessing, analyzing, and visualizing large datasets.
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop, by IJTET Journal
This document discusses approaches for mining frequent item sets on Apache Hadoop. It begins with an introduction to data mining and association rule mining. Association rule mining involves finding frequent item sets, which are items that frequently occur together. Apache Hadoop is then introduced as a framework for distributed processing of large datasets. Several algorithms for mining frequent item sets are discussed, including Apriori, FP-Growth, and H-mine. These algorithms differ in how they generate and count candidate item sets. The document then discusses how these algorithms can be implemented on Hadoop to take advantage of its distributed and parallel processing abilities in order to efficiently mine frequent item sets from large datasets.
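To make the Apriori idea concrete, below is a minimal single-machine sketch of level-wise candidate generation and counting; the Hadoop variants in the survey distribute exactly this counting step. The baskets and threshold are made up.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: count k-itemsets, keep the
    frequent ones, join them into (k+1)-candidates, repeat."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while candidates:
        # count support of every candidate in one full pass over the data
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # join step: merge two frequent k-itemsets that differ in one item
        keys = list(survivors)
        candidates = {a | b for a, b in combinations(keys, 2)
                      if len(a | b) == len(a) + 1}
    return frequent

baskets = [{"bread", "milk"}, {"bread", "beer"},
           {"bread", "milk", "beer"}, {"milk", "beer"}]
for itemset, support in apriori(baskets, min_support=2).items():
    print(set(itemset), support)
```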
Webinar: Talend: The Non-Programmer's Swiss Knife for Big Data, by Edureka!
Talend Open Studio (TOS) is a wonderful open source Data Integration (DI) tool used to build end-to-end ETL solutions. This course will not only help beginners understand the art of data integration but also equip them with Big Data skills in a smart way. It also aims to educate you about Big Data through Talend's powerful product "Talend for Big Data" (the first Hadoop-based data integration platform). The topics covered in the presentation are:
1. Why ETL is still essential and why the arrival of Big Data does not spell the doom of the ETL era
2. How and why to do ETL using Talend
3. How Talend complements the Hadoop ecosystem and adapts to the ETL-Big Data industry
4. Learn Big Data not in months but in minutes! Sounds too good?
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S..., by Gezim Sejdiu
Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies.
A major and yet unsolved challenge that research faces today is to perform scalable analysis of large scale knowledge graphs in order to facilitate applications like link prediction, knowledge base completion, and question answering.
Most machine learning approaches that scale horizontally (i.e., can be executed in a distributed environment) work on simpler feature-vector-based input rather than on more expressive knowledge structures.
On the other hand, the learning methods that exploit expressive structures, e.g. Statistical Relational Learning and Inductive Logic Programming approaches, usually do not scale well to very large knowledge bases owing to their computational complexity.
This talk gives an overview of the ongoing project Semantic Analytics Stack (SANSA) which aims to bridge this research gap by creating an out of the box library for scalable, in-memory, structured learning.
Microtask Crowdsourcing Applications for Linked Data, by EUCLID project
This document discusses using microtask crowdsourcing to enhance linked data applications. It describes how crowdsourcing can be used in various components of the linked data integration process, including data cleansing, vocabulary mapping, and entity interlinking. Specific crowdsourcing applications and systems are discussed that address tasks like assessing the quality of DBpedia triples, entity linking with ZenCrowd, and understanding natural language queries with CrowdQ. The results show that crowdsourcing can often improve the results of automated techniques for various linked data tasks and help integrate and enhance large linked data sources.
IPython Notebook as a Unified Data Science Interface for Hadoop, by DataWorks Summit
While Hadoop is great for data transformation, it poses challenges for data science. The document discusses how a unified environment using Apache Spark, PySpark, SparkSQL and IPython Notebook can overcome these challenges by providing a single environment for local and distributed processing using popular data science languages, strong SQL integration, and the ability to visualize and report results. The document demonstrates this environment by exploring an open payments dataset between doctors/hospitals and healthcare companies using Python and Spark in IPython Notebook on Hadoop.
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ..., by Simplilearn
This video on Hadoop interview questions, part 1, will take you through the general Hadoop questions and questions on HDFS, MapReduce, and YARN, which are very likely to be asked in any Hadoop interview. It covers all the topics on the major components of Hadoop. This Hadoop tutorial will give you an idea about the different scenario-based questions you could face, and some multiple-choice questions as well. Now, let us dive into this Hadoop interview questions video and gear up for your next Hadoop interview.
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem, such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Usage of Linked Data: Introduction and Application Scenarios, by EUCLID project
This presentation introduces the main principles of Linked Data, the underlying technologies and background standards. It provides basic knowledge of how data can be published over the Web, how it can be queried, and what the possible use cases and benefits are. As an example, we use the development of a music portal (based on the MusicBrainz dataset), which facilitates access to a wide range of information and multimedia resources relating to music.
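A tiny hedged example of the kind of query such a portal runs: a SPARQL SELECT over an in-memory rdflib graph using the Music Ontology and FOAF vocabularies; the artist data here is invented, not taken from MusicBrainz.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
MO = Namespace("http://purl.org/ontology/mo/")     # Music Ontology
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
g.add((EX.band1, RDF.type, MO.MusicArtist))
g.add((EX.band1, FOAF.name, Literal("Portishead")))

results = g.query("""
    PREFIX mo:   <http://purl.org/ontology/mo/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name WHERE {
        ?artist a mo:MusicArtist ;
                foaf:name ?name .
    }
""")
for row in results:
    print(row.name)   # -> Portishead
```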
This document discusses big data, Hadoop, data science, and why Hadoop is useful for data science. It begins with defining big data and the 3 V's of big data. It then explains what Hadoop is and how it works using HDFS for storage and MapReduce for processing. The document defines what a data product is and provides examples. It defines data science as extracting meaning from data and building data products. Finally, it argues that Hadoop is useful for data science because it allows exploration of full datasets, mining of larger datasets, large-scale data preparation, and can accelerate data-driven innovation by removing speed barriers of traditional architectures.
Information Extraction and Linked Data Cloud, by Dhaval Thakker
The document discusses Press Association's semantic technology project which aims to generate a knowledge base using information extraction and the Linked Data Cloud. It outlines Press Association's operations and workflow, and how semantic technologies can be used to develop taxonomies, annotate images, and extract entities from captions into an ontology-based knowledge base. The knowledge base can then be populated and interlinked with external datasets from the Linked Data Cloud like DBpedia to provide a comprehensive, semantically-structured source of information.
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |..., by Edureka!
This Edureka "What is Hadoop" tutorial (check our Hadoop blog series here: https://goo.gl/lQKjL8) will help you understand all the basics of Hadoop. Learn in detail about the differences between the traditional and the Hadoop way of storing and processing data. Below are the topics covered in this tutorial:
1) Traditional Way of Processing - SEARS
2) Big Data Growth Drivers
3) Problem Associated with Big Data
4) Hadoop: Solution to Big Data Problem
5) What is Hadoop?
6) HDFS
7) MapReduce
8) Hadoop Ecosystem
9) Demo: Hadoop Case Study - Orbitz
Subscribe to our channel to get updates.
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
Packages for data wrangling (data preprocessing), by Hiroki K
The document discusses several R packages for data wrangling (preprocessing) tasks. It provides a table with information on popular packages like plyr, reshape2, stringr, lubridate, sqldf, dplyr, data.table, and zoo. While dplyr is commonly used, the document focuses on introducing the plyr package, which can still be useful when working with list-type data. Examples show how to use plyr functions like llply and ddply to apply operations to multiple objects or subsets of data.
This document provides an overview of NoSQL schema design and examples using a document database like MongoDB or MapR-DB. It discusses how to model complex, flexible schemas to store object-oriented data like products, users, and music catalog information. Examples show how a music database could be reduced from over 200 tables to just a few collections by embedding objects and references. Flexible schemas in a document database more closely match object models and allow easy evolution of the data model.
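For instance, a single embedded document can stand in for many joined relational rows; the sketch below shows the shape (field names are illustrative, not a prescribed MongoDB or MapR-DB schema).

```python
# one self-contained album document instead of rows spread over
# album, artist, track, and genre tables joined by foreign keys
album = {
    "_id": "album:dummy",
    "title": "Dummy",
    "artist": {"name": "Portishead", "founded": 1991},  # embedded object
    "tracks": [                                         # embedded array
        {"no": 1, "title": "Mysterons", "seconds": 306},
        {"no": 2, "title": "Sour Times", "seconds": 254},
    ],
    "genres": ["trip-hop"],
}

# with pymongo the dict is stored as-is, e.g.:
#   from pymongo import MongoClient
#   MongoClient().music.albums.insert_one(album)
print(album["tracks"][0]["title"])
```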
The LinkedGov extension allows users to clean, enrich, and link public data that exists in spreadsheets and other formats within Google Refine. It provides tools to transform the data into a machine-readable format and then link it to other public datasets. The cleaned and linked data is stored in the LinkedGov database and made available via its question site and public SPARQL endpoint for further analysis and reuse.
Jens Lehmann's overview of the use of semantics in the Big Data Europe Integrator Platform. Including the Semantic Data Lake (Ontario), and the SANSA Analytics Engine.
How Are Graph Databases Used in a Police Department? by Samet KILICTAS
This presentation delivers the basics of the graph concept and graph databases to the audience. It explains how graph databases are used, with sample use cases from industry, and how they can be used by police departments. Questions like "When should I use a graph DB?" and "Should I solve this problem with a graph DB?" are answered.
The document discusses the objectives and outcomes of the FAIRport Skunkworks team so far. The team is exploring existing technologies to build prototype FAIRport code components using existing standards. They aim to enable findable, accessible, interoperable, and reusable data across repositories. However, repositories use different metadata schemas and standards like DCAT in incomplete ways. The team proposes "FAIR Profiles" - a generic way to describe metadata fields and constraints for any repository using a standardized vocabulary and structure. This would enable rich queries across repositories. They define a FAIR Profile Schema to serve as a lightweight meta-meta-descriptor for describing diverse repository metadata schemas in a consistent way.
HKOSCon18 - Chetan Khatri - Scaling TBs of Data with Apache Spark and Scala ..., by Chetan Khatri
This document summarizes a presentation about scaling terabytes of data with Apache Spark and Scala. The key points are:
1) The presenter discusses how to use Apache Spark and Scala to process large-scale data in a distributed manner across clusters. Spark abstractions like RDDs, DataFrames, and Datasets are covered.
2) A case study is presented about reengineering a data processing platform for a retail business to improve performance. Changes included parallelizing jobs, tuning Spark hyperparameters, and building a fast data architecture using Spark, Kafka and data lakes.
3) Performance was improved through techniques like dynamic resource allocation in YARN, reducing memory and cores per executor to better utilize cluster resources, and processing data
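A hedged configuration sketch of the tuning knobs mentioned in point 3, shown here in PySpark for consistency with the other sketches even though the talk itself used Scala; the concrete values are illustrative, and the right numbers depend on cluster size and workload.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("retail-pipeline")
         # let YARN grow and shrink the executor pool with demand
         .config("spark.dynamicAllocation.enabled", "true")
         # external shuffle service is required for dynamic allocation on YARN
         .config("spark.shuffle.service.enabled", "true")
         # smaller executors pack more evenly onto cluster nodes
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         .getOrCreate())
```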
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript..., by datascienceiqss
It would be useful to be able to discover what kinds of data are contained in the myriad general-purpose public data repositories. It would be even better if it were possible to query that data and/or have that data conform to a particular context-dependent data format. This was the ambition of the Data FAIRport project. I will be giving a "strawman" demonstration of a fully functional Data FAIRport, where the meta/data in a public repository can be "projected" into one of a number of different context-dependent formats, such that it can be cross-queried in combination with the (potentially "projected") data from other repositories.
Big data analysis using SparkR, by Dipendra Kusi
SparkR enables large-scale data analysis from R by leveraging Apache Spark's distributed processing capabilities. It allows users to load large datasets from sources like HDFS, run operations like filtering and aggregation in parallel, and build machine learning models like k-means clustering. SparkR also supports data visualization and exploration through packages like ggplot2. By running R programs on Spark, users can analyze datasets that are too large for a single machine.
Enabling Exploratory Analysis of Large Data with Apache Spark and R, by Databricks
R has evolved to become an ideal environment for exploratory data analysis. The language is highly flexible - there is an R package for almost any algorithm and the environment comes with integrated help and visualization. SparkR brings distributed computing and the ability to handle very large data to this list. SparkR is an R package distributed within Apache Spark. It exposes Spark DataFrames, which was inspired by R data.frames, to R. With Spark DataFrames, and Spark’s in-memory computing engine, R users can interactively analyze and explore terabyte size data sets.
In this webinar, Hossein will introduce SparkR and how it integrates the two worlds of Spark and R. He will demonstrate one of the most important use cases of SparkR: the exploratory analysis of very large data. Specifically, he will show how Spark’s features and capabilities, such as caching distributed data and integrated SQL execution, complement R’s great tools such as visualization and diverse packages in a real world data analysis project with big data.
Linked Data for Architecture, Engineering and Construction (AEC), by Stefan Dietze
The document discusses the relationship between building information modeling (BIM) and the semantic web. It provides an introduction to linked data and describes how semantic web technologies can be used to add contextual and background knowledge to BIM data, such as geographical, historical, and statistical information. It also addresses challenges around preserving and maintaining the evolution of linked BIM and architecture data on the semantic web.
Scala: the unpredicted lingua franca for data science, by Andy Petrella
Talk given at Strata London with Dean Wampler (Lightbend) about Scala as the future of data science. The first part traces how Scala became important; the remainder of the talk is in notebooks using the Spark Notebook (http://spark-notebook.io/).
The notebooks are available on GitHub: https://github.com/data-fellas/scala-for-data-science.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ..., by Databricks
Of all the developers’ delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing, to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set, as best practices; 2) its performance and optimization benefits; and 3) scenarios in which to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (This is a vocalization of the blog, along with the latest developments in Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
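The contrast in a few lines of PySpark (a hedged sketch; note that the typed Dataset API exists only in Scala and Java, so Python offers RDDs and DataFrames):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("three-apis").getOrCreate()
sc = spark.sparkContext
rows = [("alice", 3), ("bob", 5), ("alice", 2)]

# RDD: untyped tuples and functional transformations; Spark cannot
# inspect the lambdas, so there is no query optimization
by_user_rdd = sc.parallelize(rows).reduceByKey(lambda a, b: a + b)
print(by_user_rdd.collect())

# DataFrame: named columns and a declarative plan that the Catalyst
# optimizer can rewrite before execution
df = spark.createDataFrame(rows, ["user", "clicks"])
df.groupBy("user").sum("clicks").show()
```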
The document is a presentation by Jongwook Woo from the High-Performance Information Computing Center (HiPIC) at California State University Los Angeles given on February 25, 2017 at the SWRC conference in San Diego, CA. It discusses big data trends with open platforms and provides information on Spark, Hadoop, open data, use cases, and the future of big data. Specifically, it summarizes Jongwook Woo's background and experience, describes what big data is and how Spark improves on Hadoop MapReduce, discusses how Spark can integrate with Hadoop ecosystems, and provides examples of analyzing local business data using Spark.
Enabling exploratory data science with Spark and R, by Databricks
R is a favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases from statistical inference to data visualization. However, handling large datasets with R is challenging, especially when data scientists use R with frameworks or tools written in other languages. In this mode most of the friction is at the interface of R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported in R as native data structures. In this talk we show how SparkR solves these problems to enable a much smoother experience. In this talk we will present an overview of the SparkR architecture, including how data and control is transferred between R and JVM. This knowledge will help data scientists make better decisions when using SparkR. We will demo and explain some of the existing and supported use cases with real large datasets inside a notebook environment. The demonstration will emphasize how Spark clusters, R and interactive notebook environments, such as Jupyter or Databricks, facilitate exploratory analysis of large data.
TileDB webinars - Nov 4, 2021
The document summarizes a webinar about TileDB, a universal data management platform that represents data as dense and sparse multi-dimensional arrays. It addresses the data management problems in population genomics by storing variant call data as 3D sparse arrays. TileDB provides a unified storage and serverless computing model that allows efficient data access and analysis at global scale through its open source TileDB Embedded storage and TileDB Cloud platform. The webinar highlights how TileDB solves data production, distribution, and consumption problems and empowers data sharing and collaboration through its marketplace and security features.
This document discusses demos and tools for linking knowledge discovery (KDD) and linked data. It summarizes several tools that integrate linked data and KDD processes like data preprocessing, mining, and postprocessing. OpenRefine, RapidMiner, R, Matlab, ProLOD++, DL-Learner, Spark, KNIME, and Gephi were highlighted as tools that support tasks like enriching data, running SPARQL queries, loading RDF data, and visualizing linked data. The document concludes by asking about gaps and how to increase adoption, noting linked data could benefit KDD with validation, enrichment, and reasoning over semantic web data.
Databases have been around for decades and were highly optimized for data aggregations during that time. Big Data has not only changed the landscape of databases massively in the past years; nowadays we can also find many open source projects among the most popular databases.
After this talk you will be able to decide whether a database can make your work more efficient, and which direction to look in.
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015, by Mark Wilkinson
A discussion and demonstration of a functional Data FAIRport, using W3C's Linked Data Platform, Ruben Verborgh's Linked Data Fragments, and Hydra's hypermedia controlled vocabularies. This is the output of the "Skunkworks" working group of the larger Data FAIRport project (http://datafairport.org).
Rajeev Kumar, Apache Spark & Scala developer, by Rajeev Kumar
Rajeev Kumar is an experienced Apache Spark and Scala developer based in Amsterdam, NL. He has over 8 years of experience working with big data technologies like Apache Spark, Scala, Java, Hadoop, and data integration tools. He is proficient in processing large structured and unstructured datasets to identify patterns and gain insights. His experience includes designing and developing Spark applications using Scala, ETL processes, data warehousing, and working with technologies like Hive, HDFS, MapReduce, Sqoop, Kafka and more.
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-..., by Steffen Staab
Data spaces in distributed environments should be allowed to evolve in agile ways, giving data space owners great flexibility about which data they store. Agility and heterogeneity, however, jeopardize data exchange, because representations may build on varying ontologies and data consumers may not be able to rely on the semantic correctness of their queries in the context of semantically heterogeneous, evolving data spaces. Graph data spaces are one example of a powerful model for representing and querying data whose semantics may change over time. To assert and enforce conditions on individual graph data spaces, shape languages (e.g., SHACL) have been developed. We investigate the question of how querying and programming can be guarded by reasoning over SHACL constraints in a distributed setting, and we sketch how a future landscape based on semantically heterogeneous data spaces might look.
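As a minimal taste of SHACL-based guarding, the sketch below validates a small data graph against a node shape using pyshacl, one off-the-shelf validator; the data and shape are invented, and the talk's reasoning over constraints for distributed querying goes well beyond this plain conformance check.

```python
from rdflib import Graph
from pyshacl import validate

data = Graph().parse(data="""
    @prefix ex: <http://example.org/> .
    ex:alice a ex:Person .
""", format="turtle")

shapes = Graph().parse(data="""
    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix ex: <http://example.org/> .
    ex:PersonShape a sh:NodeShape ;
        sh:targetClass ex:Person ;
        sh:property [ sh:path ex:name ; sh:minCount 1 ] .
""", format="turtle")

conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)     # False: ex:alice lacks the required ex:name
print(report_text)  # human-readable validation report
```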
Knowledge graphs for knowing more and knowing for sureSteffen Staab
Knowledge graphs have been conceived to collect heterogeneous data and knowledge about large domains, e.g. medical or engineering domains, and to allow versatile access to such collections by means of querying and logical reasoning. A surge of methods has responded to additional requirements in recent years. (i) Knowledge graph embeddings use similarity and analogy of structures to speculatively add to the collected data and knowledge. (ii) Queries with shapes and schema information can be typed to provide certainty about results. We survey both developments and find that the development of techniques happens in disjoint communities that mostly do not understand each other, thus limiting the proper and most versatile use of knowledge graphs.
Symbolic Background Knowledge for Machine LearningSteffen Staab
Machine learning aims at learning complex functions from data. Very often, this challenge remains ill-defined given the available amount of data, however, background knowledge that is available as knowledge graphs, ontologies or symbolic (physical) equations allows for an improved specification of the targeted solution. In this talk, we want to discuss several use cases that include symbolic background knowledge as regularizing priors, as constraints or as other inductive biases into machine learning tasks.
Soziale Netzwerke und Medien: Multi-disziplinäre Ansätze für ein multi-dimens...Steffen Staab
Präsentation von Oul Han und Steffen Staab
Workshop "Soziale Netzwerke und Medien" auf dem Treffen des Fakultätentags Informatik, 14. November 2019, Hamburg
Web Futures: Inclusive, Intelligent, SustainableSteffen Staab
Almost from its very beginning, the Web has been ambivalent.
It has facilitated freedom for information, but this also included the freedom to spread misinformation. It has faciliated intelligent personalization, but at the cost of intrusion into our private lifes. It has included more people than any other system before, but at the risk of exploiting them.
The Web is full of such ambivalences and the usage of artificial intelligences threatens to further amplify these ambivalences. To further the good and to contain the negative consequences, we need a research agenda studying and engineering the Web, as well as numerous activities by societies at large. In this talk, I will present and discuss a joint effort by an interdisciplinary team of Web Scientists to prepare and pursue such an agenda.
This document summarizes Steffen Staab's keynote presentation on eye tracking and web interaction. It discusses how eye tracking can be used to understand how users interact with and understand websites. It presents a framework for discovering active visual stimuli on websites using eye tracking data and machine learning. It also introduces GazeTheWeb, a system that aims to optimize gaze-based interaction with websites by adapting the interaction based on semantic understanding of page elements and dynamics. A lab study found that GazeTheWeb improved task completion times, usability and workload compared to traditional gaze emulation.
Concepts in Application Context ( How we may think conceptually )Steffen Staab
Formal concept analysis (FCA) derives a hierarchy of concepts
in a formal context that relates objects with attributes. This approach is very well aligned with the traditions of Frege, Saussure and Peirce, which relate a signifier (e.g. a word/an attribute) to a mental concept evoked by this word and meant to refer to a specific object in the real world. However, in the practice of natural languages as well as artificial languages (e.g. programming languages), the application context
often constitutes a latent variable that influences the interpretation of a signifier. We present some of our current work that analyzes the usage of words in natural language in varying application contexts as well as the usage of variables in programming languages in varying application contexts in order to provide conceptual constraints on these signifiers.
Talk at Leopoldina Symposium on Digitization and its Effects on Man and Society
(Die Digitalisierung und ihre Auswirkungen auf Mensch und Gesellschaft)
leopoldina.org/de/veranstaltungen/veranstaltung/event/2464/
The document discusses Steffen Staab's presentation on "The Web We Want" at the WebSci '17 conference. It covers several topics related to making the web more inclusive, healthy, and useful. For social inclusion, it describes the MAMEM project which aims to measure how accessible the web is for people with disabilities. For a healthy web, it discusses using techniques from social network analysis to identify harmful roles and behaviors. For a useful semantic web, it presents principles for interlinking data sets in ways that meaningfully extend entity descriptions and connectivity. The overall goal is to engineer and measure how well the web achieves important values like inclusion, health, and usefulness.
This document summarizes a presentation on the next 10 years of Web Science. It discusses social challenges like discrimination and trust, legal challenges regarding regulation and tracking, political challenges from misinformation and participation, and technical challenges from artificial intelligence and security. The presentation outlines the 10 year initiative of the Web Science Network of laboratories and highlights talks from researchers at companies like Google, Facebook, and Stanford. It promotes collaborative projects like the Web Science Observatory and Summer School.
(Semi-)Automatic analysis of online contentsSteffen Staab
How can media and discourse analyses combine approaches from humanities and statistical methods to deeply analyse large amounts of online contents.
Invited talk at Fachgruppen-Workshop der Deutschen Gesellschaft für Publizistik und Kommunikationswissenschaft
Soziale Medien – Echo-Kammer oder öffentlicher Raum?
Ansätze zur computergestützten Analyse von Internet-Korpora
6. Oktober 2016, Karlsruher Institut für Technologie (KIT)
Joint Keynote at Int. Conference on Knowledge Engineering and Semantic Web and Prague Computer Science Seminar, Prague, September 22, 2016
The challenges of Big Data are frequently explained by dealing with Volume, Velocity, Variety and Veracity. The large variety of data in organizations results from accessing different information systems with heterogeneous schemata or ontologies. In this talk I will present the research efforts that target the management of such broad data.
They include: (i) an integrated development environment for programming with broad data, (ii) a query language that allows for typing of query results, (iii) a typed lambda-calculus based on description logics, and (iv) efficient access to data repositories via schema indices.
We use metadata of various kind to improve and enrich text document clustering using an extension of Latent Dirichlet Allocation (LDA). The methods are fully implemented, evaluated and software is available on github.
These are the slides of an invited talk I gave September 8 at the Alexandria Workshop of TPDL-2016: http://alexandria-project.eu/events/3rd-workshop/
This document provides an overview of a workshop on web science. It includes an agenda with topics such as an introduction to web science, aspects of the web, observing the web through web observatories, modeling aspects of the web, and the past and future of the web. It also provides details about project work sessions and social events during the workshop. Examples of bias in the web are discussed, such as bias in devices, software, content and data, and social networks. Methods for observing and collecting data from the web are addressed, along with challenges around data collection and publishing.
This document discusses the past 10 years and future of Web Science. It provides an overview of how the Web has evolved from a place to retrieve documents to a platform for coordination, monitoring, delivering services and understanding data. Web Science has progressed from case studies to developing concepts like the "Social Machine" and models of tagging. The document poses questions to a panel of experts about the strengths, weaknesses, opportunities and threats for Web Science over the past 10 years and what the next 10 years may bring.
The document summarizes the closing session of ISWC 2015, including award winners. It lists the winners of the People's Choice Poster Award, People's Choice Demo Award, Best Poster Award, Best Demo Award, Best Applied Paper Award, and Best Research Paper Award. It thanks attendees for their participation at ISWC 2015 and looks forward to ISWC 2016 in Kobe, Japan.
This document provides an overview and schedule for ISWC 2015 held from October 11-15, 2015. It summarizes attendance statistics, the research and applied paper submission and review process, award nominees, and highlights of the program including keynotes, paper sessions, and social events. The general chair is Steffen Staab from the University of Koblenz-Landau and University of Southampton. ISWC 2015 aims to bring together researchers and practitioners in the fields of semantic web and linked data.
What to do when you have a perfect model for your software but you are constrained by an imperfect business model?
This talk explores the challenges of bringing modelling rigour to the business and strategy levels, and talking to your non-technical counterparts in the process.
1. Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
Storing and Querying Semantic Data in the Cloud
Reasoning Web Summer School 2018 (RW 2018)
Daniel Janke & Steffen Staab
24.09.2018
2. Storing and Querying Semantic Data in the Cloud 2Daniel Janke & Steffen Staab
Amount of Available RDF Data Increases
Source: https://lod-cloud.net/
3. Storing and Querying Semantic Data in the Cloud 3Daniel Janke & Steffen Staab
Why use RDF Stores in the Cloud?
Example 1: Wikidata
• Dataset size: 4.9 billion triples (as of April 2018)
• Stored in a distributed BlazeGraph RDF store because of
  – higher query throughput
  – higher availability
Example 2: BBC
• On average 1 million SPARQL queries per day (in 2010)
• Stored in a distributed GraphDB RDF store because of
  – higher query throughput
  – higher availability
4. Storing and Querying Semantic Data in the Cloud 4Daniel Janke & Steffen Staab
Assumptions of this talk
1. There are exceptions for (almost) everything.
2. You are always allowed to ask questions.
3. You have some knowledge:
Required
• RDF
• SPARQL
Helpful
• Cloud processing frameworks like Hadoop or Spark
• Query processing in relational databases
If not → see 2.
[The slide also shows the timeplan of the tutorial.]
5. Storing and Querying Semantic Data in the Cloud 5Daniel Janke & Steffen Staab
How to deal with increasing volume of RDF?
6. Storing and Querying Semantic Data in the Cloud 6Daniel Janke & Steffen Staab
Centralized RDF Stores
• Graph database for storing RDF graphs (includes tasks like data storage, query processing, ...)
• All RDF store tasks are executed on a single computer
7. Storing and Querying Semantic Data in the Cloud 7Daniel Janke & Steffen Staab
Terminology: RDF Graph
• Directed graph with labelled vertices and edges
• The labels of start vertex, edge and end vertex form an RDF triple
• An RDF graph is a set of RDF triples
[Figure: example RDF graph with resources w:martin, w:daniel, g:wanja, w:WeST, g:Gesis, g:bello, g:Dog and literals "Martin", "Daniel", "Wanja"; edges are labelled f:givenname, f:knows, e:employs, r:type and e:ownedBy. One labelled edge, e.g. (w:WeST, e:employs, w:martin), is a triple consisting of subject, property and object.]
8. Storing and Querying Semantic Data in the Cloud 8Daniel Janke & Steffen Staab
Terminology: SPARQL Query
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
What are the names of the employees of WeST?
(?name is a variable; each line of the WHERE clause is a triple pattern.)
9. Storing and Querying Semantic Data in the Cloud 9Daniel Janke & Steffen Staab
Terminology: Query Execution Tree
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
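The tree figure itself is not reproduced here; a plausible text rendering of such a plan for this query, assuming the usual bottom-up evaluation of a join of the two triple patterns followed by a projection, is:

    project ?name
      └─ join on ?v1
           ├─ triple pattern <w:WeST> <e:employs> ?v1
           └─ triple pattern ?v1 <f:givenname> ?name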
11. Storing and Querying Semantic Data in the Cloud 11Daniel Janke & Steffen Staab
Centralized RDF Stores
• Graph database for storing RDF graphs (includes tasks like data storage, query processing, ...)
• All RDF store tasks are executed on a single computer
Advantages
• Less complex than RDF stores running on several computers
Disadvantages
• The hardware of a single computer limits the size of the processable RDF graph
• No fault tolerance
12. Storing and Querying Semantic Data in the Cloud 12Daniel Janke & Steffen Staab
RDF Stores in the Cloud
• RDF store tasks are bundled into nodes:
  – data storage tasks are bundled into storage nodes
  – query processing tasks are bundled into compute nodes
• Compute and storage nodes¹ are distributed/replicated among several computers
¹ In the following, compute and storage nodes are referred to simply as compute nodes.
13. Storing and Querying Semantic Data in the Cloud 13Daniel Janke & Steffen Staab
How to place the data?
[Figure: the example RDF graph, whose triples must be placed on the compute nodes.]
14. Storing and Querying Semantic Data in the Cloud 14Daniel Janke & Steffen Staab
Where to find the required data?
[Figure: the example RDF graph; the query processor must locate the compute nodes holding the triples it needs.]
15. Storing and Querying Semantic Data in the Cloud 15Daniel Janke & Steffen Staab
How to distribute the query processing?
[Figure: the example query evaluated over two compute nodes.]
• Matching <w:WeST> <e:employs> ?v1 yields the bindings ?v1 ∈ {w:martin, w:daniel}.
• Matching ?v1 <f:givenname> ?name yields (w:martin, "Martin") and (w:daniel, "Daniel") on one node and (g:wanja, "Wanja") on the other; the latter finds no join partner.
• Joining on ?v1 and projecting ?name produces "Martin" and "Daniel".
16. Storing and Querying Semantic Data in the Cloud 16Daniel Janke & Steffen Staab
RDF Stores in the Cloud
• RDF store tasks are bundled into nodes:
  – data storage tasks are bundled into storage nodes
  – query processing tasks are bundled into compute nodes
• Compute and storage nodes¹ are distributed/replicated among several computers
Advantages
• Scalable by adding new compute or storage nodes
  – scaling up the dataset size
  – scaling up the query throughput
• Possibly fault tolerant
Disadvantages
• Higher complexity
¹ In the following, compute and storage nodes are referred to simply as compute nodes.
17. Storing and Querying Semantic Data in the Cloud 17Daniel Janke & Steffen Staab
Challenges of RDF Stores in the Cloud
1) How to design the architecture?
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
5) How to achieve fault tolerance?
6) How to evaluate?
Many ideas from 50 years of data engineering carry over
-> We focus on approaches more commonly used for RDF
18. Storing and Querying Semantic Data in the Cloud 18Daniel Janke & Steffen Staab
#Related Work about RDF Stores
1) How to design the architecture?
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
5) How to achieve fault tolerance?
6) How to evaluate?
(Annotation on the slide: rarely considered on its own.)
19. Storing and Querying Semantic Data in the Cloud 19Daniel Janke & Steffen Staab
Architecture Types
How to design the architecture?
20. Storing and Querying Semantic Data in the Cloud 20Daniel Janke & Steffen Staab
Properties of Architecture Types
Implementation complexity:
• How difficult is the implementation?
Freedom of data placement:
• To which extent can the data placement be influenced?
Query overhead:
• Which query overhead is caused by the architecture?
Scalability:
• To which extent do the storage and query processing capabilities increase if further compute nodes are added?
Fault tolerance:
• Do single points of failure exist?
• How easily can they be removed?
21. Storing and Querying Semantic Data in the Cloud 21Daniel Janke & Steffen Staab
Architecture Types
Architecture
• RDF stores using cloud computing frameworks
• Distributed RDF stores
• Federated RDF stores
23. Storing and Querying Semantic Data in the Cloud 23Daniel Janke & Steffen Staab
RDF Stores Using Cloud Computing Frameworks
• The RDF graph is converted and loaded into the cloud computing framework.
• SPARQL queries are translated into task(s) for the cloud computing framework.
Examples: SHARD, S2RDF, S2X, TripleRush, Jena-HBase, Sempala, D-SPARQ
24. Storing and Querying Semantic Data in the Cloud 24Daniel Janke & Steffen Staab
Cloud Computing Framework Types
RDF stores using cloud computing frameworks build on:
• batch processing frameworks
• graph processing frameworks
• NoSQL databases
  – key-value stores
  – column stores
  – document stores
(Distinction based on implementation.)
25. Storing and Querying Semantic Data in the Cloud 25Daniel Janke & Steffen Staab
Batch Processing Frameworks
• Example frameworks: Hadoop, Spark
• Queries need to be translated into one or several tasks
• Data exchange between compute nodes happens via the file system
[Figure: compute nodes around a distributed file system: 1. read input data, 2. process data, 3. write results back.]
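To make this translation concrete, here is a minimal sketch (not from the slides) of how the example query could be expressed as a Spark batch job; the input path, the line format and the parsing are assumptions for illustration:

    # Hedged sketch: the two triple patterns of the example query as a
    # batch job (assumed input: whitespace-separated triples, one per
    # line, stored on a distributed file system).
    from pyspark import SparkContext

    sc = SparkContext(appName="sparql-as-batch-job")
    triples = sc.textFile("hdfs:///triples.nt") \
                .map(lambda line: tuple(line.split()[:3]))

    # TP1: <w:WeST> <e:employs> ?v1  ->  key by ?v1
    tp1 = triples.filter(lambda t: t[0] == "w:WeST" and t[1] == "e:employs") \
                 .map(lambda t: (t[2], None))

    # TP2: ?v1 <f:givenname> ?name  ->  key by ?v1
    tp2 = triples.filter(lambda t: t[1] == "f:givenname") \
                 .map(lambda t: (t[0], t[2]))

    # Join on the shared variable ?v1, then project ?name
    names = tp1.join(tp2).map(lambda kv: kv[1][1])
    names.saveAsTextFile("hdfs:///results")  # results go back to the file system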
26. Storing and Querying Semantic Data in the Cloud 26Daniel Janke & Steffen Staab
Graph Processing Frameworks
• Examples: GraphX, Signal/Collect
• Queries are translated into vertex-centric algorithms
At each vertex:
1. Receive messages
2. Process the messages and update the vertex status
3. Send messages
Termination: the status of all vertices does not change any more
27. Storing and Querying Semantic Data in the Cloud 27Daniel Janke & Steffen Staab
Key-Value Stores
• Example: DynamoDB
• Distributed map that assigns keys to arbitrary values
• Values are atomic
• Distribution based on, e.g., hash of the key, key ranges, …
• Queries are translated to several lookups in the map and joins on the master
Example layout (subject as key):
g:Gesis → e:employs g:wanja
g:wanja → f:knows w:daniel; f:givenname "Wanja"
w:WeST → e:employs w:martin; e:employs w:daniel
w:martin → f:knows g:wanja; f:givenname "Martin"
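A minimal sketch of this lookup-and-join-on-the-master pattern, with the key-value layout modelled as plain Python dicts (node contents and helper names are illustrative):

    # Hedged sketch: subject -> list of (property, object) pairs, spread
    # over two nodes; lookups happen per node, the join runs on the master.
    node1 = {"g:Gesis": [("e:employs", "g:wanja")],
             "g:wanja": [("f:knows", "w:daniel"), ("f:givenname", "Wanja")]}
    node2 = {"w:WeST": [("e:employs", "w:martin"), ("e:employs", "w:daniel")],
             "w:martin": [("f:knows", "g:wanja"), ("f:givenname", "Martin")],
             "w:daniel": [("f:knows", "w:martin"), ("f:givenname", "Daniel")]}
    nodes = [node1, node2]

    def lookup(subject, prop):
        # one lookup per node; in a real store the hash of the key would
        # tell us which single node to ask
        return [o for n in nodes for (p, o) in n.get(subject, []) if p == prop]

    # master: TP1 lookup, then one lookup per binding, join locally
    employees = lookup("w:WeST", "e:employs")
    names = [name for v1 in employees for name in lookup(v1, "f:givenname")]
    print(names)  # ['Martin', 'Daniel']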
28. Storing and Querying Semantic Data in the Cloud 28Daniel Janke & Steffen Staab
Column Stores
• Examples: HBase, Cassandra, Accumulo, Impala
• Store tabular data column-wise
• Map column name and key to the corresponding value
• Values are atomic
• Key-value mappings are distributed based on keys, for each column separately
Example layout:
Column e:employs: g:Gesis → g:wanja; w:WeST → w:martin, w:daniel
Column f:knows: g:wanja → w:daniel; w:martin → g:wanja; w:daniel → w:martin
29. Storing and Querying Semantic Data in the Cloud 29Daniel Janke & Steffen Staab
Document Stores
• Examples: Couchbase, MongoDB
• Store documents with internal structure, e.g. JSON (i.e., non-atomic documents = more freedom to model content)
• Provide indices over documents
• Distribution based on a key within the documents
Example documents:
{_id: "g:Gesis", e:employs: "g:wanja"}
{_id: "w:WeST", e:employs: ["w:daniel", "w:martin"]}
{_id: "g:wanja", f:knows: "w:daniel", f:givenname: "Wanja"}
{_id: "w:martin", f:knows: "g:wanja", f:givenname: "Martin"}
30. Storing and Querying Semantic Data in the Cloud 30Daniel Janke & Steffen Staab
RDF Stores Using Cloud Computing Frameworks
Pros:
• Low implementation complexity
• Fault tolerance provided by the cloud computing framework
• Scalability provided by the cloud computing framework
• The cloud computing framework is maintained and improved by a community
Cons:
• Limited influence on data placement
• High overhead introduced by the cloud computing framework
• Centralized joins of data obtained by single lookups in NoSQL databases might overload the master
32. Storing and Querying Semantic Data in the Cloud 32Daniel Janke & Steffen Staab
Federated RDF Stores
• RDF stores: store the RDF data and are administrated independently.
• Query federator: coordinates query execution, i.e. decomposes the query, queries the RDF stores and joins the query results.
• Index: stores which data is contained in each RDF store.
• Cache: caches data retrieved from previous queries.
• Approaches vary by index and cache.
• Examples: DARQ, FedX, SPLENDID
33. Storing and Querying Semantic Data in the Cloud 33Daniel Janke & Steffen Staab
Federated RDF Stores
Pros:
• Low implementation complexity
• Scalability by adding new RDF stores
Cons:
• No influence on data placement
• The query federator is a single point of failure
• Centralized joins of results from different RDF stores may become a bottleneck
• Identifying the RDF stores contributing to a query may be costly
35. Storing and Querying Semantic Data in the Cloud 35Daniel Janke & Steffen Staab
Distributed RDF Stores
• Master-slave architecture
• Peer-to-peer architecture
36. Storing and Querying Semantic Data in the Cloud 36Daniel Janke & Steffen Staab
Master-Slave Architecture
Loading a graph:
1. Translate strings to fixed-length identifiers
2. Assign the triples to slaves
3. Record which data is stored at which slave
4. Transfer the triples to the slaves
5. Store the RDF triples locally at the slaves
Querying:
1. Translate constant strings to their integer identifiers
2. Check the occurrences of the constants
3. Decompose the query and send subqueries to the slaves
4. Execute the subqueries on local data
5. Join the intermediate results
6. Translate the result ids back to strings
[Figure: the slide assigns the loading steps L1–L5 and the querying steps Q1–Q6 to master and slaves.]
Examples: GraphDB, BlazeGraph, TriAD, DiploCloud
37. Storing and Querying Semantic Data in the Cloud 37Daniel Janke & Steffen Staab
Peer-to-Peer Architecture
The responsibilities of the master are copied to all slaves, resulting in peer nodes with identical architecture but varying data.
Examples: RDFPeers, Edutella, GridVine, 3RDF
38. Storing and Querying Semantic Data in the Cloud 38Daniel Janke & Steffen Staab
Distributed RDF Stores
Pros:
• Full freedom of data placement
• Little query processing overhead
• Direct transfer of intermediate results
• Fault tolerance (in the peer-to-peer case)
Cons:
• High implementation complexity
• The master is a single point of failure (in the master-slave case)
• Handling of dictionary, index and query coordination may lead to a bottleneck at the master
39. Storing and Querying Semantic Data in the Cloud 39Daniel Janke & Steffen Staab
Architecture Summary
Freedom of data placement:
• Cloud computing frameworks: Low/Medium – the framework decides about data placement
• Federated RDF stores: Low – the RDF stores are administrated independently of the federator
• Distributed RDF stores: High – a data placement strategy needs to be implemented
Fault tolerance:
• Cloud computing frameworks: High – the master is stateless and can be replicated
• Federated RDF stores: Low – the federator is a single point of failure
• Distributed RDF stores: High (peer-to-peer); Low in the master-slave case – the master is a single point of failure
Scalability:
• Cloud computing frameworks: High/Medium – possible bottlenecks are disk I/O and master-based joins
• Federated RDF stores: Medium – the federator can become a bottleneck
• Distributed RDF stores: High (peer-to-peer); Medium in the master-slave case – if the master becomes a bottleneck
40. Storing and Querying Semantic Data in the Cloud 40Daniel Janke & Steffen Staab
Architecture Summary
Query overhead:
• Cloud computing frameworks: High – initialisation of the cloud computing framework
• Federated RDF stores: Medium – identification of the required RDF stores
• Distributed RDF stores: Low – designed to execute queries efficiently
Implementation complexity:
• Cloud computing frameworks: Low – only translation of the RDF dataset and the SPARQL queries
• Federated RDF stores: Medium – dedicated querying, indexing and caching strategies required
• Distributed RDF stores: High – all components need to be implemented
41. Storing and Querying Semantic Data in the Cloud 41Daniel Janke & Steffen Staab
Data Placement Strategies
How to distribute the data?
42. Storing and Querying Semantic Data in the Cloud 42Daniel Janke & Steffen Staab
Terminology: RDF Graph
• Directed graph with labelled vertices and edges
• The labels of start vertex, edge and end vertex form an RDF triple
• An RDF graph is a set of RDF triples
[Figure: the example RDF graph from slide 7.]
43. Storing and Querying Semantic Data in the Cloud 43Daniel Janke & Steffen Staab
Terminology: Graph Cover and Graph Chunk
Graph cover (aka sharding)
Assignment of each triple to at least one compute node
Graph chunk (aka shard)
Set of triples assigned to a single compute node
[Figure: the example RDF graph split into two graph chunks, one stored on Compute Node 1 and one on Compute Node 2.]
44. Storing and Querying Semantic Data in the Cloud 44Daniel Janke & Steffen Staab
Terminology: Path and Path Length
Path
A sequence of triples in which the object of a triple is the subject of the
succeeding triple
Path length
The number of triples in the path
Example (length = 3): w:daniel –f:knows→ w:martin –f:knows→ g:wanja –f:givenname→ "Wanja"
45. Storing and Querying Semantic Data in the Cloud 45Daniel Janke & Steffen Staab
Terminology: Molecule, Anchor Vertex and Diameter
Molecule
• Set of triples that are contained in some paths starting at a vertex called the anchor vertex
• If a molecule contains a subject s, then all triples with s as subject are contained
(Directed) molecule diameter
• Longest shortest path between the anchor vertex and all objects contained in the molecule
[Figure: molecule with anchor vertex w:martin containing (w:martin, f:givenname, "Martin"), (w:martin, f:knows, g:wanja), (g:wanja, f:givenname, "Wanja") and (g:wanja, f:knows, w:daniel); diameter = 2.]
46. Storing and Querying Semantic Data in the Cloud 46Daniel Janke & Steffen Staab
Properties of Graph Cover Strategies
Complexity:
• How complex is the creation of the graph cover?
Balancing:
• How balanced are the sizes of the resulting graph chunks?
Storage size:
• Is the sum of all graph chunk sizes larger than the original graph size?
Path containment:
• How likely is it that a path can be traversed without leaving one chunk?
Query parallelisation:
• How well can the workload of one query be parallelised among several compute nodes?
Dynamics:
• Can the graph cover adapt to updates and a changing query workload?
47. Storing and Querying Semantic Data in the Cloud 47Daniel Janke & Steffen Staab
Overview Graph Cover Strategies
Graph Cover Strategies
• Static: cloud-computing-based, hash-based, graph-clustering-based, workload-aware, n-hop replication
• Dynamic
49. Storing and Querying Semantic Data in the Cloud 49Daniel Janke & Steffen Staab
Cloud-Computing-Based Graph Cover Strategies
• Data placement is mainly decided by the cloud computing framework
• It can be influenced only by
  – splitting the graph into files or tables
  – the encoding of the data within files or tables
• Goal: reduce the processing effort of queries
50. Storing and Querying Semantic Data in the Cloud 50Daniel Janke & Steffen Staab
Molecule Graph Splits
• Split the graph into molecules of directed diameter 1
51. Storing and Querying Semantic Data in the Cloud 51Daniel Janke & Steffen Staab
Molecule Graph Splits
• Store molecules in a key-value store (e.g., SHARD, Sempala)
• Store molecules in one or several files (e.g., D-SPARQ, RAPID+)
Example key-value layout:
g:Gesis → e:employs g:wanja
g:wanja → f:knows w:daniel; f:givenname "Wanja"
w:WeST → e:employs w:martin; e:employs w:daniel
w:martin → f:knows g:wanja; f:givenname "Martin"
Example file layout:
g:Gesis : (e:employs g:wanja)
g:wanja : (f:knows w:daniel), (f:givenname "Wanja")
w:WeST : (e:employs w:martin), (e:employs w:daniel)
w:martin : (f:knows g:wanja), (f:givenname "Martin")
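A minimal sketch of how such molecules of directed diameter 1 can be computed, simply by grouping triples by their subject (data and names are illustrative):

    # Hedged sketch: molecules of directed diameter 1 = triples grouped
    # by their anchor (subject) vertex.
    from collections import defaultdict

    triples = [("w:WeST", "e:employs", "w:martin"),
               ("w:WeST", "e:employs", "w:daniel"),
               ("w:martin", "f:givenname", "Martin"),
               ("w:martin", "f:knows", "g:wanja")]

    molecules = defaultdict(list)
    for s, p, o in triples:
        molecules[s].append((p, o))   # anchor vertex -> its outgoing edges

    print(molecules["w:WeST"])
    # [('e:employs', 'w:martin'), ('e:employs', 'w:daniel')]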
52. Storing and Querying Semantic Data in the Cloud 52Daniel Janke & Steffen Staab
Molecule Graph Splits
Pros:
• Easy to compute
• Selection of the required molecules is easy if the subjects are given in the context
• Subject-subject joins can be easily processed
Cons:
• If the subject is not given in the context, all molecules have to be processed
• Extending molecules by incoming edges or longer diameters increases the dataset size
53. Storing and Querying Semantic Data in the Cloud 53Daniel Janke & Steffen Staab
Vertical Graph Splits
• Create a file/table for each property
• Store all triples with that property in that file/table
• Examples: Jena-HBase, SPARQLGX
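For illustration (an assumed layout, not taken from the slides), a vertical split of the example graph yields one file/table per property, e.g.:

    file/table e:employs          file/table f:givenname
    g:Gesis    g:wanja            w:martin   "Martin"
    w:WeST     w:martin           g:wanja    "Wanja"
    w:WeST     w:daniel           w:daniel   "Daniel"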
54. Storing and Querying Semantic Data in the Cloud 54Daniel Janke & Steffen Staab
Vertical Graph Splits
Pros:
• Easy to compute
Cons:
• Queries that match a path of length l will match at most l files/tables, if the properties are given in the context
• Files/tables of frequent properties like rdf:type can become large
55. Storing and Querying Semantic Data in the Cloud 55Daniel Janke & Steffen Staab
Hash-Based Graph Cover Strategies
• Triples are assigned based on a hash function
• Possible properties of hash functions:
  – Determinism: the same input will always produce the same output
  – Uniformity: inputs are evenly mapped over the output range
  – Non-invertibility: the input datum cannot be reconstructed from a hash value
  – Continuity: the order of the hash values reflects the order of the input values
56. Storing and Querying Semantic Data in the Cloud 56Daniel Janke & Steffen Staab
Hash Cover
The hash function is applied on the subjects; all triples whose subjects hash to the same value end up in the same chunk.
[Figure: the example graph hash-partitioned by subject into two chunks.]
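A minimal sketch of a subject-based hash cover over k compute nodes (the choice of hash function is illustrative):

    # Hedged sketch: assign each triple to chunk hash(subject) mod k.
    import hashlib

    def chunk_of(subject, k):
        h = int(hashlib.md5(subject.encode()).hexdigest(), 16)
        return h % k   # deterministic and (roughly) uniform

    k = 2
    for s, p, o in [("w:WeST", "e:employs", "w:martin"),
                    ("w:martin", "f:givenname", "Martin")]:
        print((s, p, o), "-> chunk", chunk_of(s, k))
    # All triples with the same subject land in the same chunk, but a path
    # s1 -> o1(=s2) -> o2 may cross chunks whenever hash(s1) != hash(s2).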
57. Storing and Querying Semantic Data in the Cloud 57Daniel Janke & Steffen Staab
Hash Cover
Pros:
• Easy to compute
• Chunks are of almost equal size
Cons:
• Paths are more likely to contain triples that were assigned to different compute nodes
58. Storing and Querying Semantic Data in the Cloud 58Daniel Janke & Steffen Staab
Graph-Clustering-Based Graph Cover Strategies
Graph clustering:
• Split the graph into pairwise disjoint graph chunks, i.e., partitions (aka shards)
• Usually, vertices are assigned to partitions
• The partitions satisfy some clustering properties
Vertex-cut transformation:
• In RDF, triples cannot be cut
• Assign each triple to the partition to which its subject was assigned
59. Storing and Querying Semantic Data in the Cloud 59Daniel Janke & Steffen Staab
Minimal Edge-Cut Cover
• The number of cut edges should be reduced
• The number of vertices in each partition should ideally be the same
• After the vertex-cut transformation, the number of edges per partition is unbalanced
• Examples: [Huang2011], [Peng2016]
60. Storing and Querying Semantic Data in the Cloud 60Daniel Janke & Steffen Staab
Minimal Edge-Cut Cover
Pros:
• The likelihood that a path only contains triples of the same compute node is high
• The number of vertices per chunk is balanced
Cons:
• High computational effort (heuristic approaches are in O(|V|·log|V|))
• The number of triples per chunk is unbalanced (in the slide's example, one partition with 4 vertices holds 7 triples while another with 4 vertices holds only 3)
61. Storing and Querying Semantic Data in the Cloud 61Daniel Janke & Steffen Staab
Workload-Aware Graph Cover Strategies
General idea: assign triples based on a historic query workload.
General procedure:
1. Generalize from the actual queries to handle unseen queries
2. Identify the triples that are required to answer the generalized queries
3. Assign triples to compute nodes:
   – all triples required to produce one query result are assigned to the same compute node
   – the triple sets for the individual results are distributed equally among all compute nodes
Examples: WARP, DiploCloud
62. Storing and Querying Semantic Data in the Cloud 62Daniel Janke & Steffen Staab
Workload-Aware Graph Cover Strategies
Pros:
• Good query performance for queries similar to the ones in the historic query workload
Cons:
• High computational effort
• A historic query workload is required
63. Storing and Querying Semantic Data in the Cloud 63Daniel Janke & Steffen Staab
n-hop Replication
• Based on an initial graph cover with chunks
• Replicate triples such that every path of length at most n that starts at a subject contained in a chunk consists only of triples assigned to that chunk
Example: VB-Partitioner
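A minimal sketch of n-hop replication, assuming an initial cover that already keeps all triples of a subject in one chunk (set-based and purely illustrative, without the optimisations of real systems):

    # Hedged sketch: extend a chunk so that paths of length <= n starting
    # at its subjects stay inside the chunk; hop 1 is already covered by
    # the assumed subject-based initial cover.
    def n_hop_replicate(chunk, all_triples, n):
        chunk = set(chunk)
        frontier = {o for (_, _, o) in chunk}        # endpoints after hop 1
        for _ in range(n - 1):
            added = {t for t in all_triples if t[0] in frontier}
            chunk |= added                           # replicate continuations
            frontier = {o for (_, _, o) in added}
        return chunk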
64. Storing and Querying Semantic Data in the Cloud 64Daniel Janke & Steffen Staab
n-hop Replication
Pros:
• Paths of length <= n are guaranteed to belong to one chunk
Cons:
• Higher computational effort
• The dataset size increases
65. Storing and Querying Semantic Data in the Cloud 65Daniel Janke & Steffen Staab
Summary of Static Graph Cover Strategies
• Complexity: cloud low; hash low; clustering high; workload high; n-hop medium
• Chunk sizes: cloud imbalanced; hash balanced; clustering imbalanced; workload –; n-hop –
• Dataset size: cloud 100%; hash 100%; clustering 100%; workload >= 100%; n-hop > 100%
• Path containment: cloud low; hash low; clustering high; workload high; n-hop medium
• Query parallelisation: cloud medium; hash high; clustering low; workload low/high; n-hop –
66. Storing and Querying Semantic Data in the Cloud 66Daniel Janke & Steffen Staab
Overview Graph Cover Strategies
Graph Cover Strategies
• Static: cloud-computing-based, hash-based, graph-clustering-based, workload-aware, n-hop replication
• Dynamic
67. Storing and Querying Semantic Data in the Cloud 67Daniel Janke & Steffen Staab
Dynamic Graph Cover Strategies
• Adaptation of the graph cover during runtime
• Types of dynamics:
  – adaptation of the graph cover to the actual query workload
  – if one chunk becomes overloaded due to insertions of new triples, move triples to other chunks
68. Storing and Querying Semantic Data in the Cloud 68Daniel Janke & Steffen Staab
Adaptation to the Actual Query Workload
• Start with an initial static graph cover
• Keep track of how frequently triple patterns and molecules are queried together
• Replicate triples such that
  – data transfer is reduced
  – the workload is equally distributed among the compute nodes
Examples: PHD-Store, AdHash, Sedge
69. Storing and Querying Semantic Data in the Cloud 69Daniel Janke & Steffen Staab
Dynamic Redistribution of Triples
• If one compute node stores too many triples (in comparison to the others), redistribute triples based on their hash values
• If triples are stored in an ordered fashion, send one half to another compute node
Examples: [Battré2007], [Osorio2017]
70. Storing and Querying Semantic Data in the Cloud 70Daniel Janke & Steffen Staab
Indices
How to identify compute nodes that store required data?
71. Storing and Querying Semantic Data in the Cloud 71Daniel Janke & Steffen Staab
Example
Where is the information stored that answers the query "What are the names of the employees of WeST?", given a hash cover on subjects?
72. Storing and Querying Semantic Data in the Cloud 72Daniel Janke & Steffen Staab
Properties of Indices
Graph cover independence:
• How independent is the index from the graph cover strategy?
Storage consumption:
• How much storage space is required for the index?
Access time:
• How fast can the location of an indexed element be retrieved?
Indexed elements:
• Which elements are indexed?
73. Storing and Querying Semantic Data in the Cloud 73Daniel Janke & Steffen Staab
Overview Indices
Indices
• Centralized (faster access, higher degree of aggregation): hash-based, statistics-based, summary-graph-based
• Decentralized (slower access, lower degree of aggregation): hash-based, schema-based
75. Storing and Querying Semantic Data in the Cloud 75Daniel Janke & Steffen Staab
Centralized Hash-Based Index
• Applicable only for hash covers
• No explicit index required
• The location of a triple can be recomputed from the hash function and the number of chunks
• Examples: 4store, Trinity.RDF
Example for "What are the names of the employees of WeST?":
hash(w:WeST) → compute node 2
Lookups without the hashed element, e.g. e:employs ?, f:givenname ? or (w:WeST, e:employs) ?, cannot be located this way.
76. Storing and Querying Semantic Data in the Cloud 76Daniel Janke & Steffen Staab
Centralized Hash-Based Index
Pros:
• Easy to compute occurrences
• No explicit index required (no storage consumption)
Cons:
• Only applicable for hash covers
• Only applicable for hashed elements (subject, property, object)
77. Storing and Querying Semantic Data in the Cloud 77Daniel Janke & Steffen Staab
Centralized Statistics-Based Index
• Collect occurrences of
  – subject, property and object labels
  – combinations of subject, property and object labels
  – RDF types
  – property sets of molecules
• Examples: DARQ, FedX, Sedge
Example counts (c1, c2 = chunk IDs) for "What are the names of the employees of WeST?":

                 Subject     Property    Object
                 c1   c2     c1   c2     c1   c2
    w:WeST        0    2      0    0      0    0
    e:employs     0    0      1    2      0    0
    f:givenname   0    0      2    1      0    0
    ...
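A minimal sketch of how such occurrence statistics can prune the chunks a triple pattern has to be sent to (counts taken from the table above; note that the intersection is only an upper bound, since two elements may both occur in a chunk without occurring in the same triple):

    # Hedged sketch: a chunk qualifies for a pattern only if every bound
    # element of the pattern occurs in it at least once.
    stats = {  # element -> occurrences per chunk
        ("subject", "w:WeST"): {"c1": 0, "c2": 2},
        ("property", "e:employs"): {"c1": 1, "c2": 2},
        ("property", "f:givenname"): {"c1": 2, "c2": 1},
    }

    def chunks_for(*elements):
        return [c for c in ("c1", "c2")
                if all(stats[e][c] > 0 for e in elements)]

    print(chunks_for(("subject", "w:WeST"), ("property", "e:employs")))
    # ['c2']  -> only chunk c2 can contribute to <w:WeST> <e:employs> ?v1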
78. Storing and Querying Semantic Data in the Cloud 78Daniel Janke & Steffen Staab
Centralized Statistics-Based Index
Pros:
• Independent of the graph cover strategy
• Can estimate the number of results
• Fast access
Cons:
• Requires compression for storage
• Trade-off:
  – collecting only a few statistics → small size → less useful
  – collecting many statistics → large size (possibly the size of the dataset) → more useful
79. Storing and Querying Semantic Data in the Cloud 79Daniel Janke & Steffen Staab
Centralized Summary-Graph-Based Index: TriAD
Summarization algorithm:
1. Each chunk is represented by a chunk vertex
2. Start and end vertices of edges are substituted by the corresponding chunk vertices
3. Duplicate edges are removed
Example query: "What are the names of the employees of WeST?"
81. Storing and Querying Semantic Data in the Cloud 81Daniel Janke & Steffen Staab
Centralized Summary-Graph-Based Index: EAGRE
Summarization algorithm:
1. Determine the property sets of all subjects
2. Group similar property sets
3. Store the occurrences of each property set
4. Property sets become vertices
5. Replace start and end vertices of edges by their property set vertices
Example query: "What are the names of the employees of WeST?"
82. Storing and Querying Semantic Data in the Cloud 82Daniel Janke & Steffen Staab
Centralized Summary-Graph-Based Index
Pros:
• Independent of the graph cover strategy
• Identification of subqueries that can be answered locally
Cons:
• All triples with the same subject have to be assigned to the same compute node
• High storage consumption
• The summary graph needs to be queried
• Only properties are considered
83. Storing and Querying Semantic Data in the Cloud 83Daniel Janke & Steffen Staab
Overview Indices
Indices
• Centralized (faster access, higher degree of aggregation): hash-based, statistics-based, summary-graph-based
• Decentralized (slower access, lower degree of aggregation): hash-based, schema-based
84. Storing and Querying Semantic Data in the Cloud 84Daniel Janke & Steffen Staab
Decentralized Hash-Based Index
• Version 1:
  – a centralized hash-based index on each compute node
  – knowledge of all compute nodes is required
  – examples: HDRS, Virtuoso Clustered Edition
• Version 2:
  – each compute node knows a forwarding table for a few neighbours
  – ring structure overlay (e.g., RDFPeers, PAGE)
  – tree structure overlay (e.g., GridVine, 3RDF)
85. Storing and Querying Semantic Data in the Cloud 85Daniel Janke & Steffen Staab
Ring Structure Overlay
• Compute nodes are ordered
• Each compute node knows
  – its direct neighbour
  – a few distant neighbours
• When a request arrives:
  1. the compute node storing the data is determined by the hash function
  2. the request is forwarded to the (closest known) compute node storing the data
86. Storing and Querying Semantic Data in the Cloud 86Daniel Janke & Steffen Staab
Tree Structure Overlay
• C1
  – stores all data whose hash value starts with prefix 00
  – knows that C2 is responsible for prefix 01
  – knows that C3 is responsible for prefix 1
• When a request arrives, C1
  – computes the hash value
  – forwards the request based on the known prefixes
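A minimal sketch of this prefix routing from the point of view of C1 (the routing table is the one from the slide; the helper function is illustrative):

    # Hedged sketch: forward a request to the known node owning the
    # longest matching hash prefix.
    routing_at_C1 = {"00": "C1 (local)", "01": "C2", "1": "C3"}

    def route(hash_bits):
        for prefix, node in sorted(routing_at_C1.items(),
                                   key=lambda kv: -len(kv[0])):
            if hash_bits.startswith(prefix):
                return node
        raise ValueError("no matching prefix")

    print(route("0110"))  # -> C2
    print(route("1011"))  # -> C3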
87. Storing and Querying Semantic Data in the Cloud 87Daniel Janke & Steffen Staab
Decentralized Hash-Based Index
Pros:
• Easy to compute occurrences
• Low storage consumption
Cons:
• Only applicable for hash covers
• Only applicable for hashed elements (subject, property, object)
88. Storing and Querying Semantic Data in the Cloud 88Daniel Janke & Steffen Staab
Decentralized Schema-Based Index
• Applicable for type-based graph covers
• Uses the type hierarchy as a tree structure overlay
• Example: SQPeer
[Figure: type hierarchy with root rdfs:Resource, children rdfs:Class (with f:Person and e:Institute) and rdf:Property (with e:employs and f:givenname); subtrees are assigned to compute nodes C1–C4.]
89. Storing and Querying Semantic Data in the Cloud 89Daniel Janke & Steffen Staab
Decentralized Schema-Based Index
Pros:
• Queries that contain types can be forwarded to the corresponding compute node(s)
• Low storage consumption
Cons:
• Efficiently applicable only for type-based graph covers
• The types of the requested resources need to be identified
• Unbalanced index sizes
Usually used in combination with other indices.
90. Storing and Querying Semantic Data in the Cloud 90Daniel Janke & Steffen Staab
Summary Indices
• Centralized hash-based: applicable to hash covers; storage consumption low; access fast; indexed elements hash-dependent
• Centralized statistics-based: applicable to all covers; storage consumption high; access slow; indexes various aggregations
• Centralized summary-graph-based: applicable to all covers; storage consumption high; access slow; indexes properties
• Decentralized hash-based: applicable to hash covers; storage consumption low; access medium; indexed elements hash-dependent
• Decentralized schema-based: applicable to type-based covers; storage consumption low; access medium; indexes typed elements
91. Storing and Querying Semantic Data in the Cloud 91Daniel Janke & Steffen Staab
Distributed Query Processing Strategies
How to distribute query processing?
92. Storing and Querying Semantic Data in the Cloud 92Daniel Janke & Steffen Staab
Terminology: SPARQL Query
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
What are the names of the employees of WeST?
(?name is a variable; each line of the WHERE clause is a triple pattern.)
93. Storing and Querying Semantic Data in the Cloud 93Daniel Janke & Steffen Staab
Terminology: Query Execution Tree
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
95. Storing and Querying Semantic Data in the Cloud 95Daniel Janke & Steffen Staab
Distributed Query Processing
General procedure:
1. Split the query into subqueries that can be executed locally
2. Execute the subqueries on the compute nodes identified by the index
3. Join the results of the subqueries
4. Return the results
96. Storing and Querying Semantic Data in the Cloud 96Daniel Janke & Steffen Staab
Splitting a Query into Subqueries
• Simplest case: each triple pattern forms a subquery
• Use knowledge about the graph cover:
  – all triples with the same subject are stored on the same compute node
  – paths of length n can be executed locally
• Use index information:
  – co-occurrences of subject-property or property-property
97. Storing and Querying Semantic Data in the Cloud 97Daniel Janke & Steffen Staab
Properties of Join Operations
Parallelisation:
• Is the join computation distributed among several or all compute nodes?
Computational effort:
• How many comparisons are performed during the join computation?
• How many subqueries result from the join computation?
Data transfer:
• How many intermediate results are transferred to compute the join?
Blocking:
• Do the subqueries need to be finished before the join can be computed?
98. Storing and Querying Semantic Data in the Cloud 98Daniel Janke & Steffen Staab
Overview Join Processing
Joins
• Centralized (the join is executed on a single compute node): nested-loop join, merge join, hash join, bind join
• Distributed (the join is distributed over several compute nodes): replication-based join, hash join, merge join, bind join
100. Storing and Querying Semantic Data in the Cloud 100Daniel Janke & Steffen Staab
Centralized Nested-Loop Join
Compare each element of the first list with every element of the second list.
Examples: SPLENDID, DARQ
Pros:
• Does not require an ordering
• Arbitrary join conditions possible
Cons:
• Inefficient
Example inputs: ?v1 ∈ {w:martin, w:daniel} and (?v1, ?name) ∈ {(w:martin, "Martin"), (g:wanja, "Wanja"), (w:daniel, "Daniel")}
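A minimal sketch of the nested-loop join over the two binding lists above (bindings modelled as Python dicts; illustrative):

    # Hedged sketch: every left binding is compared with every right
    # binding, i.e. O(|left| * |right|) comparisons.
    left = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
    right = [{"?v1": "w:martin", "?name": "Martin"},
             {"?v1": "g:wanja", "?name": "Wanja"},
             {"?v1": "w:daniel", "?name": "Daniel"}]

    joined = [{**l, **r} for l in left for r in right
              if l["?v1"] == r["?v1"]]
    print(joined)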
101. Storing and Querying Semantic Data in the Cloud 101Daniel Janke & Steffen Staab
Centralized Merge Join
• Requires sorted intermediate result lists
• Compares a result r only with results that are <= r
• Example: Partout
Pros:
• Fast for ordered result sets
Cons:
• Slow for unordered result sets
• The intermediate result set size might lead to a bottleneck
Example inputs (sorted by ?v1): ?v1 ∈ {w:daniel, w:martin} and (?v1, ?name) ∈ {(g:wanja, "Wanja"), (w:daniel, "Daniel"), (w:martin, "Martin")}
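A minimal sketch of the merge join, assuming both inputs are already sorted by ?v1 and join keys are unique (illustrative, without duplicate-key handling):

    # Hedged sketch: advance the cursor of whichever side currently has
    # the smaller key; equal keys produce a joined binding.
    def merge_join(left, right):            # both sorted by "?v1"
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            a, b = left[i]["?v1"], right[j]["?v1"]
            if a == b:
                out.append({**left[i], **right[j]}); i += 1; j += 1
            elif a < b:
                i += 1
            else:
                j += 1
        return out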
102. Storing and Querying Semantic Data in the Cloud 102Daniel Janke & Steffen Staab
Centralized Hash Join
• Assign results to buckets based on their hashes
• Join a result only with the corresponding bucket
• Examples: ANAPSID, LHD
Example: the bindings w:daniel and w:martin for ?v1 are probed only against the buckets holding (w:daniel, "Daniel") and (w:martin, "Martin"); the bucket holding (g:wanja, "Wanja") is never compared against them.
A non-blocking symmetric version exists.
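A minimal sketch of the (blocking) hash join: build buckets from one input, then probe with the other (illustrative):

    # Hedged sketch: build phase hashes one input into buckets; the probe
    # phase looks at exactly one bucket per binding.
    from collections import defaultdict

    def hash_join(left, right):
        buckets = defaultdict(list)
        for r in right:
            buckets[hash(r["?v1"])].append(r)       # build phase
        return [{**l, **r}
                for l in left
                for r in buckets[hash(l["?v1"])]    # probe one bucket only
                if l["?v1"] == r["?v1"]]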
103. Storing and Querying Semantic Data in the Cloud 103Daniel Janke & Steffen Staab
Centralized Hash Join
Pros:
• No ordering required
• On average almost constant time complexity per probe
Cons:
• The intermediate result set size might lead to a bottleneck
104. Storing and Querying Semantic Data in the Cloud 104Daniel Janke & Steffen Staab
Bind Join
• Substitute the variables of the second subquery based on the results of the first subquery
• The second query is executed multiple times
• Examples: FedX, Avalanche, SemaGrow
Example: the bindings ?v1 = w:martin and ?v1 = w:daniel are bound into the second subquery, which is executed once per binding and returns (w:martin, "Martin") and (w:daniel, "Daniel").
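A minimal sketch of a bind join, where execute_sparql stands for an assumed function that sends a query to the store holding the second subquery's data (the template and all names are illustrative):

    # Hedged sketch: the second subquery is instantiated and executed
    # once per binding from the first subquery.
    template = "SELECT ?name WHERE { <%s> <f:givenname> ?name }"

    def bind_join(v1_bindings, execute_sparql):
        results = []
        for b in v1_bindings:                 # one remote query per binding
            for row in execute_sparql(template % b["?v1"]):
                results.append({**b, **row})
        return results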
105. Storing and Querying Semantic Data in the Cloud 105Daniel Janke & Steffen Staab
Bind Join
Pros:
• Reduces the amount of intermediate results
Cons:
• Increases the number of executed subqueries
• Possible bottlenecks:
  – large intermediate result set sizes
  – a large number of subqueries
106. Storing and Querying Semantic Data in the Cloud 106Daniel Janke & Steffen Staab
Summary Centralized Joins
• Nested-loop join: computational effort high; few executed queries; blocking
• Merge join: computational effort medium (extra effort for ordering); few executed queries; blocking
• Hash join: computational effort low; few executed queries; blocking
• Symmetric hash join: computational effort low; few executed queries; non-blocking
• Bind join: computational effort medium (effort of many subqueries); many executed queries; blocking
107. Storing and Querying Semantic Data in the Cloud 107Daniel Janke & Steffen Staab
Overview Join Processing
Joins
• Centralized (the join is executed on a single compute node): nested-loop join, merge join, hash join, bind join
• Distributed (the join is distributed over several compute nodes): replication-based join, hash join, merge join, bind join
108. Storing and Querying Semantic Data in the Cloud 108Daniel Janke & Steffen Staab
Replication-Based Distributed Join
All results of the first subquery are sent to all compute nodes on which the second subquery is executed.
Example: SemStore
Example: the bindings ?v1 ∈ {w:daniel, w:martin} are replicated to both compute nodes; one node contributes (w:martin, "Martin"), the other (w:daniel, "Daniel").
109. Storing and Querying Semantic Data in the Cloud 109Daniel Janke & Steffen Staab
Replication-Based Distributed Join
Pros:
• Not all compute nodes are necessarily involved in joining
• Uses data locality → less transferred data
Cons:
• The intermediate result set size may become a bottleneck if the second subquery is executed on a single compute node
• One subtree needs to be finished before the join can be executed
110. Storing and Querying Semantic Data in the Cloud 110Daniel Janke & Steffen Staab
Distributed Hash Join
A hash join in which each compute node serves as a bucket.
Example: DiploCloud
Example: the binding w:martin is routed to the node responsible for hash(w:martin) and w:daniel to the node for hash(w:daniel); each node joins its bucket locally, yielding (w:martin, "Martin") and (w:daniel, "Daniel").
111. Storing and Querying Semantic Data in the Cloud 111Daniel Janke & Steffen Staab
Distributed Hash Join

Pros:
• All compute nodes are involved in join processing
• A bottleneck is unlikely since the intermediate result set is distributed
  over all compute nodes
Cons:
• No use of data locality → high data transfer
• One subtree needs to be finished before the join can be executed
Distributed Merge Join
• Results of the subqueries are ordered
• Each compute node is responsible for a range of results
• Examples: H2RDF+, SHARD, SparkRDF, SPARQLGX
[Figure: the sorted results of both subqueries are range-partitioned (node 1: range
a:a-w:d, node 2: range w:e-z:z); node 1 merges ?v1 = w:daniel with (g:wanja, "Wanja")
and (w:daniel, "Daniel"), producing (w:daniel, "Daniel"); node 2 merges ?v1 = w:martin
with (w:martin, "Martin").]
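The following sketch shows the range-partition-then-merge idea; the ranges, data, and simplifying assumptions (no duplicate join values) are ours, not taken from the systems named above:

```python
# Distributed merge join sketch: nodes agree on value ranges, every node
# receives the rows of both inputs falling into its range, sorts them, and
# merges. Duplicate join values are not handled in this toy version.

RANGES = [("a", "w:e"), ("w:e", "{")]         # node i owns [lo, hi); '{' > 'z'

def node_of(value):
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= value < hi:
            return i
    raise ValueError(value)

def merge_sorted(left, right, join_var):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # classic merge on sorted inputs
        a, b = left[i][join_var], right[j][join_var]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
    return out

def distributed_merge_join(left, right, join_var):
    out = []
    for node in range(len(RANGES)):           # each node merges its own range
        l = sorted([r for r in left if node_of(r[join_var]) == node],
                   key=lambda r: r[join_var])
        rg = sorted([r for r in right if node_of(r[join_var]) == node],
                    key=lambda r: r[join_var])
        out.extend(merge_sorted(l, rg, join_var))
    return out

left = [{"?v1": "w:daniel"}, {"?v1": "w:martin"}]
right = [{"?v1": "g:wanja", "?name": "Wanja"},
         {"?v1": "w:daniel", "?name": "Daniel"},
         {"?v1": "w:martin", "?name": "Martin"}]
print(distributed_merge_join(left, right, "?v1"))
```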
Distributed Merge Join

Pros:
• All compute nodes are involved in join processing
• A bottleneck is unlikely since the intermediate result set is distributed
  over all compute nodes
Cons:
• Results need to be ordered
• Agreement on result ranges is required
• No use of data locality → high data transfer
• One subtree needs to be finished before the join can be executed
Distributed Bind Join
Join algorithm:
1) Get the results of the first subquery
2) For each following bind join query:
   a) identify the compute nodes with matches
   b) fork query execution to those remote compute nodes
Examples: RDFPeers, GridVine, Atlas, TripleRush, Trinity.RDF
[Figure: compute node 1 evaluates the first subquery, yielding ?v1 = w:daniel and
?v1 = w:martin; each binding is forked to the compute node holding the matching
data, which locally produces (w:daniel, "Daniel") and (w:martin, "Martin").]
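A sketch of the fork step; hash placement is an assumption of this sketch and not common to all of the systems named above:

```python
# Distributed bind join sketch: each binding from the first subquery is
# forwarded ("forked") to the compute node that holds the matching data,
# which continues query execution locally.

N_NODES = 2
FACTS = {"w:daniel": "Daniel", "w:martin": "Martin", "g:wanja": "Wanja"}

def owner(value):
    return hash(value) % N_NODES

# Place each fact on the node that owns its subject.
NODE_DATA = {n: {} for n in range(N_NODES)}
for subject, name in FACTS.items():
    NODE_DATA[owner(subject)][subject] = name

def run_on_node(node_id, binding, join_var):
    """Continue the query on `node_id` with the binding substituted."""
    name = NODE_DATA[node_id].get(binding[join_var])
    return [{**binding, "?name": name}] if name is not None else []

def distributed_bind_join(first_results, join_var):
    out = []
    for binding in first_results:
        node = owner(binding[join_var])       # identify the node with matches
        out.extend(run_on_node(node, binding, join_var))   # fork execution
    return out

print(distributed_bind_join([{"?v1": "w:daniel"}, {"?v1": "w:martin"}], "?v1"))
```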
Distributed Bind Join

Pros:
• The join is computed without waiting for any subtree to be finished
• Exploits data locality → less transferred data
• Results of the last join operation do not need to be sent to other
  compute nodes
Cons:
• Intermediate result set size may become a bottleneck if the second
  subquery is executed on a single compute node
Distributed Joins Summary

                 Centralized  Distributed  Distributed  Distributed  Distributed
                 Joins        Replication  Hash         Merge        Bind
Data transfer    High         Low          High         High         Low
Parallelisation  Low          Medium       High         High         Medium
# subqueries     Low          Low          Low          Low          High
Fault Tolerance
How to achieve fault tolerance?
Mirroring
• Several identical copies of each compute node exist
• If one compute node fails, its copy continues working
• Example: Virtuoso Clustered Edition
Pros:
• The query workload can be distributed among all copies
Cons:
• Copies must be kept up to date
• Replicas of different chunks are not combined to increase data locality
[Figure: compute nodes 1 and 2 with their identical mirrors 1' and 2']
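A tiny failover sketch of the mirroring idea; the node names and the routing policy (first live copy wins) are illustrative assumptions:

```python
# Mirroring sketch: every compute node has an identical copy; queries may be
# served by any live replica, and the mirror takes over on failure.

MIRRORS = {"node1": ["node1", "node1'"], "node2": ["node2", "node2'"]}
alive = {"node1": False, "node1'": True, "node2": True, "node2'": True}

def route(node):
    for replica in MIRRORS[node]:             # pick the first live copy
        if alive[replica]:
            return replica
    raise RuntimeError("all copies of %s failed" % node)

print(route("node1"))                         # node1 failed -> "node1'"
```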
Data Replication
• All compute nodes are ordered in a ring
• The data of one compute node is replicated on its neighbours
• If one compute node fails, its data remains available on the neighbours
• Examples: 4store, RDFPeers
Pros:
• The data locality of the initial graph cover is increased
Cons:
• Copies must be kept up to date
[Figure: compute nodes 1-3 arranged in a ring, each storing its own chunk (1, 2, 3)
plus a replica (1', 2', 3') of a neighbour's chunk]
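A sketch of ring replication; chunk contents and the successor-only replication degree are illustrative assumptions, not the exact scheme of 4store or RDFPeers:

```python
# Ring-replication sketch: node i's chunk is also stored on its ring
# successor; if node i fails, the successor still serves chunk i.

N = 3
chunks = {0: ["chunk-0 triples"], 1: ["chunk-1 triples"], 2: ["chunk-2 triples"]}
alive = [True, True, True]

def holders(chunk_id):
    return [chunk_id, (chunk_id + 1) % N]     # primary node and its successor

def read_chunk(chunk_id):
    for node in holders(chunk_id):
        if alive[node]:
            return chunks[chunk_id]           # served by primary or neighbour
    raise RuntimeError("chunk %d is lost" % chunk_id)

alive[1] = False                              # compute node 1 fails...
print(read_chunk(1))                          # ...node 2 still serves chunk 1
```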
Evaluation Methodology
How to evaluate?
Properties of Evaluation Methodologies
Realism:
Do the measurement results reflect the performance of real RDF
stores?
Modularity:
Can alternative implementations of individual components be
evaluated?
Evaluation depth:
Is the system evaluated only as a whole, or is the performance of its
individual components measured as well?
Difficulty:
How difficult is it to apply the evaluation methodology?
Black Box Evaluation
Evaluation of RDF stores as a whole
Some problems (of many):
• How fast is your network?
• How large are your images?
• Which processor configuration do you use?
• What are the structures of your caches?
Do you evaluate the RDF store or your hardware configuration?
Black Box Evaluation
Evaluation of RDF stores as a whole
Pros:
• Evaluation is easy to perform since no implementation knowledge is required
• Measurements reflect the behaviour of a real RDF store
Cons:
• Only superficial evaluations are possible
• No performance evaluation of individual components is possible
Glass Box Evaluation
• Evaluation of RDF stores as a whole
• Performance measurements of individual components are collected by
  – using a profiling system like Granula, or
  – adapting the source code to perform measurements
Glass Box Evaluation

Pros:
• In-depth performance evaluation is possible
• Measurements reflect the behaviour of a real RDF store
Cons:
• The source code needs to be extended to collect measurements
• Individual components can hardly be exchanged for alternative
  implementations
Simulation-based Glass Box Evaluation
Evaluation of alternative implementations of a single component by
simulating the behaviour of a real RDF store

Pros:
• Performance evaluation of individual components is possible
• Alternative implementations of individual components can be evaluated
Cons:
• An evaluation environment (simulator) needs to be implemented
• It is questionable whether the performance measurements reflect the
  behaviour of a real RDF store
[Figure: exchangeable implementations of a single component evaluated against
dataset and queries inside a simulated RDF store]
Glass Box Evaluation Platform
An RDF store that
• allows exchanging individual components for alternative implementations
• measures the performance of the individual components
[Figure: dataset and queries are fed into the platform, into which alternative
Graph Cover Creator implementations can be plugged]
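A sketch of the platform idea: components such as graph cover creators implement one common interface and can be swapped while the platform times each implementation. The interface and all names here are hypothetical, not the platform's actual API:

```python
# Glass-box-platform sketch: exchangeable components behind one interface,
# timed individually by the platform.

import time

class GraphCoverCreator:
    """Exchangeable component: partitions triples over compute nodes."""
    def create_cover(self, triples, n_nodes):
        raise NotImplementedError

class HashCover(GraphCoverCreator):
    def create_cover(self, triples, n_nodes):
        cover = [[] for _ in range(n_nodes)]
        for t in triples:
            cover[hash(t[0]) % n_nodes].append(t)   # place by subject hash
        return cover

def timed_run(component, triples, n_nodes):
    """Platform-side measurement around a single component."""
    start = time.perf_counter()
    cover = component.create_cover(triples, n_nodes)
    return cover, time.perf_counter() - start

triples = [("w:daniel", "foaf:name", "Daniel"),
           ("w:martin", "foaf:name", "Martin")]
cover, elapsed = timed_run(HashCover(), triples, 2)
print(len(cover), elapsed)
```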
Glass Box Evaluation Platform

Pros:
• In-depth performance evaluation is possible
• Alternative implementations of individual components can be evaluated
• Measurements reflect the behaviour of a real RDF store
Cons:
• Developing a glass box evaluation platform is difficult
• Interdependencies might limit the exchangeability of components
Evaluation Methodology Summary

                  Black box  Glass box  Simulation  Glass box platform
Realism           High       High       Low         Medium
Modularity        Low        Low        High        High
Evaluation depth  Low        High       High        High
Difficulty        Easy       Medium     Medium      Hard
Conclusion & Open Challenges
Conclusion
Challenges of RDF stores in the cloud:
1) How to design the architecture?
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
5) How to achieve fault tolerance?
6) How to evaluate?
Example RDF Stores in the Cloud

Virtuoso Clustered Edition
• Architecture: master-slave
• Graph cover strategy: hash cover
• Index: centralized hash-based index on each compute node
• Query execution strategy: distributed bind join
• Fault tolerance: mirroring

BlazeGraph
• Architecture: master-slave
• Graph cover strategy: distributed B+-tree
• Index: distributed B+-tree
• Query execution strategy: centralized join
• Fault tolerance: none

GraphDB
• Architecture: master-slave
• Graph cover strategy: replication of the graph on all slaves
• Index: not necessary
• Query execution strategy: centralized join
• Fault tolerance: mirroring
Example RDF Stores in the Cloud (continued)

DiploCloud
• Architecture: master-slave
• Graph cover strategy: workload-aware
• Index: centralized statistics-based index
• Query execution strategy: centralized join (for small result sets),
  distributed hash join (otherwise)
• Fault tolerance: none

S2RDF
• Architecture: batch processing framework
• Graph cover strategy: vertical graph splits
• Index: none
• Query execution strategy: distributed joins
• Fault tolerance: based on the batch processing framework

Trinity.RDF
• Architecture: master-slave
• Graph cover strategy: hash cover
• Index: distributed chunk-integrated summary graph
• Query execution strategy: distributed bind join
• Fault tolerance: none
Challenges Not Presented
• How to achieve transactional security?
• How to perform online analytical processing (OLAP) queries?
• How to process property paths?
• How to perform distributed reasoning?
• How to perform distributed stream processing?
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
Thank you for your attention!
Daniel Janke, Steffen Staab
Image References
• https://openclipart.org/detail/155101/server
• https://openclipart.org/detail/213252/gear-icon
• https://openclipart.org/detail/204067/bpm-mail-symbol
• https://openclipart.org/detail/169757/check-and-cross-marks
• https://openclipart.org/detail/153577/stopwatch