Over the last few years, the Semantic Web has been growing steadily. Today, we count more than 10,000 datasets made available online following Semantic Web standards. Nevertheless, many applications, such as data integration, search, and interlinking, may not take full advantage of the data without a priori statistical information about its internal structure and coverage. In fact, there are already a number of tools that offer such statistics, providing basic information about RDF datasets and vocabularies. However, those usually show severe performance deficiencies once the dataset size grows beyond the capabilities of a single machine. In this paper, we introduce a software component for statistical calculations of large RDF datasets, which scales out to clusters of machines. More specifically, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark. The preliminary results show that our distributed approach improves upon a previous centralized approach we compare against and provides approximately linear horizontal scale-up. The criteria are extensible beyond the 32 defaults; the component is integrated into the larger SANSA framework and is employed in at least four major usage scenarios beyond the SANSA community.
2. Outline
❖ Introduction
❖ Approach
❖ Evaluation
❖ Use Cases
❖ Conclusion and Future Work
3-6. Introduction
❖ Over the last few years, the size of the Semantic Web has increased and several large-scale datasets have been published (source: LOD-Cloud, http://lod-cloud.net/, as of August 2018)
➢ Based on LOD Stats (http://lodstats.aksw.org/), ~10,000 datasets are openly available online using Semantic Web standards
➢ Many more datasets have been RDFized and are kept private (e.g. supply chain, manufacturing, and Ethereum datasets)
7-13. Introduction
❖ Dealing with such an amount of data makes many tasks hard to solve on a single machine, for example:
➢ Vocabulary Reuse: find a suitable vocabulary for your dataset
➢ Coverage Analysis: does the dataset contain the necessary information?
➢ Privacy Analysis: does the dataset contain sensitive information?
➢ Entity Linking: which datasets are good candidates for interlinking?
❖ We need statistics! (computed using an efficient approach)
14. SANSA Overview
❖ SANSA [1] is a data flow processing engine that provides data distribution and fault tolerance for distributed computations over large-scale RDF datasets
❖ SANSA includes several libraries:
➢ Read / Write RDF / OWL library
➢ Querying library
➢ Inference library
➢ ML - Machine Learning core library
[Architecture diagram: the SANSA stack (developed in the BigDataEurope project) layers Knowledge Distribution & Representation, Querying, Inference, and Machine Learning over core APIs and libraries, deployable on a local cluster or standalone]
15. Approach
❖ A statistical criterion C (definition adopted from [2]) is a triple C = (F, D, P), where:
➢ F is a SPARQL filter condition
➢ D is a dataset derived from the main dataset (an RDD of triples) after applying F
➢ P is a post-processing filter operating on the data structure D
❖ RDDs are in-memory collections of records that can be operated on in parallel across large clusters
➢ We use RDDs to represent RDF triples
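To make the (F, D, P) structure concrete, here is a minimal Scala sketch of how such a criterion could be represented over a Spark RDD of Jena triples. The Criterion case class and its field names are illustrative assumptions for this sketch, not the actual SANSA API:

  import org.apache.jena.graph.Triple
  import org.apache.spark.rdd.RDD

  // A statistical criterion C = (F, D, P):
  //   filterRule (F) derives D = triples.filter(filterRule),
  //   action aggregates D, and postProc (P) refines the aggregate.
  // Hypothetical names for illustration; not the SANSA API.
  case class Criterion[K, R](
    filterRule: Triple => Boolean,
    action: RDD[Triple] => RDD[(K, Long)],
    postProc: RDD[(K, Long)] => R
  ) {
    def run(triples: RDD[Triple]): R =
      postProc(action(triples.filter(filterRule)))
  }

For the class-distribution criterion used as the running example below, filterRule keeps rdf:type triples with an IRI object, action counts occurrences per class, and postProc takes the 100 most frequent classes.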
16. Approach
❖ Overview of the DistLODStats architecture
17. 17 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
Approach (DistLODStats workflow)
❖ RDF statistics are defined in the format [(Rule → Filter) → Action → Postprocessing]
❖ Statistics supported by the SANSA engine include: Class Distribution, Property Distribution, In-degree, Out-degree, Distinct Entities, Literals, SameAs, Namespaces, ...
❖ Running example, Class Distribution:
➢ Filter rule: ?p=rdf:type && isIRI(?o)
➢ Action: M[?o]++
➢ Postprocessing: top(M,100)
❖ We use RDDs to represent RDF triples:
val triples = spark.rdf(lang)(input)
val stats = triples.statsClassUsageCount()
20. 20 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
Approach (DistLODStats workflow)
❖ Filtering: the filter rule for Class Distribution is evaluated in parallel on the workers
val filter_rule = triples
  .filter(f => f.predicateMatches(RDF.`type`.asNode()) && f.getObject.isURI)
  .map(_.getObject)
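For intuition, here are two hypothetical triples and how the filter rule treats them (the data is invented for illustration):

<http://ex.org/alice> rdf:type <http://xmlns.com/foaf/0.1/Person> .   → kept (predicate is rdf:type, object is an IRI)
<http://ex.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .      → dropped (predicate is not rdf:type, object is a literal)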
21. 21 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
Approach (DistLODStats workflow)
❖ Counting: each worker builds partial counts, which reduceByKey combines locally before the shuffle, so only small partial counts move across the network
val action = filter_rule
  .map(f => (f, 1))
  .reduceByKey(_ + _)
[Figure: partial counts on the workers, e.g. :Person 5, :Place 2, :Work 1 | :Org. 6, :Species 4, :Person 1 | :Person 3, :Place 1]
22. 22 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
Approach (DistLODStats workflow)
❖ Postprocessing: the combined counts are sorted by frequency and the top 100 are collected on the master
val result = action.sortBy(_._2, false)
  .take(100)
[Figure: counts merged on the master: :Person 9, :Org. 6, :Species 4, :Place 3, :Work 1]
23. 23 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
Approach (DistLODStats workflow)
❖ Output: the results are serialised using the Vocabulary of Interlinked Datasets (VoID) and can be analysed, e.g. in SANSA-Notebooks
result.voidify(output)
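Putting the four steps together, here is a self-contained sketch of the Class Distribution pipeline. The SANSA import path is an assumption; spark.rdf and the per-step code come from the slides above, while voidify (part of SANSA) is replaced here by a simple printout:

import org.apache.jena.riot.Lang
import org.apache.jena.vocabulary.RDF
import org.apache.spark.sql.SparkSession
import net.sansa_stack.rdf.spark.io._ // assumed SANSA import

object ClassDistribution {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Class Distribution").getOrCreate()

    // RDF data: read the dataset into an RDD of triples
    val triples = spark.rdf(Lang.NTRIPLES)(args(0))

    // Filter rule: ?p=rdf:type && isIRI(?o)
    val filter_rule = triples
      .filter(f => f.predicateMatches(RDF.`type`.asNode()) && f.getObject.isURI)
      .map(_.getObject)

    // Action: M[?o]++ as a distributed count
    val action = filter_rule.map(f => (f, 1)).reduceByKey(_ + _)

    // Postprocessing: top(M,100) by frequency
    val result = action.sortBy(_._2, false).take(100)
    result.foreach { case (cls, n) => println(s"$cls\t$n") }

    spark.stop()
  }
}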
24. 24 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
❖ Q1: How does the runtime of the algorithm change when more
nodes in the cluster are added?
❖ Q2: How does the algorithm scale to larger datasets?
❖ Q3: How does the algorithm scale to a larger number of datasets?
Evaluation
26. 26 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
❖ Distributed Processing on Large-Scale Datasets
Evaluation
Runtime in hours (mean/std):
Dataset       | LODStats a) files | LODStats b) bigfile | DistLODStats c) local | DistLODStats d) cluster | e) speedup ratio*
LinkedGeoData | n/a               | n/a                 | 36.65/0.13            | 4.37/0.15               | 7.4x
DBpedia_en    | 24.63/0.57        | fail                | 25.34/0.11            | 2.97/0.08               | 7.6x
DBpedia_de    | n/a               | n/a                 | 10.34/0.06            | 1.2/0.0                 | 7.3x
DBpedia_fr    | n/a               | n/a                 | 10.49/0.09            | 1.27/0.04               | 7.3x
* e) = c) / d) - 1, e.g. for LinkedGeoData: 36.65 / 4.37 - 1 ≈ 7.4
31. 31 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
❖ Speedup performance evaluation of DistLODStats
Evaluation
DistLODStats shows consistent improvements for each dataset when running on a cluster, with a geometric mean speedup of 7.4x, which answers Q1.
33. 33 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
❖ Sizeup performance evaluation of DistLODStats
Evaluation
The execution time grows linearly as the size of the dataset increases, and the per-unit cost stays near-constant as long as the data fits in memory. DistLODStats scales well in the context of sizeup, which answers Q2.
35. 35 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
❖ Scalability performance evaluation of DistLODStats
Evaluation
As the number of workers increases, the execution time decreases; on the BSBM_50GB dataset the speedup is even super-linear.
37. 37 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
❖ Speedup Ratio and Efficiency of DistLODStats
Evaluation
The speedup trend is consistent as the number of workers increases. Efficiency increases only up to the 4th worker for the BSBM_50GB dataset. The results imply that DistLODStats can achieve near-linear, and in some cases super-linear, scalability in performance, which answers Q3.
* speedup S = T_L / T_N, where T_L is the execution time of the algorithm in local mode and T_N is the time required to complete the task on N workers
* efficiency E = S / N
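A worked example with purely hypothetical numbers (the measured per-worker runtimes are shown in the plot): if local mode takes T_L = 36 h and N = 5 workers take T_N = 6 h, then
S = T_L / T_N = 36 / 6 = 6
E = S / N = 6 / 5 = 1.2
An efficiency above 1 is exactly the super-linear behaviour observed for BSBM_50GB.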
38. 38 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
Powered by
❖ Comprehensive statistics – LODStats <http://lodstats.aksw.org/>
DistLODStats is used as the underlying engine, overcoming the previous limitations and generating statistical descriptions, including e.g. VoID, for large parts of the Linked Open Data Cloud.
❖ Blockchain – Alethio Use Case <https://aleth.io/>
Alethio uses SANSA in general, and DistLODStats specifically, to perform large-scale batch analytics, e.g. computing the asset turnover for sets of accounts, attack pattern frequencies, and opcode usage statistics. DistLODStats was run on a 100-node cluster with 400 cores.
❖ Big Data Platform – BDE <https://www.big-data-europe.eu/>
BDE uses the Mu Swarm Logger service to detect Docker events and convert their representation to RDF. DistLODStats then computes statistics over these logs within the BDE platform; to generate visualisations of the log statistics, BDE calls DistLODStats from SANSA-Notebooks.
❖ LOD Summaries – ABSTAT <http://abstat.disco.unimib.it/>
DistLODStats is used for the summarisation of large-scale RDF datasets in this context.
42. 42 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
❖ Obtaining an overview of the Web of Data is:
➢ data-intensive and compute-intensive
➢ a challenge: it requires fast and efficient algorithms that can handle large-scale RDF datasets
❖ DistLODStats: a novel software component (integrated into the larger SANSA framework) for distributed in-memory computation of RDF dataset statistics, implemented using the Spark framework
❖ As future work, we plan to further improve time efficiency and perform load balancing
Conclusion
43. 43 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
STATisfy: A REST Interface for DistLODStats
[Figure: collaborative analytics services and a marketplace talk to a REST server, which submits SANSA DistLODStats jobs to a local cluster (master plus workers 1..n) managed by a standalone resource manager; built in the context of BigDataEurope]
Visit our poster [P29] at 7pm!
46. 46 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
[1] J. Lehmann, G. Sejdiu, L. Bühmann, P. Westphal, C. Stadler, I. Ermilov, S. Bin, N. Chakraborty, M. Saleem, A.-C. Ngonga Ngomo, and H. Jabeen. Distributed Semantic Analytics using the SANSA Stack. In Proceedings of the 16th International Semantic Web Conference (ISWC), 2017.
[2] J. Demter, S. Auer, M. Martin, and J. Lehmann. LODStats – An Extensible Framework for High-Performance Dataset Analytics. In Proceedings of EKAW 2012, Lecture Notes in Computer Science (LNCS) 7603. Springer, 2012.
References
47. 47 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
Backup Slides
48. 48 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
Introduction
❖ Over the last years, the size of the Semantic Web has increased and several large-scale datasets have been published
➢ [Figures: snapshots of the LOD Cloud as of May 2007, September 2008, and September 2011]
Source: LOD-Cloud (http://lod-cloud.net/)
52. 52 DistLODStats: Distributed Computation of RDF Dataset Statistics Gezim Sejdiu - University of Bonn
❖ Overall Breakdown by Criterion Analysis (log scale)
Evaluation
The execution time is longer for criteria that involve data movement across the cluster than for criteria whose data is processed without movement. Some criteria are nevertheless quite efficient to compute even with data movement, e.g. criteria 22 and 23, because the data is largely filtered before the movement.
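The Spark intuition behind this, as a hedged sketch (the SameAs criterion from the statistics list serves as illustration; triples is the RDD from the workflow slides):

import org.apache.jena.vocabulary.OWL

// The filter is a narrow transformation and runs before any shuffle,
// so distinct(), which does shuffle, only moves the few matching nodes.
val distinctSameAsObjects = triples
  .filter(_.predicateMatches(OWL.sameAs.asNode()))
  .map(_.getObject)
  .distinct()
  .count()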