This document summarizes recent approaches to web data management including Fusion Tables, XML, and Linked Open Data (LOD). It discusses properties of web data like lack of schema, volatility, and scale. LOD uses RDF, global identifiers (URIs), and data links to query and integrate data from multiple sources while maintaining source autonomy. The LOD cloud has grown rapidly, currently consisting of over 3000 datasets with more than 84 billion triples.
Web data management has been a topic of interest for many years, during which a number of different modelling approaches have been tried. The latest of these approaches is to use RDF (Resource Description Framework), which seems to provide a real opportunity for querying at least some Web data systematically. RDF has been proposed by the World Wide Web Consortium (W3C) for modeling Web objects as part of developing the "semantic web". W3C has also proposed SPARQL as the query language for accessing RDF data repositories. The publication of Linked Open Data (LOD) on the Web has gained tremendous momentum over the last several years, and this provides a new opportunity to accomplish web data integration. A number of approaches have been proposed for running SPARQL queries over RDF-encoded Web data: data warehousing, SPARQL federation, and live linked query execution. In this talk, I will review these approaches with particular emphasis on some of our research within the context of the gStore project (joint project with Prof. Lei Zou of Peking University and Prof. Lei Chen of Hong Kong University of Science and Technology), the chameleon-db project (joint work with Güneş Aluç, Dr. Olaf Hartig, and Prof. Khuzaima Daudjee of University of Waterloo), and live linked query execution (joint work with Dr. Olaf Hartig).
1. Web Data Management in RDF Age
M. Tamer Özsu
University of Waterloo
David R. Cheriton School of Computer Science
ICDCS'17/2017-06-07
2. Acknowledgements
This presentation draws upon collaborative research and discussions with the following colleagues (in alphabetical order)
Güneş Aluç, University of Waterloo; now at SAP
Khuzaima Daudjee, University of Waterloo
Olaf Hartig, University of Waterloo; now at Linköping Univ.
Lei Chen, Hong Kong University of Science & Technology
Lei Zou, Peking University
3. Web Data Management
A long term research interest in the DB community
[Figure: four items dated 2000, 2004, 2011, and 2011]
4. Interest Due to Properties of Web Data
Lack of a schema
Data is at best "semi-structured"
Missing data, additional attributes, similar data but not identical
Volatility
Changes frequently
May conform to one schema now, but not later
Scale
Does it make sense to talk about a schema for Web?
How do you capture "everything"?
Querying difficulty
What is the user language?
What are the primitives?
Aren't search engines or metasearch engines sufficient?
6. More Recent Approaches to Web Querying
Fusion Tables
Users contribute data in spreadsheet, CSV, or KML format
Possible joins between multiple data sets
Extensive visualization
XML
Data exchange language
Primarily tree based structure
<list title="MOVIES">
<film>
<title>The Shining</title>
<director>Stanley Kubrick</director>
<actor>Jack Nicholson</actor>
</film>
<film>
<title>Spartacus</title>
<director>Stanley Kubrick</director>
</film>
<film>
<title>The Passenger</title>
<actor>Jack Nicholson</actor>
</film>
...
</list>
root
  film
    title: "The Shining"
    director: "Stanley Kubrick"
    actor: "Jack Nicholson"
  film
    ...
  film
    title: "The Passenger"
    actor: "Jack Nicholson"
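To make the tree view of the movie list concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The elided `...` entries from the slide are left out so the fragment parses, and the helper name `titles_with` is an illustrative assumption, not something from the talk. Note how the films do not all carry the same child elements, which is the "semi-structured" property mentioned earlier.

```python
import xml.etree.ElementTree as ET

# The slide's movie list, without the elided "..." entries so that the
# fragment is well-formed XML.
MOVIES_XML = """
<list title="MOVIES">
  <film>
    <title>The Shining</title>
    <director>Stanley Kubrick</director>
    <actor>Jack Nicholson</actor>
  </film>
  <film>
    <title>Spartacus</title>
    <director>Stanley Kubrick</director>
  </film>
  <film>
    <title>The Passenger</title>
    <actor>Jack Nicholson</actor>
  </film>
</list>
"""


def titles_with(root, tag, value):
    """Return titles of films having a child <tag> whose text equals value."""
    return [
        film.findtext("title")
        for film in root.findall("film")
        if any(child.text == value for child in film.findall(tag))
    ]


root = ET.fromstring(MOVIES_XML)
kubrick_films = titles_with(root, "director", "Stanley Kubrick")
nicholson_films = titles_with(root, "actor", "Jack Nicholson")
print(kubrick_films)
print(nicholson_films)
```

Every query here walks down from the root through `film` nodes, which is natural for a tree but awkward when the same director or actor should be one shared entity rather than repeated text.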
7. More Recent Approaches to Web Querying
Fusion Tables
Users contribute data in spreadsheet, CSV, or KML format
Possible joins between multiple data sets
Extensive visualization
XML
Data exchange language
Primarily tree based structure
Linked Open Data (LOD)
W3C work; community effort
Maintains autonomy of data sources
Low barrier to entry
9. Linked Data Publishing Principles
[Figure: "IMDb" and "World Book" data sources, with the resources "Shining" and "UK"]
(http://...linkedmdb.../Shining, releaseDate, 23 May 1980)
(http://...linkedmdb.../Shining, filmLocation, http://cia.../UK)
(http://...linkedmdb.../29704, actedIn, http://...linkedmdb.../Shining)
...
(http://cia.../UK, hasPopulation, 63230000)
...
Data model: RDF
Global identifier: URI
Access mechanism: HTTP
Connection: data links
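The role of data links can be sketched as follows: a query starts in one source, follows a URI that points into another source, and continues there. This is only an in-memory illustration. The `SOURCES` dictionary stands in for HTTP lookups, the source keys and function names are assumptions, and the URIs are kept exactly as elided on the slide, so they are treated as opaque identifiers.

```python
# Triples copied from the slide; the "..." inside the URIs is the slide's own
# elision, so the strings are used as opaque identifiers.
SHINING = "http://...linkedmdb.../Shining"
ACTOR = "http://...linkedmdb.../29704"
UK = "http://cia.../UK"

# Hypothetical stand-in for the Web: each source holds only its own triples.
SOURCES = {
    "linkedmdb": [
        (SHINING, "releaseDate", "23 May 1980"),
        (SHINING, "filmLocation", UK),
        (ACTOR, "actedIn", SHINING),
    ],
    "cia": [
        (UK, "hasPopulation", 63230000),
    ],
}


def lookup(subject):
    """Mimic dereferencing a URI: return all triples with that subject."""
    return [t for triples in SOURCES.values() for t in triples if t[0] == subject]


def objects(subject, predicate):
    """Return the objects of all (subject, predicate, ?) triples."""
    return [o for _, p, o in lookup(subject) if p == predicate]


def population_of_film_locations(film):
    """Follow filmLocation links, then read hasPopulation at each target."""
    result = {}
    for place in objects(film, "filmLocation"):
        for population in objects(place, "hasPopulation"):
            result[place] = population
    return result


populations = population_of_film_locations(SHINING)
print(populations)
```

The answer combines a triple from the movie source with a triple from the country source, and neither source needs to know about the other beyond the shared URI.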
15. LOD Data Volumes . . .
. . . are growing, and fast
Linked data cloud currently consists of 3000 datasets with >84B triples
Size almost doubling every year
April '14: 570 datasets, ??? triples
Max Schmachtenberg, Christian Bizer, and Heiko Paulheim: Adoption of Linked Data Best Practices in Different Topical Domains. In Proc. ISWC, 2014.
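As a rough sanity check on the growth claim, the two dataset counts on the slide can be turned into an implied annual growth factor. This is a back-of-the-envelope sketch only: the elapsed time of three years is an assumption (the slide does not date the "currently" figure), the two counts may not have been collected with the same method, and the calculation covers dataset counts rather than triples.

```python
import math

datasets_2014 = 570   # April '14 figure from the slide
datasets_now = 3000   # "currently" figure from the slide
years_elapsed = 3.0   # ASSUMPTION: roughly April 2014 to the 2017 talk

# Compound annual growth factor implied by the two counts.
annual_factor = (datasets_now / datasets_2014) ** (1 / years_elapsed)

# Time needed to double at that rate.
doubling_time_years = math.log(2) / math.log(annual_factor)

print(f"annual growth factor: {annual_factor:.2f}")
print(f"doubling time: {doubling_time_years:.2f} years")
```

Under these assumptions the dataset count grows by somewhat less than a factor of two per year, which is in line with "almost doubling".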
17. Outline
1 RDF Technology [Özsu, 2016]
Data Warehousing Approach
Distributed RDF Processing
2 Federated RDF Systems
SPARQL Endpoint Federation
General RDF Federation
3 LOD – Live Querying Approach [Hartig, 2013a]
Traversal-based approaches
Index-based approaches
Hybrid approaches
4 Conclusions
19. RDF Introduction
Everything is a uniquely named resource
http://data.linkedmdb.org/resource/actor/JN29704
23. RDF Introduction
Everything is a uniquely named resource
Prefixes can be used to shorten the names
Properties of resources can be defined
Relationships with other resources can be defined
Resource descriptions can be contributed by different people/groups and can be located anywhere in the web
Integrated web "database"
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName "Jack Nicholson"
y:JN29704:BornOnDate "1937-04-22"
y:TS2014:title "The Shining"
y:TS2014:releaseDate "1980-05-23"
y:TS2014
JN29704:movieActor
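A small sketch of how prefixes and properties fit together, assuming the slide's notation `y:JN29704:hasName "Jack Nicholson"` is read as subject, property, value. The names `PREFIXES`, `expand`, and `describe` are illustrative, not part of any RDF library.

```python
# Prefix declaration taken from the slide (xmlns:y=...).
PREFIXES = {"y": "http://data.linkedmdb.org/resource/actor/"}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'y:JN29704' into a full URI."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name  # already a full URI, or an unknown prefix


# Property statements from the slide, read as (resource, property, value).
DESCRIPTIONS = [
    ("y:JN29704", "hasName", "Jack Nicholson"),
    ("y:JN29704", "BornOnDate", "1937-04-22"),
    ("y:TS2014", "title", "The Shining"),
    ("y:TS2014", "releaseDate", "1980-05-23"),
]


def describe(resource):
    """Collect every property recorded for a resource, however it is named."""
    target = expand(resource)
    return {p: v for s, p, v in DESCRIPTIONS if expand(s) == target}


full_uri = expand("y:JN29704")
actor = describe("http://data.linkedmdb.org/resource/actor/JN29704")
print(full_uri)
print(actor)
```

Because the short and the long form resolve to the same identifier, statements contributed by different parties under either spelling end up describing the same resource.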
24. RDF Data Model
Triple: Subject, Predicate (Property), Object (s, p, o)
Subject: the entity that is described (URI or blank node)
Predicate: a feature of the entity (URI)
Object: value of the feature (URI, blank node or literal)
(s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L)
U: set of URIs
B: set of blank nodes
L: set of literals
Set of RDF triples is called an RDF graph
[Figure: a Subject node (U or B) linked by a Predicate edge (U) to an Object node (U, B, or L)]
Subject                      Predicate           Object
http://...imdb.../film/2014  rdfs:label          "The Shining"
http://...imdb.../film/2014  movie:releaseDate   "1980-05-23"
http://...imdb.../29704      movie:actor_name    "Jack Nicholson"
...                          ...                 ...
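The membership condition on triples can be written down directly as a type check. The sketch below is one possible encoding, with hypothetical class names `URI`, `BlankNode`, and `Literal` standing for the sets U, B, and L. The subject strings keep the slide's elisions, and prefixed names such as `rdfs:label` are stored unexpanded for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:          # element of U
    value: str


@dataclass(frozen=True)
class BlankNode:    # element of B
    label: str


@dataclass(frozen=True)
class Literal:      # element of L
    value: str


def is_valid_triple(s, p, o):
    """Check that (s, p, o) lies in (U or B) x U x (U or B or L)."""
    return (
        isinstance(s, (URI, BlankNode))
        and isinstance(p, URI)
        and isinstance(o, (URI, BlankNode, Literal))
    )


film = URI("http://...imdb.../film/2014")

# An RDF graph is a set of triples; frozen dataclasses are hashable,
# so a plain Python set works.
graph = {
    (film, URI("rdfs:label"), Literal("The Shining")),
    (film, URI("movie:releaseDate"), Literal("1980-05-23")),
}

all_valid = all(is_valid_triple(*t) for t in graph)
literal_subject_ok = is_valid_triple(Literal("The Shining"), URI("rdfs:label"), film)
blank_predicate_ok = is_valid_triple(film, BlankNode("b0"), Literal("x"))
print(all_valid, literal_subject_ok, blank_predicate_ok)
```

The two rejected cases show the asymmetry in the definition: literals may appear only as objects, and predicates must always be URIs.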
27. UniProt in RDF http://www.uniprot.org
UniProt collects data from >150 biological resources
Claim: "lack of a common standard to represent and link information makes data integration an expensive business" – RDF can help
30. UniProt in RDF: What does the data look like?
UniProt accession for the human CYP51 protein: Q16850
Encode it as RDF:
XML/RDF format
<rdf:Description rdf:about="http://purl.uniprot.org/citations/8619637">
  <rdf:type rdf:resource="http://purl.uniprot.org/core/Journal_Citation"/>
  <title>The ubiquitously expressed human CYP51 encodes lanosterol 14 alpha-demethylase, a cytochrome P450 whose expression is regulated by oxysterols.</title>
  <author>Stroemstedt M.</author>
  <author>Rozman D.</author>
  <author>Waterman M.R.</author>
  <skos:exactMatch rdf:resource="http://purl.uniprot.org/pubmed/8619637"/>
  <foaf:primaryTopicOf rdf:resource="https://www.ncbi.nlm.nih.gov/pubmed/8619637"/>
  <dcterms:identifier>doi:10.1006/abbi.1996.0193</dcterms:identifier>
  <date rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">1996</date>
  <name>Arch. Biochem. Biophys.</name>
  <volume>329</volume>
  <pages>73-81</pages>
</rdf:Description>
This can be shown as a table <Subject, Predicate, Object>
http://purl.uniprot.org/uniprot/Q16850.rdf
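To make the "table of <Subject, Predicate, Object>" view concrete, the following Python sketch flattens a shortened copy of the description into triples using only the standard library. It is not a full RDF/XML parser: it handles only flat `rdf:Description` elements, and the namespace declarations on the wrapping `rdf:RDF` element (in particular the default namespace) are assumptions added so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Shortened fragment; the xmlns declarations are assumed for this sketch.
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.uniprot.org/core/">
  <rdf:Description rdf:about="http://purl.uniprot.org/citations/8619637">
    <rdf:type rdf:resource="http://purl.uniprot.org/core/Journal_Citation"/>
    <author>Rozman D.</author>
    <volume>329</volume>
  </rdf:Description>
</rdf:RDF>"""


def flat_triples(xml_text):
    """Turn each child of an rdf:Description into one (s, p, o) row."""
    rows = []
    root = ET.fromstring(xml_text)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for child in desc:
            # ElementTree names look like "{namespace}local"; join the parts.
            predicate = child.tag[1:].replace("}", "", 1)
            # The object is either a resource reference or the element text.
            obj = child.get(f"{{{RDF_NS}}}resource") or (child.text or "").strip()
            rows.append((subject, predicate, obj))
    return rows


rows = flat_triples(DOC)
for row in rows:
    print(row)
```

Each printed row has the citation URI as subject, the expanded element name as predicate, and either the referenced URI or the literal text as object.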
32. RDF Query Model: SPARQL
Query Model: SPARQL Protocol and RDF Query Language
Given U (set of URIs), L (set of literals), and V (set of variables), a SPARQL expression is defined recursively:
  an atomic triple pattern, which is an element of (U ∪ V) × (U ∪ V) × (U ∪ V ∪ L)
    ?x rdfs:label "The Shining"
  P FILTER R, where P is a graph pattern expression and R is a built-in SPARQL condition (i.e., analogous to a SQL predicate)
    ?x rev:rating ?p FILTER(?p > 3.0)
  P1 AND/OPT/UNION P2, where P1 and P2 are graph pattern expressions
Example:
  SELECT ?name
  WHERE {
    ?m rdfs:label ?name . ?m movie:director ?d .
    ?d movie:director_name "Stanley Kubrick" .
    ?m movie:relatedBook ?b . ?b rev:rating ?r .
    FILTER(?r > 4.0)
  }
33. SPARQL Queries
  SELECT ?name
  WHERE {
    ?m rdfs:label ?name . ?m movie:director ?d .
    ?d movie:director_name "Stanley Kubrick" .
    ?m movie:relatedBook ?b . ?b rev:rating ?r .
    FILTER(?r > 4.0)
  }
[Figure: query graph of the query above. Edges: ?m -rdfs:label-> ?name; ?m -movie:director-> ?d; ?d -movie:director_name-> "Stanley Kubrick"; ?m -movie:relatedBook-> ?b; ?b -rev:rating-> ?r, with FILTER(?r > 4.0)]
35. UniProt in RDF: The Data Can Be Queried
RDF encoded UniProt data can be queried using SPARQL: http://sparql.uniprot.org/sparql
Get the GO function for Q16850 (from UniProt SPARQL endpoint)
  PREFIX upc: <http://purl.uniprot.org/core/>
  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?goid ?golabel
  WHERE {
    <http://purl.uniprot.org/uniprot/Q16850> a upc:Protein ;
        upc:classifiedWith ?keyword .
    ?keyword rdfs:seeAlso ?goid .
    ?goid rdfs:label ?golabel .
  }
Find the differential expression of probes that map to Q16850, together with the p-value (from Expression Atlas SPARQL endpoint)
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
  SELECT DISTINCT ?valueLabel ?pvalue
  WHERE {
    ?value rdfs:label ?valueLabel .
    ?value atlasterms:pValue ?pvalue .
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref <http://purl.uniprot.org/uniprot/Q16850> .
  }
  ORDER BY ASC(?pvalue)
38. Naïve Triple Store Design
  SELECT ?name
  WHERE {
    ?m rdfs:label ?name . ?m movie:director ?d .
    ?d movie:director_name "Stanley Kubrick" .
    ?m movie:relatedBook ?b . ?b rev:rating ?r .
    FILTER(?r > 4.0)
  }
Triple table T:
Subject              Property                    Object
mdb:film/2014        rdfs:label                  "The Shining"
mdb:film/2014        movie:initial_release_date  "1980-05-23"
mdb:film/2014        movie:director              mdb:director/8476
mdb:film/2014        movie:actor                 mdb:actor/29704
mdb:film/2014        movie:actor                 mdb:actor/30013
mdb:film/2014        movie:music_contributor     mdb:music_contributor/4110
mdb:film/2014        foaf:based_near             geo:2635167
mdb:film/2014        movie:relatedBook           bm:books/0743424425
mdb:film/2014        movie:language              lexvo:iso639-3/eng
mdb:director/8476    movie:director_name         "Stanley Kubrick"
mdb:film/2685        movie:director              mdb:director/8476
mdb:film/2685        rdfs:label                  "A Clockwork Orange"
mdb:film/424         movie:director              mdb:director/8476
mdb:film/424         rdfs:label                  "Spartacus"
mdb:actor/29704      movie:actor_name            "Jack Nicholson"
mdb:film/1267        movie:actor                 mdb:actor/29704
mdb:film/1267        rdfs:label                  "The Last Tycoon"
mdb:film/3418        movie:actor                 mdb:actor/29704
mdb:film/3418        rdfs:label                  "The Passenger"
geo:2635167          gn:name                     "United Kingdom"
geo:2635167          gn:population               62348447
geo:2635167          gn:wikipediaArticle         wp:United_Kingdom
bm:books/0743424425  dc:creator                  bm:persons/Stephen+King
bm:books/0743424425  rev:rating                  4.7
bm:books/0743424425  scom:hasOffer               bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng   rdfs:label                  "English"
lexvo:iso639-3/eng   lvont:usedIn                lexvo:iso3166/CA
lexvo:iso639-3/eng   lvont:usesScript            lexvo:script/Latn
Equivalent SQL over T(s, p, o):
  SELECT T1.o
  FROM T AS T1, T AS T2, T AS T3, T AS T4, T AS T5
  WHERE T1.p = 'rdfs:label'
    AND T2.p = 'movie:relatedBook'
    AND T3.p = 'movie:director'
    AND T4.p = 'rev:rating'
    AND T5.p = 'movie:director_name'
    AND T1.s = T2.s
    AND T1.s = T3.s
    AND T2.o = T4.s
    AND T3.o = T5.s
    AND T4.o > 4.0
    AND T5.o = 'Stanley Kubrick'
Easy to implement, but too many self-joins!
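The naïve design can be tried out directly with an in-memory SQLite database. The sketch below loads a handful of the triples into a single table `T(s, p, o)` and runs the five-way self-join; the reduced data set and the untyped `o` column are choices made for this illustration, not details taken from the talk.

```python
import sqlite3

# A small subset of the example triples (ratings stored as numbers).
triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("bm:books/0743424425", "rev:rating", 4.7),
]

conn = sqlite3.connect(":memory:")
# One table holds every triple; "o" is left untyped so it can hold
# both strings and numbers.
conn.execute("CREATE TABLE T (s TEXT, p TEXT, o)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?)", triples)

# One alias of T per triple pattern: five patterns, five-way self-join.
SQL = """
SELECT T1.o
FROM T AS T1, T AS T2, T AS T3, T AS T4, T AS T5
WHERE T1.p = 'rdfs:label'
  AND T2.p = 'movie:relatedBook'
  AND T3.p = 'movie:director'
  AND T4.p = 'rev:rating'
  AND T5.p = 'movie:director_name'
  AND T1.s = T2.s AND T1.s = T3.s
  AND T2.o = T4.s AND T3.o = T5.s
  AND T4.o > 4.0
  AND T5.o = 'Stanley Kubrick'
"""

names = [row[0] for row in conn.execute(SQL)]
print(names)
conn.close()
```

Only "The Shining" qualifies, since "A Clockwork Orange" has no related book in this subset. Every additional triple pattern in the SPARQL query adds another alias of `T`, which is the self-join problem the slide points out.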
42. Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008]
Strings are mapped to ids using a mapping table
Create indexes for permutations of the three columns: SPO, SOP, PSO, POS, OPS, OSP
Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join
Encoded triple table:
Subject  Property  Object
0        1         2
0        3         4
5        6         7
8        9         5
...      ...       ...
Mapping table:
ID  Value
0   mdb:film/2014
1   rdfs:label
2   "The Shining"
3   movie:initial_release_date
4   "1980-05-23"
5   mdb:director/8476
6   movie:director_name
7   "Stanley Kubrick"
8   mdb:film/2685
9   movie:director
Advantages
  Eliminates some of the joins: they become range queries
  Merge join is easy and fast
Disadvantages
  Space usage
  Expensive updates
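A minimal Python sketch of the idea follows: strings are dictionary-encoded into integer ids, all six column permutations are kept as sorted lists, and a triple pattern with bound positions becomes a binary-search range lookup. The function names and the use of sorted lists with `bisect` in place of B+-trees are assumptions made for this illustration; RDF-3X and Hexastore use their own compressed index structures.

```python
from bisect import bisect_left, bisect_right
from itertools import permutations

triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:initial_release_date", "1980-05-23"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
]

# Mapping table: every distinct string gets the next free integer id.
ids = {}


def encode(term):
    return ids.setdefault(term, len(ids))


encoded = [tuple(encode(term) for term in triple) for triple in triples]

# One sorted index per permutation of the columns: spo, sop, pso, ...
indexes = {}
for order in permutations("spo"):
    positions = ["spo".index(c) for c in order]
    indexes["".join(order)] = sorted(
        tuple(t[i] for i in positions) for t in encoded
    )


def range_lookup(order, prefix):
    """Return all entries of one index that start with the given prefix."""
    index = indexes[order]
    lo = bisect_left(index, prefix)
    upper = prefix + (float("inf"),) * (3 - len(prefix))
    hi = bisect_right(index, upper)
    return index[lo:hi]


# Pattern "?m movie:director mdb:director/8476": P and O are bound, so the
# POS index answers it with a single contiguous range.
decode = {i: term for term, i in ids.items()}
hits = range_lookup("pos", (ids["movie:director"], ids["mdb:director/8476"]))
films = [decode[s] for _, _, s in hits]
print(films)
```

Because each index is sorted, two patterns that share a variable can be read in the same order from suitable indexes and combined with a merge join. The price is visible in `indexes`: every triple is stored six times, and each insert has to update all six lists.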
45. Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea et al., 2013]
Clustered property table: group together the properties that tend to occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type of property into one property table
Triple table:
Subject          Property          Object
mdb:film/2014    rdfs:label        "The Shining"
mdb:film/2014    movie:director    mdb:director/8476
mdb:film/2685    movie:director    mdb:director/8476
mdb:film/2685    rdfs:label        "A Clockwork Orange"
mdb:actor/29704  movie:actor_name  "Jack Nicholson"
...              ...               ...
Resulting property tables:
Subject        rdfs:label            movie:director
mdb:film/2014  "The Shining"         mdb:director/8476
mdb:film/2685  "A Clockwork Orange"  mdb:director/8476

Subject          movie:actor_name
mdb:actor/29704  "Jack Nicholson"
Advantages
  Fewer joins
  If the data is structured, we have a relational system, similar to normalized relations
Disadvantages
  Potentially a lot of NULLs
  Clustering is not trivial
  Multi-valued properties are complicated
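The following Python sketch builds clustered property tables from a triple list. The clustering itself (which properties belong together) is supplied by hand through the hypothetical `clusters` dictionary, since choosing it automatically is the non-trivial part; the extra "Spartacus" row without a director is included only to show where NULLs come from.

```python
from collections import defaultdict

triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:film/424", "rdfs:label", "Spartacus"),
    ("mdb:actor/29704", "movie:actor_name", "Jack Nicholson"),
]

# Assumed clustering: table name -> ordered list of its property columns.
clusters = {
    "film": ["rdfs:label", "movie:director"],
    "actor": ["movie:actor_name"],
}


def build_property_tables(triples, clusters):
    """Group triples by subject into one wide row per subject and table."""
    table_of = {p: name for name, props in clusters.items() for p in props}
    rows = {name: defaultdict(dict) for name in clusters}
    for s, p, o in triples:
        # Note: a second value for the same (s, p) would overwrite the first,
        # which is why multi-valued properties need extra handling.
        rows[table_of[p]][s][p] = o
    return {
        name: {
            s: tuple(values.get(p) for p in clusters[name])  # None = NULL
            for s, values in subjects.items()
        }
        for name, subjects in rows.items()
    }


tables = build_property_tables(triples, clusters)
for subject, row in tables["film"].items():
    print(subject, row)
```

A query asking for the label and director of a film now reads one row instead of joining two triples, while the `None` in the "Spartacus" row corresponds to a NULL in the relational table.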
46. Vertical Partitioning
Binary Tables [Abadi et al., 2007, 2009]:
  Grouping by properties: for each property, build a two-column table, containing both subject and object, ordered by subjects
  n two-column tables (n is the number of unique properties in the data)
Triple table:
Subject          Property          Object
mdb:film/2014    rdfs:label        "The Shining"
mdb:film/2014    movie:director    mdb:director/8476
mdb:film/2685    movie:director    mdb:director/8476
mdb:film/2685    rdfs:label        "A Clockwork Orange"
mdb:actor/29704  movie:actor_name  "Jack Nicholson"
...              ...               ...
Table movie:director:
Subject        Object
mdb:film/2014  mdb:director/8476
mdb:film/2685  mdb:director/8476
Table rdfs:label:
Subject        Object
mdb:film/2014  "The Shining"
mdb:film/2685  "A Clockwork Orange"
Table movie:actor_name:
Subject          Object
mdb:actor/29704  "Jack Nicholson"
48. Vertical Partitioning
Advantages
  Supports multi-valued properties
  No NULLs
  No clustering
  Read only needed attributes (i.e. less I/O)
  Good performance for subject-subject joins
Disadvantages
  Not useful for subject-object joins
  Expensive inserts
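The binary-table layout and its strength on subject-subject joins can be sketched in a few lines of Python. The function names are illustrative assumptions; in the systems cited, the two-column tables live in a column store rather than in Python lists.

```python
from collections import defaultdict

triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:actor/29704", "movie:actor_name", "Jack Nicholson"),
]


def vertical_partition(triples):
    """Build one (subject, object) table per property, sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows) for p, rows in tables.items()}


def merge_join_on_subject(left, right):
    """Merge join of two subject-sorted tables on the subject column."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            s = left[i][0]
            # Find the run of equal subjects on both sides (multi-valued
            # properties simply produce longer runs).
            i_end = i
            while i_end < len(left) and left[i_end][0] == s:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == s:
                j_end += 1
            out.extend(
                (s, lo, ro) for _, lo in left[i:i_end] for _, ro in right[j:j_end]
            )
            i, j = i_end, j_end
    return out


tables = vertical_partition(triples)
joined = merge_join_on_subject(tables["rdfs:label"], tables["movie:director"])
print(joined)
```

Both inputs are already ordered by subject, so the join is a single linear pass. A subject-object join (for example from `movie:director` objects to `movie:director_name` subjects) gets no such help, because the tables are not sorted on the object column.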
49. Vertical Partitioning
TripleBit [Yuan et al., 2013]:
  Create a table with |triple| columns, |objects| + |subjects| rows with "1" if object/subject exists in triple; groups columns by predicate
  Compress columns (since they are sparse); partition by predicate, then partition into chunks
  (P,S,O) and (P,O,S) indexes on the chunks
52. Graph-based Approach [Zou and Özsu, 2017]
Answering SPARQL query ≡ subgraph matching using homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Aluç et al., 2013]
[Figure: the query graph of the example query (?m, ?d, ?name, ?b, ?r, "Stanley Kubrick", FILTER(?r > 4.0)) is matched by subgraph matching against the RDF data graph of the movie triples (films, actors, the director, the related book, geo and language resources)]
Advantages
  Maintains the graph structure
  Full set of queries can be handled
Disadvantages
  Graph pattern matching is expensive
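To show what "subgraph matching using homomorphism" means operationally, here is a small backtracking matcher in Python. It is a naive sketch for illustration only, not the algorithm used by gStore or chameleon-db; the helper name `match` and the reduced data graph are assumptions. Variables are strings starting with `?`, and two different variables are allowed to map to the same node, which is what distinguishes a homomorphism from an isomorphism.

```python
data = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("bm:books/0743424425", "rev:rating", 4.7),
]

query = [
    ("?m", "rdfs:label", "?name"),
    ("?m", "movie:director", "?d"),
    ("?d", "movie:director_name", "Stanley Kubrick"),
    ("?m", "movie:relatedBook", "?b"),
    ("?b", "rev:rating", "?r"),
]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match(patterns, triples, binding=None):
    """Yield every variable binding that maps all query edges onto data edges."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for q, d in zip(first, triple):
            if is_var(q):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(q, d) != d:
                    break
            elif q != d:
                break
        else:
            yield from match(rest, triples, candidate)


# The FILTER is applied to the bindings produced by the pattern match.
names = [b["?name"] for b in match(query, data) if b["?r"] > 4.0]
print(names)
```

The nested loop over all data edges for every query edge makes the cost of pattern matching obvious, which is why graph-based systems invest in indexes that prune candidates before this step.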
55. Two Systems
gStore
[Figure: "Fig. 4. System Architecture" of gStore. Offline part: RDF data, RDF Parser, RDF Graph Builder, Encoding Module, VS*-tree builder; storage: Key-Value Store, VS*-tree Store; online part: SPARQL query, SPARQL Parser, Encoding Module, Filter Module, Join Module, Results. Intermediate artifacts: RDF triples, RDF graph, signature graph, query graph, node candidates.]
Excerpt from the paper shown on the slide (truncated): "... bitstrings, denoted as vSig(u). We encode query Q with the same encoding method. Consequently, the match between Q and G can be verified by simply checking the match between corresponding encoded bitstrings. Given a vertex u, we encode each of its adjacent edges e(eLabel, nLabel) into a bitstring, where eLabel is the edge ..."
12,000 lines of C++ code under Linux (plus code for SPARQL parser)
Encode each vertex of RDF graph as a bit array capturing the neighbourhood relationship (G*)
Build a multilevel summary tree index (VS*-tree) to capture "connections"
Encode the query graph similarly (Q*)
Find candidate matching nodes of Q* in G* using VS*-tree
Multiway join of the candidates
[Figure: example VS*-tree over signature graphs G1, G2 and G3 for data vertices 001 to 011; the bitstring values are omitted here]
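The signature idea can be sketched as follows: every adjacent edge of a vertex sets a few bits, the bits are OR-ed into one vertex signature, and a data vertex remains a candidate for a query vertex only if it contains all bits of the query signature. This is a simplified illustration and not gStore's actual encoding: the 16-bit width, the single MD5-based hash per label, and the function names are assumptions made for the sketch.

```python
import hashlib

HALF = 8  # assumed: 8 bits for edge labels, 8 bits for neighbour labels


def label_bit(text, offset):
    """Hash a label to one bit position inside its half of the signature."""
    digest = hashlib.md5(text.encode()).digest()
    return 1 << (offset + digest[0] % HALF)


def vertex_signature(edges):
    """OR together the bits of all adjacent edges (eLabel, nLabel).

    nLabel is None when the neighbour is a query variable, in which case
    only the edge label contributes bits.
    """
    sig = 0
    for e_label, n_label in edges:
        sig |= label_bit(e_label, 0)
        if n_label is not None:
            sig |= label_bit(n_label, HALF)
    return sig


def may_match(query_sig, data_sig):
    """A data vertex is a candidate only if it covers every query bit."""
    return (query_sig & data_sig) == query_sig


# Data vertex mdb:film/2014 and query vertex ?m from the running example.
film_sig = vertex_signature([
    ("rdfs:label", "The Shining"),
    ("movie:director", "mdb:director/8476"),
    ("movie:relatedBook", "bm:books/0743424425"),
])
query_sig = vertex_signature([
    ("rdfs:label", None),
    ("movie:director", None),
    ("movie:relatedBook", None),
])

print(may_match(query_sig, film_sig))  # the true match is never filtered out
print(may_match(query_sig, 0))         # a vertex without edges is pruned
```

The test is a filter, not a proof: hash collisions can let a non-matching vertex through, so surviving candidates still have to be verified, which is the role of the join step. A tree of OR-ed signatures such as the VS*-tree applies the same containment test to whole groups of vertices at once.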
56. Two Systems
chameleon-db
35,000 lines of C++ code under Linux (plus code for SPARQL 1.0 parser)
Adaptivity to workload due to variability of Web workloads and the variability of composition of SPARQL triple patterns
An experiment [Aluç et al., 2014a]
  No single system is a sole winner across all queries
  No single system is the sole loser across all queries, either
  2 to 5 orders of magnitude difference in the performance between the best and the worst system for a given query
  The winner in one query may timeout in another
  Performance difference widens as dataset size increases
Group-by-query approach [Aluç et al., 2014b]
[Figure: chameleon-db architecture. Query Engine (Plan Generation, Evaluation); Storage System (Structural Index, Vertex Index, Spill Index, Cluster Index); Storage Advisor]
58. Remember the Environment
Distributed environment
Some (not all) of the data sites can process SPARQL queries: SPARQL endpoints
  See next section
Alternatives
  Cloud-based approaches
  Data re-distribution + query decomposition
  Data re-distribution + partial evaluation
62. Cloud-based Solutions [Kaoudi and Manolescu, 2015]
RDF data warehouse D is partitioned ({D1, ..., Dn}) and placed on cloud platforms (such as HDFS, HBase)
SPARQL query is run through MapReduce jobs
Data parallel execution
Examples: HARD [Rohloff and Schantz, 2010], HadoopRDF [Husain et al., 2011], EAGRE [Zhang et al., 2013] and JenaHBase [Khadilkar et al., 2012]
High scalability and fault-tolerance
Possibly low performance since MapReduce is not suitable for graph processing
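How a join between two triple patterns turns into a MapReduce job can be simulated in plain Python. The sketch below runs in a single process and does not use Hadoop or any of the cited systems; the function names, the two hand-made partitions, and the hard-coded query (`?m rdfs:label ?name . ?m movie:director mdb:director/8476`) are assumptions for illustration.

```python
from collections import defaultdict

# Two partitions of the data, as they might sit on different nodes.
partitions = [
    [("mdb:film/2014", "rdfs:label", "The Shining"),
     ("mdb:film/2685", "movie:director", "mdb:director/8476")],
    [("mdb:film/2014", "movie:director", "mdb:director/8476"),
     ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
     ("mdb:film/1267", "rdfs:label", "The Last Tycoon")],
]


def map_triple(triple):
    """Map: emit (join key ?m, tagged value) for triples matching a pattern."""
    s, p, o = triple
    if p == "rdfs:label":
        yield s, ("label", o)
    elif p == "movie:director" and o == "mdb:director/8476":
        yield s, ("directed", o)


def reduce_film(key, values):
    """Reduce: a film qualifies if both patterns matched for the same ?m."""
    if any(tag == "directed" for tag, _ in values):
        for tag, value in values:
            if tag == "label":
                yield value


# Shuffle: group the mapper output by key across all partitions.
groups = defaultdict(list)
for partition in partitions:
    for triple in partition:
        for key, value in map_triple(triple):
            groups[key].append(value)

names = sorted(
    name for key, values in groups.items() for name in reduce_film(key, values)
)
print(names)
```

Each join variable needs its own shuffle of this kind, so a query with several join variables becomes a chain of jobs, which is one reason such systems scale well but can be slow on graph-shaped queries.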
67. Partition-based Approaches
(Offline) Partition an RDF data warehouse (graph) into several fragments that are distributed to sites
  RDF data D = {D1, ..., Dn}
  Allocate each Di to a site
Partitioning alternatives
  Table-based (e.g., [Husain et al., 2011])
  Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013])
  Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013])
(Online) SPARQL query decomposed Q = {Q1, ..., Qk}: query graph is decomposed
  Distributed execution of {Q1, ..., Qk} over {D1, ..., Dn}
Examples: GraphPartition [Huang et al., 2011], WARP [Hose and Schenkel, 2013], Partout [Galarraga et al., 2014], Vertex-block [Lee and Liu, 2013]
High performance
Great for parallelizing centralized RDF data
May not be possible to re-partition and re-allocate Web data (i.e., LOD)
Each approach requires a specific partitioning strategy: no generic partitioning
Query decomposition may not be easy
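As one concrete and deliberately simple partitioning strategy, the sketch below hashes each triple's subject to pick a site. This is an assumption chosen for illustration and is not the scheme of any system cited above. Its useful property is that all triples of a subject end up at the same site, so a star-shaped subquery (all patterns sharing the subject variable) can be evaluated locally at every site and the results simply unioned.

```python
import zlib
from collections import defaultdict

triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:film/1267", "rdfs:label", "The Last Tycoon"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
]


def partition_by_subject(triples, n_sites):
    """Offline step: allocate every triple to the site chosen by its subject."""
    fragments = [[] for _ in range(n_sites)]
    for s, p, o in triples:
        # crc32 is used because it is stable across runs, unlike hash().
        fragments[zlib.crc32(s.encode()) % n_sites].append((s, p, o))
    return fragments


def star_query_at_site(fragment, director):
    """Local evaluation of: ?m rdfs:label ?name . ?m movie:director <director>.

    Assumes single-valued properties, which holds for this small data set.
    """
    by_subject = defaultdict(dict)
    for s, p, o in fragment:
        by_subject[s][p] = o
    return [
        props["rdfs:label"]
        for props in by_subject.values()
        if props.get("movie:director") == director and "rdfs:label" in props
    ]


fragments = partition_by_subject(triples, n_sites=3)
# Online step: run the star subquery at every site and union the answers.
names = sorted(
    name
    for fragment in fragments
    for name in star_query_at_site(fragment, "mdb:director/8476")
)
print(names)
```

A query that also asks for the director's name is no longer a single star: it has to be decomposed into two star subqueries whose results are joined across sites, which is where the difficulty of query decomposition begins.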
71. Partial Query Evaluation (PQE)
RDF data warehouse is partitioned and distributed as before
RDF data D = {D1, . . . , Dn}
Allocate each Di to a site
SPARQL query is not decomposed
Partial query evaluation: Distributed gStore [Peng et al., 2016]
$f(x) \rightarrow f(s, d) \rightarrow f(f(s), d) \rightarrow$ Final Answer
In $f(s, d)$, $s$ denotes the known inputs and $d$ the unknown inputs; $f(s)$ yields the partial results
Query is the function and each Di is the known input
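The idea can be illustrated with a very small sketch: a site applies the query to the fragment it already holds (the known input) and returns both the partial result and a residual function that finishes the computation once the remaining input arrives. The single-pattern "query", the sample triples, and the names `partial_evaluate` and `finish` are assumptions for illustration, far simpler than what Distributed gStore does.

```python
def partial_evaluate(pattern_matches, known_fragment):
    """Evaluate the part of the query that depends only on the known input."""
    local = [t for t in known_fragment if pattern_matches(t)]

    def finish(remote_matches):
        # Complete the computation once results for the unknown input arrive
        return sorted(set(local) | set(remote_matches))

    return local, finish

# Hypothetical query: all triples with predicate g:name
is_name = lambda t: t[1] == "g:name"
d1 = [
    ("ex:Canada", "g:name", "Canada"),
    ("ex:Toronto", "g:parentFeature", "ex:Canada"),
]
local, finish = partial_evaluate(is_name, d1)
answer = finish([("ex:Toronto", "g:name", "Toronto")])
```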
74. Distributed SPARQL Using PQE [Peng et al., 2016]
Two steps:
1. Evaluate a query at each site to find local matches
These are local partial matches
2. Assemble the partial matches to get the final result
Crossing match
Centralized assembly
Distributed assembly
[Figure: an RDF graph partitioned into fragments D1, D2, D3, and D4, with a crossing match that spans fragments]
Advantages
High performance due to parallelization
Do not have to deal with query decomposition
Disadvantages
May not be possible to re-partition and re-allocate Web data (i.e., LOD)
RDF storage sites need to be modified to handle partial query processing
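The two steps can be sketched as follows. As a simplifying assumption, a local partial match here is just the set of bindings for a single triple pattern, whereas Peng et al. compute larger partial matches at each site; the function names and the two-fragment data set are hypothetical.

```python
def match_pattern(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if binding.get(p, t) != t:
                return None
            binding[p] = t
        elif p != t:
            return None
    return binding

def local_partial_matches(query, fragment):
    """Step 1: at one site, collect bindings for every pattern of the query."""
    return [[b for t in fragment if (b := match_pattern(pat, t)) is not None]
            for pat in query]

def assemble(per_site):
    """Step 2 (centralized assembly): pool bindings per pattern, then join."""
    results = [{}]
    for i in range(len(per_site[0])):
        candidates = [b for site in per_site for b in site[i]]
        results = [{**r, **b} for r in results for b in candidates
                   if all(r.get(k, v) == v for k, v in b.items())]
    return results

query = [("?x", "p", "?y"), ("?y", "q", "?z")]
d1 = [("a", "p", "b")]                     # fragment at site 1
d2 = [("b", "q", "c"), ("e", "q", "f")]    # fragment at site 2
matches = assemble([local_partial_matches(query, d) for d in (d1, d2)])
```

The only answer in this toy data is a crossing match: its first edge lives in `d1` and its second in `d2`, so neither site could have produced it alone.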
75. Outline
1 RDF Technology [Özsu, 2016]
Data Warehousing Approach
Distributed RDF Processing
2 Federated RDF Systems
SPARQL Endpoint Federation
General RDF Federation
3 LOD: Live Querying Approach [Hartig, 2013a]
Traversal-based approaches
Index-based approaches
Hybrid approaches
4 Conclusions
77. Remember the Environment
Distributed environment
SPARQL endpoints can process SPARQL queries
Non-SPARQL endpoints require additional components
Issues
Query decomposition
Localization (source selection)
Result composition
80. SPARQL Endpoint Federation
No data re-partitioning/re-distribution
Consider D = D1 ∪ D2 ∪ . . . ∪ Dn, where each Di is a SPARQL endpoint
SPARQL query decomposed Q = {Q1, . . . , Qk}
Distributed execution of {Q1, . . . , Qk} over {D1, . . . , Dn}
E.g.: SPLENDID [Görlitz and Staab, 2011], ANAPSID [Acosta et al., 2011]
[Figure: a control site with a metadata repository federating the RDF sources Jamendo, SWDogFood, GeoNames, LinkedMDB, DBPedia, and NYTimes]
Metadata
Data included at the source
Supported access patterns
Statistical information
. . .
81. SPARQL Endpoint Federation
Data integration approach
May be the only way to proceed if RDF data is already
distributed with autonomous owners
Not all RDF data storage points are SPARQL endpoints
82. Not All RDF Storage Sites are SPARQL Endpoints
Use the mediator-wrapper paradigm
Wrappers provide SPARQL endpoint functionality
Mediators may be introduced if wrappers are thin
E.g.: DARQ [Quilitz and Leser, 2008], FedX [Schwarte et al., 2011b]
[Figure: the federation of RDF sources (Jamendo, SWDogFood, GeoNames, LinkedMDB, DBPedia, NYTimes) under a control site with metadata, extended with RDF Storage A and RDF Storage B, which are accessed through wrappers and a mediator]
83. Federated Query Processing
[Figure: federated query processing at the control site. Query → Decomposition & Source Selection → SPARQL queries → Local Evaluation at the RDF sources (Jamendo, SWDogFood, GeoNames, LinkedMDB, DBPedia, NYTimes) → Join Partial Matches → SPARQL matches]
86. Query Decomposition
Each triple pattern is mapped to a set of relevant RDF sources based on the
values of its subject, property, and object.
SELECT ?x ?n
WHERE {
  ?x g:parentFeature ?k .
  ?k g:name "Canada" .
  ?y owl:sameAs ?x .
  ?y n:topicPage ?n .
}
Relevant sources of the four triple patterns, in order: {GeoNames}, {GeoNames}, {DBPedia, GeoNames, NYTimes, SWDogFood, LinkedMDB}, {NYTimes}
The query is decomposed into subqueries q1 @ {GeoNames}, q2 @ {. . .}, and q3 @ {NYTimes}:
q1:
SELECT ?x
WHERE {
  ?x g:parentFeature ?k .
  ?k g:name "Canada" .
}
q2:
SELECT ?y ?x
WHERE {
  ?y owl:sameAs ?x .
}
q3:
SELECT ?y ?n
WHERE {
  ?y n:topicPage ?n .
}
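The grouping behind this decomposition can be sketched in a few lines. The rule used here, merging consecutive triple patterns that have the same set of relevant sources into one subquery, is an assumption for illustration; real federation engines apply more elaborate rules. The source sets are the ones listed on the slide, and `decompose` is a hypothetical helper.

```python
def decompose(patterns_with_sources):
    """Group consecutive triple patterns that share the same relevant source set."""
    subqueries = []
    for pattern, sources in patterns_with_sources:
        sources = frozenset(sources)
        if subqueries and subqueries[-1][1] == sources:
            subqueries[-1][0].append(pattern)
        else:
            subqueries.append(([pattern], sources))
    return subqueries

query = [
    ("?x g:parentFeature ?k", {"GeoNames"}),
    ('?k g:name "Canada"', {"GeoNames"}),
    ("?y owl:sameAs ?x",
     {"DBPedia", "GeoNames", "NYTimes", "SWDogFood", "LinkedMDB"}),
    ("?y n:topicPage ?n", {"NYTimes"}),
]
subqueries = decompose(query)
for i, (patterns, sources) in enumerate(subqueries, start=1):
    print(f"q{i} @ {sorted(sources)}: {patterns}")
```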
87. Data Localization
Metadata-based approaches
Use the information in the metadata repository to determine
which sources are relevant
DARQ [Quilitz and Leser, 2008]
QTree [Harth et al., 2010; Prasser et al., 2012]
HiBISCuS [Saleem and Ngomo, 2014]
. . .
ASK query-based approach
Asking whether or not a triple pattern has an answer at a
source
FedX [Schwarte et al., 2011a,b]
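A minimal sketch of the ASK-based approach follows. Each endpoint is simulated by an in-memory list of triples, and `ask` stands in for sending `ASK { <triple pattern> }` to a real SPARQL endpoint; the endpoint contents and the helper names are assumptions for illustration and do not reproduce FedX's implementation.

```python
def ask(endpoint_triples, pattern):
    """Simulate ASK: True if at least one triple matches the pattern.
    Simplification: repeated variables in a pattern are not checked."""
    return any(
        all(p.startswith("?") or p == t for p, t in zip(pattern, triple))
        for triple in endpoint_triples
    )

def select_sources(pattern, endpoints):
    """Return the names of the endpoints that can contribute to the pattern."""
    return {name for name, triples in endpoints.items() if ask(triples, pattern)}

# Hypothetical endpoint contents
endpoints = {
    "GeoNames": [("geo:Toronto", "g:parentFeature", "geo:Canada")],
    "NYTimes": [("nyt:Toronto", "n:topicPage", "http://example.org/toronto")],
}
geo_sources = select_sources(("?x", "g:parentFeature", "?k"), endpoints)
nyt_sources = select_sources(("?y", "n:topicPage", "?n"), endpoints)
```

Compared with the metadata-based approaches, this needs no precomputed statistics, at the cost of one probe per triple pattern and endpoint unless the answers are cached.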
88. Query Processing over Federated RDF Systems
[Figure: using its metadata, the control site decomposes the query and sends the subqueries to the RDF sources Jamendo, SWDogFood, GeoNames, LinkedMDB, DBPedia, and NYTimes]
Query:
SELECT ?x ?n
WHERE {
  ?x g:parentFeature ?k .
  ?k g:name "Canada" .
  ?y owl:sameAs ?x .
  ?y n:topicPage ?n .
}
Subqueries (the sameAs subquery appears five times in the figure, once for each of its relevant sources):
SELECT ?x
WHERE {
  ?x g:parentFeature ?k .
  ?k g:name "Canada" .
}
SELECT ?y ?x
WHERE {
  ?y owl:sameAs ?x .
}
SELECT ?y ?n
WHERE {
  ?y n:topicPage ?n .
}
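Once the sources return their bindings, the control site has to join them. The sketch below uses a simple hash join on the shared variables; the binding values are made up, `hash_join` is a hypothetical helper, and it assumes that all rows of one input bind the same variables.

```python
from collections import defaultdict

def hash_join(left, right):
    """Join two lists of bindings (dicts) on the variables they share."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    table = defaultdict(list)
    for row in left:
        table[tuple(row[v] for v in shared)].append(row)
    return [{**l, **r}
            for r in right
            for l in table.get(tuple(r[v] for v in shared), [])]

# Hypothetical results of the three subqueries
q1 = [{"?x": "geo:Toronto"}]
q2 = [{"?y": "nyt:Toronto", "?x": "geo:Toronto"},
      {"?y": "nyt:Paris", "?x": "geo:Paris"}]
q3 = [{"?y": "nyt:Toronto", "?n": "http://example.org/toronto"}]

joined = hash_join(hash_join(q1, q2), q3)
answer = [{"?x": r["?x"], "?n": r["?n"]} for r in joined]   # SELECT ?x ?n
```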
89. UniProt Federation: EBI RDF Platform
Curated computational models of biological processes
Sample information for reference samples and samples for which data exist in one of the EBI's assay databases
Curated chemical database of bioactive molecules with drug-like properties
Genome databases for vertebrates and other eukaryotic species
Gene expression data from the Gene Expression Atlas
Curated and peer-reviewed pathways
90. Federated Access to UniProt Collection
Get the Reactome pathways where Q16850 is associated, then get all the
other proteins in that pathway and pull out their expression from the atlas,
along with the GO annotations from UniProt
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX upc: <http://purl.uniprot.org/core/>
SELECT DISTINCT ?pathwayname ?expressionValue ?golabel
WHERE {
  # Get the pathways that reference Q16850
  ?pathway rdf:type biopax3:Pathway .
  ?pathway biopax3:displayName ?pathwayname .
  ?pathway biopax3:pathwayComponent
    [?rel [biopax3:entityReference ?dbXref]] .
  ?pathway biopax3:pathwayComponent
    [?rel [biopax3:entityReference <http://purl.uniprot.org/uniprot/Q16850>]] .
  # Get the expression for those proteins
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?value rdfs:label ?expressionValue .
    ?value atlasterms:pValue ?pvalue .
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref ?dbXref .
  }
  # Get the GO functions from UniProt
  SERVICE <http://uniprot.org/sparql> {
    ?dbXref a upc:Protein ;
      upc:classifiedWith ?keyword .
    ?keyword rdfs:seeAlso ?goid .
    ?goid rdfs:label ?golabel .
  }
}
92. Live Query Processing
Not all data resides at SPARQL endpoints
Freshness of access to data is important
Potentially countably infinite data sources
Live querying
On-line execution
Only rely on linked data principles
Alternatives
Traversal-based approaches
Index-based approaches
Hybrid approaches
94. Linked Data Model [Hartig, 2012]
Web of Linked Data
Given a finite or countably infinite set $\mathcal{D}$ of Linked Documents, a
Web of Linked Data is a tuple $W = (D, adoc, data)$ where:
$D \subseteq \mathcal{D}$,
$adoc$ is a partial mapping from URIs to $D$, and
$data$ is a total mapping from $D$ to finite sets of RDF triples.
Data Links
A Web of Linked Data $W = (D, adoc, data)$ contains a data link from document $d \in D$ to
document $d' \in D$ if there exists a URI $u$ such that:
$u$ is mentioned in an RDF triple $t \in data(d)$, and
$d' = adoc(u)$.
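The definitions translate almost directly into data structures. In the sketch below, `adoc` is a dictionary from URIs to document identifiers (partial, since not every URI resolves) and `data` maps every document to its set of triples; the two-document web and the function name `data_links` are assumptions for illustration.

```python
def data_links(adoc, data):
    """Return all pairs (d, d2) such that some URI u mentioned in a triple
    of data[d] satisfies adoc[u] == d2."""
    links = set()
    for d, triples in data.items():
        for triple in triples:
            for term in triple:
                target = adoc.get(term)   # None if the term does not resolve
                if target is not None:
                    links.add((d, target))
    return links

# Hypothetical Web of Linked Data with two documents
adoc = {"http://ex.org/a": "docA", "http://ex.org/b": "docB"}
data = {
    "docA": {("http://ex.org/a", "http://ex.org/knows", "http://ex.org/b")},
    "docB": {("http://ex.org/b", "http://ex.org/name", "B")},
}
links = data_links(adoc, data)
```

Note that the definition also admits links from a document to itself, which arise here because each document mentions the URI that resolves to it.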
96. SPARQL Query Semantics in Live Querying
Full-web semantics
Scope of evaluating a SPARQL expression is all Linked Data
Query result completeness cannot be guaranteed by any
(terminating) execution
Reachability-based query semantics
Query consists of a SPARQL expression, a set of seed URIs S,
and a reachability condition c
Scope: all data along paths of data links that satisfy the
condition
Computationally feasible
99. Traversal Approaches
Discover relevant URIs recursively by traversing (specific) data links at query execution runtime [Hartig, 2013b; Ladwig and Tran, 2011]
Implements reachability-based query semantics
Start from a set of seed URIs
Recursively follow and discover new URIs
Important issue is selection of seed URIs
Retrieved data serves to discover new URIs and to construct result
Advantages
Easy to implement.
No data structure to maintain.
Disadvantages
Possibilities for parallelized data retrieval are limited
Repeated data retrieval introduces significant query latency.
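A minimal sketch of traversal-based retrieval is shown below. The web is simulated by a dictionary, `lookup` stands in for an HTTP GET on a URI, and the `follow` predicate plays the role of the reachability condition; all names and the three-document web are assumptions for illustration, not the design of SQUIN or any other cited system.

```python
from collections import deque

def traverse(seeds, lookup, follow=lambda triple, uri: True):
    """Collect the triples reachable from the seed URIs by following data links."""
    seen, queue, collected = set(seeds), deque(seeds), set()
    while queue:
        uri = queue.popleft()
        for triple in lookup(uri):        # stands in for an HTTP GET of the URI
            collected.add(triple)
            for term in triple:
                if (term.startswith("http://") and term not in seen
                        and follow(triple, term)):
                    seen.add(term)
                    queue.append(term)
    return collected

# Hypothetical web: document c is not linked from a or b
web = {
    "http://ex.org/a": [("http://ex.org/a", "http://ex.org/knows", "http://ex.org/b")],
    "http://ex.org/b": [("http://ex.org/b", "http://ex.org/name", "B")],
    "http://ex.org/c": [("http://ex.org/c", "http://ex.org/name", "C")],
}
lookup = lambda uri: web.get(uri, [])
reachable = traverse(["http://ex.org/a"], lookup)
seed_only = traverse(["http://ex.org/a"], lookup, follow=lambda t, u: False)
```

The loop retrieves one document at a time because each lookup may reveal the next URI to fetch, which is the source of the limited parallelism noted above.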
102. Index Approaches
Use pre-populated index to determine relevant URIs (and to avoid as many irrelevant ones as possible)
Different index keys possible; e.g., triple patterns [Umbrich et al., 2011]
Each index entry is a set of URIs
Indexed URIs may appear multiple times (i.e., associated with multiple index keys)
Each URI in such an entry may be paired with a cardinality (utilized for source ranking)
Key: tp; Entry: {uri1, uri2, . . . , urin}; each urii is retrieved with GET urii
Advantages
Data retrieval can be fully parallelized
Reduces the impact of data retrieval on query execution time
Disadvantages
Querying can only start after index construction
Depends on what has been selected for the index
Freshness may be an issue
Index maintenance
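The following sketch shows one way such an index could be used: triple patterns are the keys, each entry maps URIs to a cardinality, sources are ranked by summed cardinality, and all lookups are then issued in parallel. The index contents, the ranking function, and the `get` callback (a stand-in for an HTTP GET) are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical index: triple pattern -> {URI: cardinality}
index = {
    ("?s", "g:name", "?o"): {"http://ex.org/doc1": 12, "http://ex.org/doc2": 3},
    ("?s", "n:topicPage", "?o"): {"http://ex.org/doc2": 5},
}

def relevant_uris(query_patterns, index):
    """Rank URIs by their summed cardinality over all patterns of the query."""
    scores = {}
    for tp in query_patterns:
        for uri, card in index.get(tp, {}).items():
            scores[uri] = scores.get(uri, 0) + card
    return sorted(scores, key=lambda u: (-scores[u], u))

def fetch_all(uris, get):
    """Look up all URIs in parallel; `get` stands in for an HTTP GET."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(zip(uris, pool.map(get, uris)))

ranked = relevant_uris([("?s", "g:name", "?o"), ("?s", "n:topicPage", "?o")], index)
docs = fetch_all(ranked, lambda uri: f"contents of {uri}")
```

Because the full list of URIs is known before any document is fetched, the retrieval step does not depend on earlier lookups, unlike in the traversal-based approach.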
103. Hybrid Approach
Perform a traversal-based execution using a prioritized list of
URIs to look up [Ladwig and Tran, 2010]
Initial seed from the pre-populated index
Non-seed URIs are ranked by a function based on information
in the index
Newly discovered URIs that are not in the index are ranked
according to the number of referring documents
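A rough sketch of the prioritized lookup order follows. Seeds and their scores come from the index; URIs discovered during traversal that the index does not know are scored by how many already retrieved documents refer to them. Putting both kinds of score on one scale, the link structure, and the name `hybrid_lookup_order` are assumptions for illustration rather than the ranking of Ladwig and Tran.

```python
def hybrid_lookup_order(index_scores, lookup, limit=10):
    """Return the order in which URIs are looked up."""
    referrers = {}                    # URI -> set of documents that mention it
    pending = set(index_scores)       # initial seeds come from the index
    done = []

    def score(uri):
        # Index score if available, else the number of referring documents
        return index_scores.get(uri, len(referrers.get(uri, ())))

    while pending and len(done) < limit:
        uri = max(pending, key=lambda u: (score(u), u))
        pending.remove(uri)
        done.append(uri)
        for linked in lookup(uri):    # URIs mentioned in the retrieved document
            referrers.setdefault(linked, set()).add(uri)
            if linked not in done:
                pending.add(linked)
    return done

# Hypothetical link structure and index scores
links = {"a": ["c", "d"], "b": ["d"], "c": [], "d": []}
order = hybrid_lookup_order({"a": 5, "b": 4}, lambda u: links.get(u, []))
```

In this example `d` is looked up before `c` because two retrieved documents refer to it, while only one refers to `c`.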
105. Conclusions
RDF and Linked Open Data seem to have considerable promise for Web data management
There are prototype systems that provide alternative solutions
There are commercial systems as well
See https://www.w3.org/wiki/SparqlImplementations for a list
More work needs to be done
Query semantics
Adaptive system design
Optimizations, both in data warehousing and distributed environments
Live querying requires significant thought to reduce latency
106. Conclusions
What I did not talk about:
Not much on general distributed/parallel processing
Not much on SPARQL semantics
Nothing about RDFS: no schema stuff
Nothing about entailment regimes > 0: no reasoning
108. References I
Abadi, D. J., Marcus, A., Madden, S., and Hollenbach, K. (2009). SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J., 18(2):385–406.
Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. (2007). Scalable semantic web data management using vertical partitioning. In Proc. 33rd Int. Conf. on Very Large Data Bases, pages 411–422.
Acosta, M., Vidal, M.-E., Lampo, T., Castillo, J., and Ruckhaus, E. (2011). ANAPSID: an adaptive query processing engine for SPARQL endpoints. In Proc. 10th Int. Semantic Web Conf., pages 18–34.
Aluç, G., Hartig, O., Özsu, M. T., and Daudjee, K. (2014a). Diversified stress testing of RDF data management systems. In Proc. 13th Int. Semantic Web Conf., pages 197–212.
Aluç, G., Özsu, M. T., and Daudjee, K. (2014b). Workload matters: Why RDF databases need a new design. Proc. VLDB Endowment, 7(10):837–840.
109. References II
Aluç, G., Özsu, M. T., Daudjee, K., and Hartig, O. (2013). chameleon-db: a workload-aware robust RDF data management system. Technical Report CS-2013-10, University of Waterloo. Available at https://cs.uwaterloo.ca/sites/ca.computer-science/files/uploads/files/CS-2013-10.pdf.
Bornea, M. A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O., and Bhattacharjee, B. (2013). Building an efficient RDF store over a relational database. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 121–132.
Galarraga, L., Hose, K., and Schenkel, R. (2014). Partout: a distributed engine for efficient RDF processing. In Proc. 23rd Int. World Wide Web Conf. (Companion Volume), pages 267–268.
Görlitz, O. and Staab, S. (2011). SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In Proc. 2nd Int. Workshop on Consuming Linked Data.
110. References III
Gurajada, S., Seufert, S., Miliaraki, I., and Theobald, M. (2014). TriAD: A distributed shared-nothing RDF engine based on asynchronous message passing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 289–300.
Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K., and Umbrich, J. (2010). Data summaries for on-demand queries over linked data. In Proc. 19th Int. World Wide Web Conf., pages 411–420. Available from: http://doi.acm.org/10.1145/1772690.1772733.
Hartig, O. (2012). SPARQL for a web of linked data: Semantics and computability. In Proc. 9th Extended Semantic Web Conf., pages 8–23.
Hartig, O. (2013a). An overview on execution strategies for linked data queries. Datenbank-Spektrum, 13(2):89–99. Available from: http://dx.doi.org/10.1007/s13222-013-0122-1.
Hartig, O. (2013b). SQUIN: a traversal based query execution system for the web of linked data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1081–1084.
111. References IV
Hose, K. and Schenkel, R. (2013). WARP: Workload-aware replication and partitioning for RDF. In Proc. Workshops of 29th Int. Conf. on Data Engineering, pages 1–6.
Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. Proc. VLDB Endowment, 4(11):1123–1134.
Husain, M. F., McGlothlin, J., Masud, M. M., Khan, L. R., and Thuraisingham, B. (2011). Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Trans. Knowl. and Data Eng., 23(9):1312–1327.
Kaoudi, Z. and Manolescu, I. (2015). RDF in the clouds: A survey. VLDB J., 24:67–91.
Khadilkar, V., Kantarcioglu, M., Thuraisingham, B. M., and Castagna, P. (2012). Jena-HBase: A distributed, scalable and efficient RDF triple store. In Proc. International Semantic Web Conference Posters & Demos Track.
Ladwig, G. and Tran, T. (2010). Linked data query processing strategies. In Proc. 9th Int. Semantic Web Conf., pages 453–469.
Ladwig, G. and Tran, T. (2011). SIHJoin: Querying remote and local linked data. In Proc. 8th Extended Semantic Web Conf., pages 139–153.
112. References V
Lee, K. and Liu, L. (2013). Scaling queries over big RDF graphs with semantic hash partitioning. Proc. VLDB Endowment, 6(14):1894–1905. Available from: http://www.vldb.org/pvldb/vol6/p1894-lee.pdf.
Neumann, T. and Weikum, G. (2008). RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endowment, 1(1):647–659.
Neumann, T. and Weikum, G. (2009). The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113.
Özsu, M. T. (2016). A survey of RDF data management systems. Front. Comput. Sci., 10(3):418–432.
Peng, P., Zou, L., Özsu, M. T., Chen, L., and Zhao, D. (2016). Processing SPARQL queries over distributed RDF graphs. VLDB J., 25(2):243–268.
Prasser, F., Kemper, A., and Kuhn, K. A. (2012). Efficient distributed query processing for autonomous RDF databases. In Proc. 15th Int. Conf. on Extending Database Technology, pages 372–383.
Quilitz, B. and Leser, U. (2008). Querying distributed RDF data sources with SPARQL. In Proc. 5th European Semantic Web Conf., pages 524–538.
113. References VI
Rohloff, K. and Schantz, R. E. (2010). High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store. In Proc. Int. Workshop on Programming Support Innovations for Emerging Distributed Applications. Article No. 4.
Saleem, M. and Ngomo, A. N. (2014). HiBISCuS: Hypergraph-based source selection for SPARQL endpoint federation. In Proc. 11th Extended Semantic Web Conf., pages 176–191. Available from: http://dx.doi.org/10.1007/978-3-319-07443-6_13.
Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (2011a). FedX: A federation layer for distributed query processing on linked open data. In Proc. 8th Extended Semantic Web Conf., pages 481–486.
Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (2011b). FedX: Optimization techniques for federated query processing on linked data. In Proc. 10th Int. Semantic Web Conf., pages 601–616. Available from: https://doi.org/10.1007/978-3-642-25073-6_38.
Umbrich, J., Hose, K., Karnstedt, M., Harth, A., and Polleres, A. (2011). Comparing data summaries for processing live queries over linked data. World Wide Web J., 14(5-6):495–544.
114. References VII
Weiss, C., Karras, P., and Bernstein, A. (2008). Hexastore: sextuple indexing for semantic web data management. Proc. VLDB Endowment, 1(1):1008–1019.
Wilkinson, K. (2006). Jena property table implementation. Technical Report HPL-2006-140, HP Laboratories Palo Alto.
Yuan, P., Liu, P., Wu, B., Jin, H., Zhang, W., and Liu, L. (2013). TripleBit: a fast and compact system for large scale RDF data. Proc. VLDB Endowment, 6(7):517–528. Available from: http://www.vldb.org/pvldb/vol6/p517-yuan.pdf.
Zhang, X., Chen, L., Tong, Y., and Wang, M. (2013). EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud. In Proc. 29th Int. Conf. on Data Engineering, pages 565–576.
Zou, L., Mo, J., Chen, L., Özsu, M. T., and Zhao, D. (2011). gStore: answering SPARQL queries via subgraph matching. Proc. VLDB Endowment, 4(8):482–493.
Zou, L. and Özsu, M. T. (2017). Graph-based RDF data management. Data Science and Engineering, 2(1):56–70. Available from: https://dx.doi.org/10.1007/s41019-016-0029-6.
115. References VIII
Zou, L., Özsu, M. T., Chen, L., Shen, X., Huang, R., and Zhao, D. (2014). gStore: A graph-based SPARQL query engine. VLDB J., 23(4):565–590.