Web data management has been a topic of interest for many years, during which a number of different modelling approaches have been tried. The latest of these approaches is to use RDF (Resource Description Framework), which seems to provide a real opportunity for querying at least some web data systematically. RDF has been proposed by the World Wide Web Consortium (W3C) for modeling Web objects as part of developing the “semantic web”. W3C has also proposed SPARQL as the query language for accessing RDF data repositories. The publication of Linked Open Data (LOD) on the Web has gained tremendous momentum over the last several years, and this provides a new opportunity to accomplish web data integration. A number of approaches have been proposed for running SPARQL queries over RDF-encoded Web data: data warehousing, SPARQL federation, and live linked query execution. In this talk, I will review these approaches with particular emphasis on some of our research within the context of the gStore project (joint project with Prof. Lei Zou of Peking University and Prof. Lei Chen of Hong Kong University of Science and Technology), the chameleon-db project (joint work with Güneş Aluç, Dr. Olaf Hartig, and Prof. Khuzaima Daudjee of University of Waterloo), and live linked query execution (joint work with Dr. Olaf Hartig).
Web Data Management with RDF
1. Web Data Management in RDF Age
M. Tamer Özsu
University of Waterloo
David R. Cheriton School of Computer Science
PKU/2014-08-28 1
2. Acknowledgements
This presentation draws upon collaborative research and discussions with the following colleagues (in alphabetical order):
Güneş Aluç, University of Waterloo
Khuzaima Daudjee, University of Waterloo
Olaf Hartig, University of Waterloo
Lei Chen, Hong Kong University of Science and Technology
Lei Zou, Peking University
3. Web Data Management
A long term research interest in the DB community
[Figure: four items labeled 2000, 2004, 2011, and 2011]
4. Interest Due to Properties of Web Data
Lack of a schema
Data is at best semi-structured
Missing data, additional attributes, similar data but not identical
Volatility
Changes frequently
May conform to one schema now, but not later
Scale
Does it make sense to talk about a schema for Web?
How do you capture everything?
Querying difficulty
What is the user language?
What are the primitives?
Aren't search engines or metasearch engines sufficient?
5. More Recent Approaches to Web Querying
Fusion Tables
Users contribute data in spreadsheet, CSV, KML format
Possible joins between multiple data sets
Extensive visualization
6. More Recent Approaches to Web Querying
XML
Data exchange language
Primarily tree based structure
<list title="MOVIES">
  <film>
    <title>The Shining</title>
    <director>Stanley Kubrick</director>
    <actor>Jack Nicholson</actor>
  </film>
  <film>
    <title>Spartacus</title>
    <director>Stanley Kubrick</director>
  </film>
  <film>
    <title>The Passenger</title>
    <actor>Jack Nicholson</actor>
  </film>
  ...
</list>
[Figure: tree representation of the XML document, showing root, film, title (The Passenger), and actor (Jack Nicholson)]
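As a rough illustration of the tree-based structure, the film list above can be loaded and navigated with Python's standard xml.etree.ElementTree module. This is a minimal sketch and not part of the original slides: the helper name films_with is hypothetical, and the films elided by "..." on the slide are simply left out.

```python
import xml.etree.ElementTree as ET

# The film list from the slide; entries elided by "..." are omitted here.
MOVIES_XML = """
<list title="MOVIES">
  <film>
    <title>The Shining</title>
    <director>Stanley Kubrick</director>
    <actor>Jack Nicholson</actor>
  </film>
  <film>
    <title>Spartacus</title>
    <director>Stanley Kubrick</director>
  </film>
  <film>
    <title>The Passenger</title>
    <actor>Jack Nicholson</actor>
  </film>
</list>
""".strip()


def films_with(root, tag, value):
    """Return the titles of films that have a child <tag> whose text equals value."""
    return [
        film.findtext("title")
        for film in root.findall("film")
        if any(child.text == value for child in film.findall(tag))
    ]


root = ET.fromstring(MOVIES_XML)  # root is the <list> element
kubrick_films = films_with(root, "director", "Stanley Kubrick")
nicholson_films = films_with(root, "actor", "Jack Nicholson")
print(kubrick_films)
print(nicholson_films)
```

Every lookup walks down from the root through film elements. Note also that not every film has both a director and an actor, which is the kind of missing data the earlier slide on properties of Web data describes.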
10. More Recent Approaches to Web Querying
RDF (Resource Description Framework) and SPARQL
W3C recommendation
Simple, self-descriptive model
Building block of semantic web and Linked Open Data (LOD)
11. RDF and Semantic Web
RDF is a language for the conceptual modeling of information about resources (web resources in our context)
A building block of semantic web
Facilitates exchange of information
Search engine results can be more focused and structured
Facilitates data integration (mashes)
Machine understandable
Understand the information on the web and the interrelationships among them
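To make the "simple, self-descriptive model" more concrete: RDF data is a set of (subject, predicate, object) triples, and a SPARQL query is built from triple patterns in which some positions are variables. The sketch below re-expresses the film data from the earlier XML example as triples and evaluates two patterns in plain Python, without any RDF library. The ex: prefix, the property names, and the helpers match and subjects are assumptions made for illustration; the slides do not give an RDF encoding of this data.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical identifiers: the "ex:" names below are illustrative only.
TRIPLES: List[Triple] = [
    ("ex:TheShining", "ex:title", "The Shining"),
    ("ex:TheShining", "ex:director", "ex:StanleyKubrick"),
    ("ex:TheShining", "ex:actor", "ex:JackNicholson"),
    ("ex:Spartacus", "ex:title", "Spartacus"),
    ("ex:Spartacus", "ex:director", "ex:StanleyKubrick"),
    ("ex:ThePassenger", "ex:title", "The Passenger"),
    ("ex:ThePassenger", "ex:actor", "ex:JackNicholson"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the triples matching a pattern; None plays the role of a variable."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def subjects(triples: List[Triple], p: str, o: str) -> Set[str]:
    """Subjects bound by the pattern (?x, p, o)."""
    return {t[0] for t in match(triples, p=p, o=o)}


# Pattern 1: (?film, ex:director, ex:StanleyKubrick)
kubrick_films = subjects(TRIPLES, "ex:director", "ex:StanleyKubrick")
# Pattern 2: (?film, ex:actor, ex:JackNicholson)
nicholson_films = subjects(TRIPLES, "ex:actor", "ex:JackNicholson")
# Joining the two patterns on ?film is a set intersection here.
both = kubrick_films & nicholson_films
print(sorted(both))
```

A film without a director simply has no ex:director triple, so no schema has to be declared up front. The join on the shared variable is what a SPARQL engine has to evaluate efficiently at scale.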
12. RDF Uses
Yago and DBpedia extract facts from Wikipedia and represent them as RDF → structural queries
Communities build RDF data
E.g., biologists: Bio2RDF and Uniprot RDF
Web data integration
Linked Open Data Cloud
. . .
PKU/2014-08-28 10
13. RDF Data Volumes ...
... are growing, and fast
Linked data cloud currently consists of 325 datasets with 25B triples
Size almost doubling every year
[Figure: Linking Open Data cloud diagram as of March 2009 (89 datasets), by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/]
[Figure: Linking Open Data cloud diagram as of September 2010 (203 datasets), by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ Legend: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences.]
[Figure: Linking Open Data cloud diagram as of September 2011 (295 datasets, 25B triples), by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ Legend: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences.]
[Figure: Linked Open Data cloud as of April 2014 (1091 datasets, ??? triples). Source: Max Schmachtenberg, Christian Bizer, and Heiko Paulheim: Adoption of Linked Data Best Practices in Different Topical Domains. In Proc. ISWC, 2014.]
20. Three Approaches
Data warehousing
Consolidate data in a repository and query it
SPARQL federation
Leverage query services provided by data publishers
Live Linked Data querying
Navigate through LOD by looking up URIs at query execution time
40. ...defined
Resource descriptions can be contributed by different people/groups and can be located anywhere in the web
Integrated web database
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName Jack Nicholson
y:JN29704:BornOnDate 1937-04-22
JN29704:movieActor
y:TS2014
y:TS2014:title The Shining
y:TS2014:releaseDate 1980-05-23
41. RDF Data Model
Triple: Subject, Predicate (Property), Object $(s, p, o)$
Subject: the entity that is described (URI or blank node)
Predicate: a feature of the entity (URI)
Object: value of the feature (URI, blank node or literal)
$(s, p, o) \in (U \cup B) \times U \times (U \cup B \cup L)$
U: set of URIs
B: set of blank nodes
L: set of literals
Set of RDF triples is called an RDF graph
(mdb: abbreviates http://...imdb.../)
Subject                Predicate             Object
...
mdb:film/3418          rdfs:label            The Passenger
geo:2635167            gn:name               United Kingdom
geo:2635167            gn:population         62348447
geo:2635167            gn:wikipediaArticle   wp:United_Kingdom
bm:books/0743424425    dc:creator            bm:persons/Stephen+King
bm:books/0743424425    rev:rating            4.7
bm:books/0743424425    scom:hasOffer         bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng     rdfs:label            English
lexvo:iso639-3/eng     lvont:usedIn          lexvo:iso3166/CA
lexvo:iso639-3/eng     lvont:usesScript      lexvo:script/Latn
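The positional constraint on triples can be made concrete with a small check. This is a minimal sketch, assuming three illustrative term classes (URI, BlankNode, Literal) that are not part of any particular RDF library; it only tests membership in (U ∪ B) × U × (U ∪ B ∪ L).

```python
from dataclasses import dataclass


# Hypothetical term classes, introduced only for this illustration.
@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: object


def is_valid_triple(s, p, o) -> bool:
    """Check that (s, p, o) lies in (U ∪ B) × U × (U ∪ B ∪ L)."""
    return (
        isinstance(s, (URI, BlankNode))
        and isinstance(p, URI)
        and isinstance(o, (URI, BlankNode, Literal))
    )


# A literal is allowed as an object ...
ok = is_valid_triple(URI("geo:2635167"), URI("gn:population"), Literal(62348447))
# ... but not as a subject.
bad = is_valid_triple(Literal("United Kingdom"), URI("gn:name"), URI("geo:2635167"))
```

Here `ok` is true and `bad` is false, because only the object position admits literals and only URIs may serve as predicates.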
63. RDF Graph
[Figure: the example triples drawn as a graph. Vertices are URIs and literals (e.g., mdb:film/2014, mdb:actor/29704, geo:2635167, bm:books/0743424425, bm:offers/0743424425amazonOffer, "The Shining", "The Passenger", "The Last Tycoon", "Jack Nicholson", "United Kingdom", 62348447, 4.7, 1980-05-23); edges are labelled with predicates (rdfs:label, movie:actor, movie:actor_name, movie:initial_release_date, rev:rating, gn:name, gn:population).]
70. ...finite set $\mathcal{D}$ (documents), a Web of Linked Data is a tuple $W = (D, adoc, data)$ where:
- $D \subseteq \mathcal{D}$,
- $adoc$ is a partial mapping from URIs to $D$, and
- $data$ is a total mapping from $D$ to finite sets of RDF triples.
Web of Linked Data
A Web of Linked Data $W = (D, adoc, data)$ contains a data link from document $d \in D$ to document $d' \in D$ if there exists a URI $u$ such that:
- $u$ is mentioned in an RDF triple $t \in data(d)$, and
- $d' = adoc(u)$.
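The data-link definition can be illustrated with plain dictionaries. This is a sketch under stated assumptions: the document names (`doc_film`, `doc_actor`) and their contents are hypothetical, `adoc` is modelled as a partial dictionary from URIs to documents, and `data` as a total dictionary from documents to sets of triples.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy Web of Linked Data W = (D, adoc, data).
data: Dict[str, Set[Triple]] = {
    "doc_film": {("mdb:film/2014", "movie:actor", "mdb:actor/29704")},
    "doc_actor": {("mdb:actor/29704", "movie:actor_name", "Jack Nicholson")},
}

# adoc is partial: only some URIs resolve to a document.
adoc: Dict[str, str] = {
    "mdb:film/2014": "doc_film",
    "mdb:actor/29704": "doc_actor",
}


def data_links(d: str) -> Set[str]:
    """Return every document d' such that W contains a data link from d to d'."""
    targets = set()
    for triple in data[d]:
        for term in triple:  # a URI u mentioned in some t in data(d)
            target = adoc.get(term)  # d' = adoc(u), when adoc is defined on u
            if target is not None:
                targets.add(target)
    return targets


film_links = data_links("doc_film")
actor_links = data_links("doc_actor")
```

Under the definition a document can link to itself, so `film_links` contains both documents while `actor_links` contains only `doc_actor`. Following such links at run time is the basis of the live querying approach listed earlier.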
75. RDF Query Model: SPARQL
Query Model - SPARQL Protocol and RDF Query Language
Given U (set of URIs), L (set of literals), and V (set of variables), a SPARQL expression is defined recursively:
- an atomic triple pattern, which is an element of $(U \cup V) \times (U \cup V) \times (U \cup V \cup L)$
    ?x rdfs:label "The Shining"
- P FILTER R, where P is a graph pattern expression and R is a built-in SPARQL condition (i.e., analogous to a SQL predicate)
    ?x rev:rating ?p FILTER(?p > 3.0)
- P1 AND/OPT/UNION P2, where P1 and P2 are graph pattern expressions
Example:
SELECT ?name
WHERE {
  ?m rdfs:label ?name . ?m movie:director ?d .
  ?d movie:director_name "Stanley Kubrick" .
  ?m movie:relatedBook ?b . ?b rev:rating ?r .
  FILTER(?r > 4.0)
}
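To make the recursive definition tangible, the following sketch evaluates a conjunction of triple patterns over a tiny in-memory graph and then applies the filter. It is a naive nested-loop evaluator written for illustration, not how any of the systems discussed here work; the graph contents and the helper names (`match`, `evaluate`) are assumptions.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, object]
Binding = Dict[str, object]

# Hypothetical toy graph; identifiers are illustrative only.
GRAPH: List[Triple] = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("bm:books/0743424425", "rev:rating", 4.7),
    ("mdb:film/3418", "rdfs:label", "The Passenger"),
]


def is_var(term: object) -> bool:
    """Variables are written with a leading question mark, as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or reject if it is already bound differently.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def evaluate(patterns: Iterable[Triple]) -> List[Binding]:
    """Evaluate P1 AND P2 AND ... by joining one triple pattern at a time."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for triple in GRAPH
            if (extended := match(pattern, triple, b)) is not None
        ]
    return bindings


query = [
    ("?m", "rdfs:label", "?name"),
    ("?m", "movie:director", "?d"),
    ("?d", "movie:director_name", "Stanley Kubrick"),
    ("?m", "movie:relatedBook", "?b"),
    ("?b", "rev:rating", "?r"),
]

# FILTER(?r > 4.0) is applied to the solutions of the conjunction.
names = [b["?name"] for b in evaluate(query) if b["?r"] > 4.0]
```

On this toy graph `names` holds a single answer, "The Shining".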
77. SPARQL Queries
[Figure: the example query drawn as a query graph with edges ?m -rdfs:label-> ?name, ?m -movie:director-> ?d, ?d -movie:director_name-> "Stanley Kubrick", ?m -movie:relatedBook-> ?b, ?b -rev:rating-> ?r, and the condition FILTER(?r > 4.0) attached to ?r.]
115. Naive Triple Store Design
SELECT ?name
WHERE {
  ?m rdfs:label ?name . ?m movie:director ?d .
  ?d movie:director_name "Stanley Kubrick" .
  ?m movie:relatedBook ?b . ?b rev:rating ?r .
  FILTER(?r > 4.0)
}
Subject                Property              Object
...
mdb:film/3418          rdfs:label            The Passenger
geo:2635167            gn:name               United Kingdom
geo:2635167            gn:population         62348447
geo:2635167            gn:wikipediaArticle   wp:United_Kingdom
bm:books/0743424425    dc:creator            bm:persons/Stephen+King
bm:books/0743424425    rev:rating            4.7
bm:books/0743424425    scom:hasOffer         bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng     rdfs:label            English
lexvo:iso639-3/eng     lvont:usedIn          lexvo:iso3166/CA
lexvo:iso639-3/eng     lvont:usesScript      lexvo:script/Latn
Easy to implement but too many self-joins!
SELECT T1.o
FROM T as T1, T as T2, T as T3,
     T as T4, T as T5
WHERE T1.p = 'rdfs:label'
AND T2.p = 'movie:relatedBook'
AND T3.p = 'movie:director'
AND T4.p = 'rev:rating'
AND T5.p = 'movie:director_name'
AND T1.s = T2.s
AND T1.s = T3.s
AND T2.o = T4.s
AND T3.o = T5.s
AND T4.o > 4.0
AND T5.o = 'Stanley Kubrick'
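The five-way self-join can be tried directly with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the single table T(s, p, o) follows the slide, while the handful of rows loaded into it are a hypothetical sample chosen so that the query has one answer.

```python
import sqlite3

# Hypothetical sample rows for the single triple table T(s, p, o).
TRIPLES = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("bm:books/0743424425", "rev:rating", 4.7),
    ("mdb:film/3418", "rdfs:label", "The Passenger"),
]

conn = sqlite3.connect(":memory:")
# The object column is left untyped so that numbers and strings are stored as given.
conn.execute("CREATE TABLE T (s TEXT, p TEXT, o)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?)", TRIPLES)

# One alias of T per triple pattern; join conditions mirror shared variables.
SQL = """
SELECT T1.o
FROM T AS T1, T AS T2, T AS T3, T AS T4, T AS T5
WHERE T1.p = 'rdfs:label'
  AND T2.p = 'movie:relatedBook'
  AND T3.p = 'movie:director'
  AND T4.p = 'rev:rating'
  AND T5.p = 'movie:director_name'
  AND T1.s = T2.s
  AND T1.s = T3.s
  AND T2.o = T4.s
  AND T3.o = T5.s
  AND T4.o > 4.0
  AND T5.o = 'Stanley Kubrick'
"""
rows = conn.execute(SQL).fetchall()
```

Each triple pattern costs one more alias of T, which is why the number of self-joins grows with the size of the query.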
140. Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008]
Strings are mapped to ids using a mapping table
Triples are indexed in a clustered B+ tree in lexicographic order
Create indexes for permutations of the three columns: SPO, SOP, PSO, POS, OPS, OSP
Encoded triple table (stored in a B+ tree; easy querying through the mapping table):
Subject  Property  Object
0        1         2
0        3         4
5        6         7
8        9         5
...      ...       ...
147. Exhaustive Indexing: Query Execution
Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join
Join order is easy due to extensive indexing
Subject  Property  Object
0        1         2
0        3         4
5        6         7
8        9         5
...      ...       ...
ID  Value
0   mdb:film/2014
1   rdfs:label
2   The Shining
3   movie:initial_release_date
4   1980-05-23
5   mdb:director/8476
6   movie:director_name
7   Stanley Kubrick
8   mdb:film/2685
9   movie:director
Advantages
- Eliminates some of the joins: they become range queries
- Merge join is easy and fast
Disadvantages
- Space usage
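The mapping table and the six permutation indexes can be sketched in a few lines. This is an illustration only: a sorted Python list stands in for each clustered B+ tree, and the `lookup` helper and its prefix convention are assumptions rather than the interface of RDF-3X or Hexastore.

```python
from bisect import bisect_left, bisect_right
from itertools import permutations

triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:initial_release_date", "1980-05-23"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
]

# Mapping table: each string gets an integer id in order of first appearance.
ids = {}
for triple in triples:
    for term in triple:
        ids.setdefault(term, len(ids))
encoded = [tuple(ids[term] for term in triple) for triple in triples]

# One sorted list per column order stands in for a clustered B+ tree.
POSITION = {"S": 0, "P": 1, "O": 2}
ORDERS = ["".join(p) for p in permutations("SPO")]
indexes = {
    order: sorted(tuple(t[POSITION[c]] for c in order) for t in encoded)
    for order in ORDERS
}


def lookup(order, prefix):
    """Range query: all entries of index `order` that start with `prefix`."""
    index = indexes[order]
    low = bisect_left(index, prefix)
    # Pad the prefix with +infinity to obtain the upper end of the range.
    high = bisect_right(index, prefix + (float("inf"),) * (3 - len(prefix)))
    return index[low:high]


# Triple pattern (mdb:film/2014, ?p, ?o) becomes a range scan on SPO.
about_film = lookup("SPO", (ids["mdb:film/2014"],))
# Triple pattern (?s, movie:director, ?o) becomes a range scan on POS.
directed = lookup("POS", (ids["movie:director"],))
```

The encoded triples come out as the rows of the table above, and every triple pattern with at least one constant maps to a contiguous range in one of the six orders. The price is the space usage noted under Disadvantages, since each triple is stored six times.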
164. Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea et al., 2013]
Clustered property table: group together the properties that tend to occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type of property into one property table
Film property table:
...
mdb:film/2685    The Clockwork Orange    mdb:director/8476
Actor property table:
Subject            movie:actor_name
mdb:actor/29704    Jack Nicholson
Advantages
- Fewer joins
- If the data is structured, we have a relational system, similar to normalized relations
Disadvantages
- Potentially a lot of NULLs
- Clustering is not trivial
- Multi-valued properties are complicated
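A small sketch shows where the NULLs and the multi-valued problem come from. The sample triples, the chosen column set, and the `property_table` helper are assumptions made for illustration; they are not the layout used by Jena or DB2-RDF.

```python
# Hypothetical sample triples about films.
triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "rdfs:label", "The Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:film/3418", "rdfs:label", "The Passenger"),
]
COLUMNS = ["rdfs:label", "movie:director"]


def property_table(triples, columns):
    """Build one row per subject and one column per property; None marks a NULL."""
    rows = {}
    for s, p, o in triples:
        if p in columns:
            row = rows.setdefault(s, dict.fromkeys(columns))
            if row[p] is not None:
                # A single cell cannot hold a second value for the same property.
                raise ValueError(f"multi-valued property {p!r} for {s!r}")
            row[p] = o
    return rows


table = property_table(triples, COLUMNS)
null_count = sum(value is None for row in table.values() for value in row.values())
```

A film without a director triple leaves a NULL in its row, and a second value for the same property has no cell to go into, which matches the disadvantages listed above.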
189. Binary Tables
Grouping by properties: For each property, build a two-column table, containing both subject and object, ordered by subjects [Abadi et al., 2007, 2009]
Also called vertically partitioned tables
n two-column tables (n is the number of unique properties in the data)
Example tables:
...
mdb:film/2685      The Clockwork Orange
movie:actor_name
Subject            Object
mdb:actor/29704    Jack Nicholson
Advantages
- Supports multi-valued properties
- No NULLs
- No clustering
- Read only needed attributes (i.e. less I/O)
- Good performance for subject-subject joins
Disadvantages
- Not useful for subject-object joins
- Expensive inserts
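The following sketch builds one subject-ordered binary table per property and joins two of them on the subject with a merge join. The sample triples and the `merge_join` helper are assumptions for illustration; a column store implementing this layout would of course do this inside the engine.

```python
from collections import defaultdict

# Hypothetical sample triples; one film has two actors (a multi-valued property).
triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:actor", "mdb:actor/29704"),
    ("mdb:film/2014", "movie:actor", "mdb:actor/30013"),
    ("mdb:film/2685", "rdfs:label", "The Clockwork Orange"),
    ("mdb:actor/29704", "movie:actor_name", "Jack Nicholson"),
]

# One two-column (subject, object) table per property, ordered by subject.
tables = defaultdict(list)
for s, p, o in triples:
    tables[p].append((s, o))
for rows in tables.values():
    rows.sort()


def merge_join(left, right):
    """Subject-subject join of two subject-sorted binary tables."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            subject = left[i][0]
            # Find the run of rows with this subject on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == subject:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == subject:
                j_end += 1
            out.extend(
                (subject, l_obj, r_obj)
                for _, l_obj in left[i:i_end]
                for _, r_obj in right[j:j_end]
            )
            i, j = i_end, j_end
    return out


labelled_cast = merge_join(tables["rdfs:label"], tables["movie:actor"])
```

Because both tables are already sorted by subject, the join is a single linear pass. A subject-object join gets no such help, since the right-hand table is not ordered by object.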
210. Graph-based Approach
Answering a SPARQL query is equivalent to subgraph matching
gStore [Zou et al., 2011, 2014]
[Figure: the query graph (?m, ?d, ?name, ?b, ?r, "Stanley Kubrick", with FILTER(?r > 4.0)) is matched as a subgraph of the RDF data graph, whose vertices include mdb:film/2014, mdb:film/424, mdb:actor/29704, mdb:actor/30013, geo:2635167, and bm:books/0743424425, and whose edges carry predicates such as rdfs:label, movie:actor, movie:director, movie:relatedBook, scom:hasOffer, and foaf:based_near.]
Advantages
- Maintains the graph structure
- Full set of queries can be handled
Disadvantages
- Graph pattern matching is expensive
216. gStore
General Approach:
Work directly on the RDF graph and the SPARQL query graph
Use a signature-based encoding of each entity and class vertex to speed up matching
Filter-and-evaluate
Use a false positive algorithm to prune nodes and obtain a set of candidates; then do more detailed evaluation on those
Use an index (VS-tree) over the data signature graph (has light maintenance load) for efficient pruning
217. 1. Encode Q and G to Get Signature Graphs
[Figure: query signature graph Q* and data signature graph G*. Bit strings such as 0100 0000 and 00010 label the vertices and edges of both graphs.]
218. 2. Filter-and-Evaluate
Find matches of Q* over signature graph G*
Verify each match in RDF graph G
[Figure: query signature graph Q* and data signature graph G*, with bit strings labelling the vertices and edges of both graphs.]
224. How to Generate Candidate List
Two step process:
1. For each node of Q get lists of nodes in G that include that node.
2. Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G
Both steps are inefficient
Use S-trees
Height-balanced tree over signatures
Run an inclusion query for each node of Q and get lists of nodes in G that include that node.
Given query signature q and a set of data signatures S, find all data signatures $s_i \in S$ where $q \mathbin{\&} s_i = q$
Does not support second step: expensive
VS-tree (and VS*-tree)
Multi-resolution summary graph based on S-tree
Supports both steps efficiently
Grouping by vertices
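The inclusion test behind step 1 is a single bitwise operation. The sketch below implements it with a sequential scan, which is the inefficient baseline named above; the 8-bit signatures and vertex names are hypothetical, and the S-tree or VS-tree that would prune this scan is not shown.

```python
def includes(data_sig: int, query_sig: int) -> bool:
    """True if every bit set in query_sig is also set in data_sig (q & s = q)."""
    return (query_sig & data_sig) == query_sig


def candidates(query_sig: int, data_sigs: dict) -> list:
    """Sequential-scan inclusion query over all data signatures."""
    return sorted(v for v, sig in data_sigs.items() if includes(sig, query_sig))


# Hypothetical 8-bit vertex signatures, written as bit strings.
DATA_SIGS = {
    "v1": int("00101000", 2),
    "v2": int("01000001", 2),
    "v3": int("10011000", 2),
    "v4": int("01001000", 2),
}
query_sig = int("00001000", 2)
matches = candidates(query_sig, DATA_SIGS)
```

The result is a candidate list only: a data vertex whose signature includes the query signature may still fail to match, which is why the evaluate step verifies each candidate against the RDF graph.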
240. Adaptivity to Workload
Web applications that are supported by RDF data management systems are far more varied than conventional relational applications
Data that are being handled are far more heterogeneous
SPARQL is far more flexible in how triple patterns (i.e., the atomic query unit) can be combined
An experiment [Aluç et al., 2014]
                                                  RDF-3X  VOS (6.1)  VOS (7.1)  MonetDB  4Store
% queries for which tested system is fastest        20.9        0.0       22.6     56.5     0.0
Total workload execution time (hours)               27.1       20.9       20.8     38.6    72.2
Mean (per query) execution time (seconds)            7.8        6.0        6.0     11.1    20.7
241. Adaptivity to Workload
Summary of Experiments
- No single system is a sole winner across all queries
- No single system is the sole loser across all queries, either
- There can be 2 to 5 orders of magnitude difference in the performance (i.e., query execution time) between the best and the worst system for a given query
- The winner in one query may timeout in another
- Performance difference widens as dataset size increases
248. ...fixed size, (b) contain same set of attributes
1. Workload time analysis
2. Updating the physical layout
3. Partial indexing
[Figure: components shown are the SPARQL Query Engine, an Index, a Cache, and the Storage System; the hash function used to evict from the cache adapts between times t1 and tk.]
249. chameleon-db
Prototype system [Aluç et al., 2013]
35,000 lines of code in C++ and growing
[Figure: system architecture. Components shown: Query Engine (Plan Generation, Evaluation), Structural Index, Vertex Index, Spill Index, Cluster Index, Storage System, Storage Advisor.]
250. Some Open Problems
Scalability of the solutions to very large datasets
Maintenance of auxiliary data structures in dynamic environments
Adaptive systems to handle varying and time-changing workloads
Uncertain RDF data processing
Keyword search over RDF data
Query processing over incomplete RDF data
256. Remember the Environment
Distributed environment
Some of the data sites can process SPARQL queries: SPARQL endpoints
Not all data sites can process queries
Alternatives
Data re-distribution + query decomposition
SPARQL federation: just process at SPARQL endpoints
Live querying (see next section)
260. Distributed RDF Processing [Kaoudi and Manolescu, 2014]
Data partitioning approaches
RDF data warehouse is partitioned and distributed
RDF data D = fD1; : : : ;Dng
Allocate each Di to a site
Partitioning alternatives
Table-based (e.g., [Husain et al., 2011])
Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013])
Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013])
SPARQL query decomposed $Q = \{Q_1, \ldots, Q_k\}$
Distributed execution of $\{Q_1, \ldots, Q_k\}$ over $\{D_1, \ldots, D_n\}$
- High performance
- Great for parallelizing centralized RDF data
- May not be possible to re-partition and re-allocate Web data (i.e., LOD)
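To make the partition-then-execute idea concrete, here is a minimal sketch in Python. It assumes plain hash partitioning on the subject, which is simpler than the table-, graph-, and unit-based schemes cited above, and it models each site as an in-memory list. The function names and the `ex:` identifiers are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_by_subject(triples, n_sites):
    """Hash-partition triples so all triples sharing a subject land on one site."""
    sites = defaultdict(list)
    for s, p, o in triples:
        # crc32 is used because it is stable across runs (unlike built-in hash()).
        sites[zlib.crc32(s.encode("utf-8")) % n_sites].append((s, p, o))
    return [sites[i] for i in range(n_sites)]


def match_pattern(fragment, pattern):
    """Evaluate one triple pattern (None = variable) against one fragment D_i."""
    return [t for t in fragment
            if all(q is None or q == v for q, v in zip(pattern, t))]


def evaluate_everywhere(fragments, pattern):
    """Run the same subquery Q_j at every site and union the local answers."""
    results = []
    for fragment in fragments:
        results.extend(match_pattern(fragment, pattern))
    return results


triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:worksAt", "ex:uw"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:carol", "ex:worksAt", "ex:pku"),
]
fragments = partition_by_subject(triples, n_sites=3)
knows = evaluate_everywhere(fragments, (None, "ex:knows", None))
```

A single triple pattern can be answered by a plain union of local results. Joins across patterns are where the choice of partitioning scheme starts to matter, since matches may span fragments.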
263. Distributed RDF Processing - 2
Data summary-based approaches
Build summaries (index) for the distributed RDF datasets (e.g., [Atre et al., 2010; Prasser et al., 2012])
SPARQL query $Q = \{Q_1, \ldots, Q_k\}$
Distributed execution of $\{Q_1, \ldots, Q_k\}$ using the data summary
- No data re-partitioning and re-allocation
- Have to scan the data at each site
- Index over distributed data with maintenance concerns
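One way to picture a data summary is a small index from predicates to the sites that use them, consulted before a subquery is shipped anywhere. The sketch below is an illustration only: the predicate-level granularity, the function names, and the site names are assumptions, not the structures used in the cited systems.

```python
from collections import defaultdict


def build_summary(sites):
    """Map each predicate to the set of site names whose data uses it."""
    summary = defaultdict(set)
    for name, triples in sites.items():
        for _s, p, _o in triples:
            summary[p].add(name)
    return summary


def relevant_sites(summary, pattern, all_sites):
    """Pick sites for one triple pattern; an unbound predicate must go everywhere."""
    predicate = pattern[1]
    if predicate is None:
        return set(all_sites)
    return set(summary.get(predicate, set()))


sites = {
    "site1": [("ex:alice", "ex:knows", "ex:bob")],
    "site2": [("ex:bob", "ex:worksAt", "ex:uw")],
    "site3": [("ex:carol", "ex:knows", "ex:alice")],
}
summary = build_summary(sites)
targets = relevant_sites(summary, (None, "ex:knows", None), sites)
```

The summary has to be rebuilt or patched whenever a site's data changes, which is the maintenance concern listed above.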
266. SPARQL Endpoint Federation
Consider only the SPARQL endpoints for query execution
No data re-partitioning/re-distribution
Consider $D = D_1 \cup D_2 \cup \ldots \cup D_n$, where each $D_i$ is a SPARQL endpoint
Alternatives
SPARQL query decomposed $Q = \{Q_1, \ldots, Q_k\}$ and executed over $\{D_1, \ldots, D_n\}$: DARQ, FedX [Schwarte et al., 2011], SPLENDID [Görlitz and Staab, 2011], ANAPSID [Acosta et al., 2011]
Partial query evaluation: Distributed gStore [Peng et al., 2014]
Partial evaluation
- Given function $f(s, d)$ and part of its input $s$, perform the part of $f$'s computation that depends only on $s$ to get $f'(d)$
- Compute $f'(d)$ when $d$ becomes available
- Applied to, e.g., XML [Buneman et al., 2006]
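Partial evaluation in general can be sketched with a closure. In the Python example below, $f(s, d)$ matches a triple pattern $s$ against data $d$, and `specialize` does the work that depends only on $s$ up front, returning the residual function $f'(d)$. Taking the pattern as the known input is purely an illustrative choice, and all names are hypothetical.

```python
def match(pattern, data):
    """f(s, d): evaluate triple pattern s (None = variable) over triples d."""
    return [t for t in data
            if all(q is None or q == v for q, v in zip(pattern, t))]


def specialize(pattern):
    """Do the work that depends only on s once, and return f'(d)."""
    # Precompute which positions are bound; this never needs to see the data.
    bound = [(i, q) for i, q in enumerate(pattern) if q is not None]

    def residual(data):
        # f'(d): only the data-dependent part of the computation remains.
        return [t for t in data if all(t[i] == q for i, q in bound)]

    return residual


pattern = (None, "ex:knows", None)
knows = specialize(pattern)  # f' is ready before any data arrives

fragment = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:worksAt", "ex:uw"),
]
local_matches = knows(fragment)
```

The residual function gives the same answer as calling `match(pattern, fragment)` directly; the only difference is when the pattern-dependent work happens.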
269. Distributed SPARQL Using Partial Query Evaluation
Two steps:
1. Evaluate a query at each site to find local matches
Query is the function and each $D_i$ is the known input
Inner match or local partial match
2. Assemble the partial matches to get the final result
Crossing match
Centralized assembly
Distributed assembly
[Figure: fragments D1, D2, D3, D4 with a crossing match]
275. Live Query Processing
Not all data resides at SPARQL endpoints
Freshness of access to data important
Potentially countably infinite data sources
Live querying
On-line execution
Only rely on linked data principles
Alternatives
Traversal-based approaches
Index-based approaches
Hybrid approaches
278. SPARQL Query Semantics in Live Querying
Full-web semantics
Scope of evaluating a SPARQL expression is all Linked Data
Query result completeness cannot be guaranteed by any (terminating) execution
Reachability-based query semantics
Query consists of a SPARQL expression, a set of seed URIs $S$, and a reachability condition $c$
Scope: all data along paths of data links that satisfy the condition
Computationally feasible
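The scope defined by reachability-based semantics can be sketched as a breadth-first traversal from the seed URIs. The Python example below is a toy under stated assumptions: the `web` dictionary stands in for dereferencing URIs over HTTP, every term of a retrieved triple is treated as a candidate link, and the condition $c$ is modelled as a callable taking the triple and the candidate URI. All names are hypothetical.

```python
from collections import deque


def reachable_triples(web, seeds, condition):
    """Collect triples from every document reachable from the seed URIs.

    web: dict mapping a URI to the list of triples in its document
         (a stand-in for an HTTP lookup of that URI).
    condition: the reachability condition c, called as condition(triple, uri)
         to decide whether the link to `uri` may be followed.
    """
    seen = set(seeds)
    queue = deque(seeds)
    collected = []
    while queue:
        uri = queue.popleft()
        for triple in web.get(uri, []):  # unknown URIs yield no data
            collected.append(triple)
            for term in triple:
                if term not in seen and condition(triple, term):
                    seen.add(term)
                    queue.append(term)
    return collected


def follow_knows(triple, uri):
    """Example condition: only follow links found in ex:knows triples."""
    return triple[1] == "ex:knows"


web = {
    "ex:alice": [("ex:alice", "ex:knows", "ex:bob")],
    "ex:bob": [("ex:bob", "ex:knows", "ex:carol"),
               ("ex:bob", "ex:worksAt", "ex:uw")],
    "ex:carol": [("ex:carol", "ex:name", "Carol")],
    "ex:uw": [("ex:uw", "ex:locatedIn", "ex:waterloo")],
}
data = reachable_triples(web, ["ex:alice"], follow_knows)
```

With `follow_knows`, the document for `ex:uw` is never fetched, so its triple stays outside the query scope. A condition that always returns `True` would pull it in. The SPARQL expression would then be evaluated over the collected triples.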
284. Traversal Approaches
Traverse (specific) data links at query execution runtime [Hartig, 2013; Ladwig and Tran, 2011]
Implements reachability-based query semantics
Start from a set of seed URIs
Recursively follow and discover new URIs
Important issue is selection of seed URIs
Retrieved data serves to discover new URIs and to construct result
Advantages
Easy to implement.
No data structure to maintain.
Disadvantages
Possibilities for parallelized data retrieval are limited
Repeated data retrieval introduces significant [...]
286. Traversal Optimization
Dynamic query execution [Hartig and Özsu, 2014]
[Figure: data retrieval fed by a lookup queue, producing output]
287. Traversal Optimization
Dynamic query execution [Hartig and Özsu, 2014]
Prioritization of URIs: a number of alternatives
Non-adaptive
Adaptive, local processing aware
Adaptive, local processing agnostic
Intermediate solution driven
Solution-aware graph-based
Hybrid graph-based
Purely graph-based
290. Index Approaches
Use pre-populated index to determine relevant URIs (and to avoid as many irrelevant ones as possible)
Different index keys possible; e.g., triple patterns [Umbrich et al., 2011]
Index entries: a set of URIs
Indexed URIs may appear multiple times (i.e., associated with multiple index keys)
Each URI in such an entry may be paired with a cardinality (utilized for source ranking)
Key: $tp$; Entry: $\{uri_1, uri_2, \ldots, uri_n\}$; GET $uri_i$
Advantages
Data retrieval can be fully parallelized
Reduces the impact of data retrieval on query execution time
Disadvantages
Querying can only start after index construction
Depends on what has been selected for the index
Freshness may be an issue
Index maintenance
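A minimal sketch of index-based source selection in Python follows. It assumes the index is keyed by triple patterns and that each entry pairs a URI with a cardinality. Ranking sources by summed cardinality is an assumption made for illustration, and the function name and example URIs are hypothetical.

```python
def select_sources(index, patterns):
    """Return URIs relevant to any triple pattern, highest total cardinality first."""
    scores = {}
    for tp in patterns:
        for uri, cardinality in index.get(tp, {}).items():
            # A URI listed under several keys accumulates its cardinalities.
            scores[uri] = scores.get(uri, 0) + cardinality
    # Sort by descending score; break ties by URI for a stable order.
    return sorted(scores, key=lambda uri: (-scores[uri], uri))


index = {
    (None, "ex:knows", None): {
        "http://example.org/alice": 3,
        "http://example.org/bob": 1,
    },
    (None, "ex:worksAt", None): {
        "http://example.org/bob": 4,
    },
}
query = [(None, "ex:knows", None), (None, "ex:worksAt", None)]
ranked = select_sources(index, query)
```

Because the full list of URIs is known before any lookup starts, the GET requests for `ranked` could all be issued in parallel, which is the advantage noted above.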
291. Hybrid Approach
Perform a traversal-based execution using a prioritized list of URIs to look up [Ladwig and Tran, 2010]
Initial seed from the pre-populated index
Non-seed URIs are ranked by a function based on information in the index
Newly discovered URIs that are not in the index are ranked according to number of referring documents
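The prioritized lookup list can be sketched with a priority queue. The Python example below is a simplification under stated assumptions, not the ranking function of [Ladwig and Tran, 2010]: `web` maps a URI to the URIs its document links to, indexed URIs are ranked by a precomputed index score, and URIs missing from the index are ranked by the number of fetched documents that refer to them, on the same numeric scale. All names are hypothetical.

```python
import heapq


def hybrid_traversal(web, index_scores, seeds, max_lookups):
    """Traverse with a priority queue and return URIs in lookup order."""
    referrers = {}  # URI -> set of fetched documents linking to it
    fetched = []
    done = set()
    # heapq is a min-heap, so scores are negated to pop the best URI first.
    heap = [(-index_scores.get(u, 0), u) for u in seeds]
    heapq.heapify(heap)
    while heap and len(fetched) < max_lookups:
        _, uri = heapq.heappop(heap)
        if uri in done:
            continue  # stale entry left over from an earlier, lower score
        done.add(uri)
        fetched.append(uri)
        for link in web.get(uri, []):
            referrers.setdefault(link, set()).add(uri)
            # Indexed URIs use the index score, others the referrer count.
            score = index_scores.get(link, len(referrers[link]))
            heapq.heappush(heap, (-score, link))
    return fetched


web = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
index_scores = {"a": 5, "c": 3}
order = hybrid_traversal(web, index_scores, seeds=["a"], max_lookups=4)
```

Here `c` is looked up before `b` because the index knows about it, while `d` is only reached through links found at runtime.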
292. Some Open Problems
Optimize queries by using statistics collected during earlier query executions
Heterogeneous use of vocabularies (use of ontologies)
Combine live querying with SPARQL federation to leverage SPARQL endpoint functionality
295. Conclusions
RDF and Linked Open Data seem to have considerable promise for Web data management
More work needs to be done
Query semantics
Adaptive system design
Optimizations, both in data warehousing and distributed environments
Live querying requires significant [...]
297. Conclusions
What I did not talk about:
Not much on general distributed/parallel processing
Not much on SPARQL semantics
Nothing about RDFS: no schema stuff
Nothing about entailment regimes 0 $\Rightarrow$ no reasoning
299. References I
Abadi, D. J., Marcus, A., Madden, S., and Hollenbach, K. (2009). SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J., 18(2):385-406.
Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. (2007). Scalable semantic web data management using vertical partitioning. In Proc. 33rd Int. Conf. on Very Large Data Bases, pages 411-422.
Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. (1997). The Lorel query language for semistructured data. Int. J. Digit. Libr., 1(1):68-88.
Acosta, M., Vidal, M.-E., Lampo, T., Castillo, J., and Ruckhaus, E. (2011). ANAPSID: an adaptive query processing engine for SPARQL endpoints. In Proc. 10th Int. Semantic Web Conf., pages 18-34.
Aluç, G., Hartig, O., Özsu, M. T., and Daudjee, K. (2014). Diversified stress testing of RDF data management systems. In Proc. 13th Int. Semantic Web Conf. Forthcoming.
Aluç, G., Özsu, M. T., Daudjee, K., and Hartig, O. (2013). chameleon-db: a workload-aware robust RDF data management system. Technical Report CS-2013-10, University of Waterloo.
301. References II
Arocena, G. and Mendelzon, A. (1998). WebOQL: Restructuring documents, databases and webs. In Proc. 14th Int. Conf. on Data Engineering, pages 24-33.
Atre, M., Chaoji, V., Zaki, M. J., and Hendler, J. A. (2010). Matrix bit loaded: A scalable lightweight join query processor for RDF data. In Proc. 19th Int. World Wide Web Conf., pages 41-50.
Bornea, M. A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O., and Bhattacharjee, B. (2013). Building an efficient RDF store over a relational database. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 121-132.
Buneman, P., Cong, G., Fan, W., and Kementsietsidis, A. (2006). Using partial evaluation in distributed query evaluation. In Proc. 32nd Int. Conf. on Very Large Data Bases, pages 211-222.
Buneman, P., Davidson, S., Hillebrand, G. G., and Suciu, D. (1996). A query language and optimization techniques for unstructured data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 505-516.
Fernandez, M., Florescu, D., and Levy, A. (1997). A query language for a web-site management system. ACM SIGMOD Rec., 26(3):4-11.
302. References III
Görlitz, O. and Staab, S. (2011). SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In Proc. 2nd Int. Workshop on Consuming Linked Data.
Gurajada, S., Seufert, S., Miliaraki, I., and Theobald, M. (2014). TriAD: A distributed shared-nothing RDF engine based on asynchronous message passing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 289-300.
Hartig, O. (2012). SPARQL for a web of linked data: Semantics and computability. In Proc. 9th Extended Semantic Web Conf., pages 8-23.
Hartig, O. (2013). SQUIN: a traversal based query execution system for the web of linked data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1081-1084.
Hartig, O. and Özsu, M. T. (2014). Optimizing response time of traversal-based query optimization. In preparation.
Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. Proc. VLDB Endowment, 4(11):1123-1134.
303. References IV
Husain, M. F., McGlothlin, J., Masud, M. M., Khan, L. R., and Thuraisingham, B. (2011). Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Trans. Knowl. and Data Eng., 23(9):1312-1327.
Kaoudi, Z. and Manolescu, I. (2014). RDF in the clouds: A survey. VLDB J. Forthcoming.
Konopnicki, D. and Shmueli, O. (1995). W3QS: A query system for the World Wide Web. In Proc. 21st Int. Conf. on Very Large Data Bases, pages 54-65.
Ladwig, G. and Tran, T. (2010). Linked data query processing strategies. In Proc. 9th Int. Semantic Web Conf., pages 453-469.
Ladwig, G. and Tran, T. (2011). SIHJoin: Querying remote and local linked data. In Proc. 8th Extended Semantic Web Conf., pages 139-153.
Lakshmanan, L. V. S., Sadri, F., and Subramanian, I. N. (1996). A declarative language for querying and restructuring the Web. In Proc. 6th Int. Workshop on Research Issues on Data Eng., pages 12-21.
Lee, K. and Liu, L. (2013). Scaling queries over big RDF graphs with semantic hash partitioning. Proc. VLDB Endowment, 6(14):1894-1905.
304. References V
Mendelzon, A. O., Mihaila, G. A., and Milo, T. (1997). Querying the World Wide Web. Int. J. Digit. Libr., 1(1):54-67.
Neumann, T. and Weikum, G. (2008). RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endowment, 1(1):647-659.
Neumann, T. and Weikum, G. (2009). The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91-113.
Papakonstantinou, Y., Garcia-Molina, H., and Widom, J. (1995). Object exchange across heterogeneous information sources. In Proc. 11th Int. Conf. on Data Engineering, pages 251-260.
Peng, P., Zou, L., Özsu, M. T., Chen, L., and Zhao, D. (2014). Processing SPARQL queries over linked data: a distributed graph-based approach. Submitted for publication.
Prasser, F., Kemper, A., and Kuhn, K. A. (2012). Efficient distributed query processing for autonomous RDF databases. In Proc. 15th Int. Conf. on Extending Database Technology, pages 372-383.
Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (2011). FedX: A federation layer for distributed query processing on linked open data. In Proc. 8th Extended Semantic Web Conf., pages 481-486.
305. References VI
Umbrich, J., Hose, K., Karnstedt, M., Harth, A., and Polleres, A. (2011). Comparing data summaries for processing live queries over linked data. World Wide Web J., 14(5-6):495-544.
Weiss, C., Karras, P., and Bernstein, A. (2008). Hexastore: sextuple indexing for semantic web data management. Proc. VLDB Endowment, 1(1):1008-1019.
Wilkinson, K. (2006). Jena property table implementation. Technical Report HPL-2006-140, HP Laboratories Palo Alto.
Zhang, X., Chen, L., Tong, Y., and Wang, M. (2013). EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud. In Proc. 29th Int. Conf. on Data Engineering, pages 565-576.
Zou, L., Mo, J., Chen, L., Özsu, M. T., and Zhao, D. (2011). gStore: answering SPARQL queries via subgraph matching. Proc. VLDB Endowment, 4(8):482-493.
Zou, L., Özsu, M. T., Chen, L., Shen, X., Huang, R., and Zhao, D. (2014). gStore: A graph-based SPARQL query engine. VLDB J., 23(4):565-590.