TRIPLESTORE AND SPARQL
Lino Valdivia Jr
04.06.2013
OUTLINE
The Semantic Web
RDF
SPARQL
Triplestores
Apache Jena
DBPedia
Conclusions
Demo1: Apache Jena API
Demo2: DBPedia
THE SEMANTIC WEB
Most of the data on the web is unstructured or semi-structured:
• HTML documents
• Multimedia: images, video streams, audio files
• Meant to be read and processed by humans
What if we could structure this “Web of Documents” and add metadata to it, so that machines can understand it?
• Metadata → meaning, or semantics
• Machines can perform new tasks that used to require human intervention
This is the motivation behind the Semantic Web!
• The term “Semantic Web” was coined by Tim Berners-Lee: “a web of data that can be processed directly and indirectly by machines.”
THE SEMANTIC WEB
“The Semantic Web is a web of data…[it] provides a common
framework that allows data to be shared and reused across
application, enterprise, and community boundaries.”
[w3.org]
For the Semantic Web to happen, we need:
1. A way to structure and link data in a standardized way
2. A way to describe the relationships between these data in a common way
3. A way to query that linked data
4. A way to infer something from that linked data (by applying a set of rules)
This talk focuses only on #1 and #3.
RDF: A WAY TO STRUCTURE AND LINK DATA
RDF = Resource Description Framework, a standard way for applications to represent information that can then be shared and processed
A resource can be anything that is identifiable: a user, a coffee cup, a picture of your cat, a bank statement
RDF provides a way to model data by breaking it down into three components:
• The subject
• The object
• The predicate (aka the property)
RDF AS A GRAPH
Consider the following statement: Jordi lives in Barcelona
• Subject: Jordi
• Object: Barcelona
• Predicate: lives-in (or, to be more precise, address-city)
RDF statements are typically represented as a labeled directed graph:
• The arrow points from the subject to the object
[Figure: Jordi --address-city--> Barcelona]
RDF AND URIS
Resources must be identifiable, so RDF identifies them with Uniform Resource Identifier (URI) references.
E.g. Jordi = http://example.org/Jordi
URIs are not the same as URLs: a URI names a resource, but it does not have to point to a retrievable web page.
RDF graphs are typically shown with the URIs for the subject, object, and predicate:
[Figure: http://.../Jordi --http://.../address-city--> http://.../Barcelona]
The RDF graph can also be written as text:
<http://example.org/Jordi> <http://example.org/address-city> <http://example.org/Barcelona> .
As you may have guessed, RDF is more machine-friendly than human-friendly!
RDF: RESOURCES AND LITERALS
The object of a triple in RDF can be either a resource (identified by a URI) or a literal (a value such as a string or a number):
[Figure: http://.../Jordi --http://.../address-city--> http://.../Barcelona; http://.../Jordi --http://.../firstname--> "Jordi"; http://.../Jordi --http://.../age--> 37]
We can represent this RDF graph as text:
<http://example.org/Jordi> <http://example.org/address-city> <http://example.org/Barcelona> .
<http://example.org/Jordi> <http://example.org/firstname> "Jordi" .
<http://example.org/Jordi> <http://example.org/age> 37 .
This textual representation is known as Terse RDF Triple Language, or Turtle for short. Note that the age is written as the number 37 rather than the string "37", so that it can be compared numerically in queries.
RDF: PREFIXES
Prefixes can be used to simplify representations, either in graphs:
prefix ex: http://example.org/
[Figure: ex:Jordi --ex:address-city--> ex:Barcelona; ex:Jordi --ex:firstname--> "Jordi"; ex:Jordi --ex:age--> 37]
or in Turtle:
@prefix ex: <http://example.org/> .
ex:Jordi ex:address-city ex:Barcelona .
ex:Jordi ex:firstname "Jordi" .
ex:Jordi ex:age 37 .
Now that we have a way to structure and link our data, we want to be able to query it for information.
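As a rough illustration of what a prefix declaration does, the sketch below expands a prefixed name such as ex:Jordi back into its full URI. The PrefixExpander class and its expand method are hypothetical names invented for this example; they are not part of Jena or any other RDF library, and real parsers handle many more cases (escapes, default prefixes, relative URIs).

```java
import java.util.Map;

// Hypothetical helper, not a real library API: expands a prefixed name
// such as "ex:Jordi" into a full URI using a prefix table.
class PrefixExpander {
    private final Map<String, String> prefixes;

    PrefixExpander(Map<String, String> prefixes) {
        this.prefixes = prefixes;
    }

    String expand(String prefixedName) {
        int colon = prefixedName.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Not a prefixed name: " + prefixedName);
        }
        String prefix = prefixedName.substring(0, colon);
        String localName = prefixedName.substring(colon + 1);
        String namespace = prefixes.get(prefix);
        if (namespace == null) {
            throw new IllegalArgumentException("Unknown prefix: " + prefix);
        }
        // The full URI is simply the namespace followed by the local name.
        return namespace + localName;
    }

    public static void main(String[] args) {
        PrefixExpander expander = new PrefixExpander(Map.of("ex", "http://example.org/"));
        System.out.println(expander.expand("ex:Jordi"));
        System.out.println(expander.expand("ex:address-city"));
    }
}
```

The trailing slash in the namespace matters: without it, ex:Jordi would expand to http://example.orgJordi.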
SPARQL: A WAY TO QUERY LINKED DATA
SPARQL = SPARQL Protocol and RDF Query Language
SPARQL 1.1 became a W3C Recommendation on 21 March 2013!
Example: given our RDF graph, show the first names of all users who live in Barcelona:
PREFIX ex: <http://example.org/>
SELECT ?fname
FROM <users.rdf>
WHERE {
  ?user ex:address-city ex:Barcelona .
  ?user ex:firstname ?fname .
}
SPARQL AND GRAPH PATTERNS
The statements in the WHERE clause form a graph pattern, which is matched against subgraphs of the RDF graph to form the solution.
PREFIX ex: <http://example.org/>
SELECT ?fname
FROM <users.rdf>
WHERE {
  ?user ex:address-city ex:Barcelona .
  ?user ex:firstname ?fname .
}
[Figure: ex:Jordi --ex:address-city--> ex:Barcelona; ex:Jordi --ex:firstname--> "Jordi"; ex:Jordi --ex:age--> 37; ex:Josep --ex:address-city--> ex:Badalona]
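To make the idea of matching a graph pattern against subgraphs more concrete, here is a deliberately naive sketch of a matcher over an in-memory list of triples. It is an illustration only: the PatternMatcher class, its Triple type, and the convention that terms starting with "?" are variables are all assumptions made for this example, and real SPARQL engines rely on indexes and query planning instead of nested loops.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a naive graph-pattern matcher, not how a
// production SPARQL engine is implemented.
class PatternMatcher {

    // A triple or a triple pattern; terms starting with "?" are variables.
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    // Returns every variable binding under which all patterns occur in the data.
    static List<Map<String, String>> match(List<Triple> patterns, List<Triple> data) {
        List<Map<String, String>> solutions = new ArrayList<>();
        solutions.add(new HashMap<>()); // start from a single empty binding
        for (Triple pattern : patterns) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> binding : solutions) {
                for (Triple t : data) {
                    Map<String, String> b = new HashMap<>(binding);
                    if (unify(pattern.s, t.s, b) && unify(pattern.p, t.p, b)
                            && unify(pattern.o, t.o, b)) {
                        next.add(b); // this triple extends the partial solution
                    }
                }
            }
            solutions = next;
        }
        return solutions;
    }

    // Binds an unbound variable, or checks a constant or bound variable for equality.
    private static boolean unify(String term, String value, Map<String, String> b) {
        if (!term.startsWith("?")) return term.equals(value);
        String bound = b.putIfAbsent(term, value);
        return bound == null || bound.equals(value);
    }

    public static void main(String[] args) {
        List<Triple> data = List.of(
                new Triple("ex:Jordi", "ex:address-city", "ex:Barcelona"),
                new Triple("ex:Jordi", "ex:firstname", "\"Jordi\""),
                new Triple("ex:Josep", "ex:address-city", "ex:Badalona"));
        List<Triple> where = List.of(
                new Triple("?user", "ex:address-city", "ex:Barcelona"),
                new Triple("?user", "ex:firstname", "?fname"));
        for (Map<String, String> solution : match(where, data)) {
            System.out.println(solution.get("?fname"));
        }
    }
}
```

Because ?user appears in both patterns, only bindings where the same resource satisfies both triples survive, which is why ex:Josep (who lives in Badalona) does not appear in the result.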
SPARQL: THE SELECT OPERATION
SPARQL SELECT operations also support:
FILTER, ORDER BY, LIMIT, and OFFSET:
Show the names of users who live in Barcelona and are less than 40 years old, sorted by last name, from the 11th to the 40th user:
PREFIX ex: <http://example.org/>
SELECT ?lname ?fname
FROM <users.rdf>
WHERE {
  ?user ex:address-city ex:Barcelona .
  ?user ex:firstname ?fname .
  ?user ex:lastname ?lname .
  ?user ex:age ?age .
  FILTER (?age < 40)
}
ORDER BY ?lname
LIMIT 30
OFFSET 10
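For readers more familiar with Java collections than with SPARQL, the solution modifiers above behave much like a stream pipeline: filter, then sort, then skip, then limit. The sketch below is only an analogy under that assumption; the SolutionModifiers and User names and the sample data are made up for illustration and say nothing about how a SPARQL engine evaluates the query internally.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Analogy only: FILTER / ORDER BY / OFFSET / LIMIT expressed on a Java list.
class SolutionModifiers {

    // Hypothetical row type standing in for one query solution.
    static final class User {
        final String lastName;
        final String firstName;
        final int age;
        User(String lastName, String firstName, int age) {
            this.lastName = lastName;
            this.firstName = firstName;
            this.age = age;
        }
    }

    static List<User> page(List<User> users, int maxAge, int offset, int limit) {
        return users.stream()
                .filter(u -> u.age < maxAge)                           // FILTER (?age < 40)
                .sorted(Comparator.comparing((User u) -> u.lastName))  // ORDER BY ?lname
                .skip(offset)                                          // OFFSET
                .limit(limit)                                          // LIMIT
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data for the demonstration.
        List<User> users = List.of(
                new User("Puig", "Jordi", 37),
                new User("Amat", "Anna", 45),
                new User("Bosch", "Marta", 29),
                new User("Coll", "Pau", 31));
        for (User u : page(users, 40, 1, 2)) {
            System.out.println(u.lastName + ", " + u.firstName);
        }
    }
}
```

The order of the steps matters: the offset and limit are applied to the sorted sequence, which is what makes paging through results repeatable.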
SPARQL: THE SELECT OPERATION
SPARQL SELECT operations also support:
Alternative matches using UNION, for cases where resources in the expected result set may match any one of several patterns:
Show the first names of users who live in Barcelona or in Badalona:
PREFIX ex: <http://example.org/>
SELECT ?fname
FROM <users.rdf>
WHERE {
  ?user ex:firstname ?fname .
  {
    { ?user ex:address-city ex:Barcelona . }
    UNION
    { ?user ex:address-city ex:Badalona . }
  }
}
SPARQL: THE SELECT OPERATION
SPARQL SELECT operations also support:
OPTIONAL matches, for cases where the resources in the expected result set do not all have to match a pattern:
Show the first names of users who live in Barcelona and their profile picture, if they have one:
PREFIX ex: <http://example.org/>
SELECT ?fname ?ppic
FROM <users.rdf>
WHERE {
  ?user ex:address-city ex:Barcelona .
  ?user ex:firstname ?fname .
  OPTIONAL {
    ?user ex:ppic ?ppic .
  }
}
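One way to think about OPTIONAL is as a left join: every solution of the required pattern is kept, and the optional variable is simply left unbound when nothing matches. The sketch below illustrates that reading with two plain maps. The OptionalMatch class, the sample users, and the picture URL are hypothetical, and the sketch assumes at most one picture per user, which real data does not guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration of OPTIONAL as a left join; not a real SPARQL implementation.
class OptionalMatch {

    // firstNames: user -> first name (required pattern)
    // pictures:   user -> profile picture (optional pattern, may lack entries)
    // Each result row is { firstName, pictureOrNull }.
    static List<String[]> leftJoin(Map<String, String> firstNames,
                                   Map<String, String> pictures) {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, String> entry : firstNames.entrySet()) {
            String picture = pictures.get(entry.getKey()); // null means "unbound"
            rows.add(new String[] { entry.getValue(), picture });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample data: only Jordi has a profile picture.
        Map<String, String> firstNames = new LinkedHashMap<>();
        firstNames.put("ex:Jordi", "Jordi");
        firstNames.put("ex:Marta", "Marta");
        Map<String, String> pictures = Map.of("ex:Jordi", "http://example.org/jordi.png");

        for (String[] row : leftJoin(firstNames, pictures)) {
            System.out.println(row[0] + " -> " + (row[1] == null ? "(no picture)" : row[1]));
        }
    }
}
```

Without OPTIONAL, the second user would be dropped from the result entirely, because the picture pattern would then be required.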
SPARQL: THE SELECT OPERATION
SPARQL SELECT operations also support:
• Set inclusion (IN/NOT IN)
• GROUP BY, HAVING, and aggregate functions such as COUNT and AVG (new in SPARQL 1.1)
• Subqueries (new in SPARQL 1.1)
SPARQL: OTHER OPERATIONS
Aside from SELECT for querying, SPARQL also has:
• CONSTRUCT: creates a single RDF graph from the result of a query by combining (i.e. taking the set union of) the resulting triples
• ASK: returns a Boolean that indicates whether the query pattern has at least one match
• DESCRIBE: returns an RDF graph that describes the matching resources (what it contains is determined by the query service)
• INSERT/DELETE: add or remove triples from the graph (new in SPARQL 1.1 Update)
• Graph management operations (CREATE, DROP, COPY, MOVE, ADD) (new in SPARQL 1.1 Update)
TRIPLESTORES
The statements in an RDF graph (subject-predicate-object) are also known as triples, and the specialized databases used for storing them are called triplestores.
Triplestores vs graph databases: what's the difference?
• Triplestores are designed specifically to store RDF graphs, which are labeled directed graphs
• Graph databases, on the other hand, can store any kind of graph (unlabeled, undirected, weighted, etc.)
• Graph databases don't have a standard query language (Cypher?), whereas triplestores must support SPARQL
• Triplestores are optimized for graph pattern matching, and may lack the full capabilities of graph databases
• But graph databases can be used to implement a triplestore
(see Sequeda, J. (2013, January 31), Introduction to Triplestores)
SPARQL AND CYPHER
SPARQL:
PREFIX ex: <http://example.org/>
SELECT ?fname
FROM <users.rdf>
WHERE {
  ?user ex:address-city ex:Barcelona .
  ?user ex:firstname ?fname .
}
Cypher (assuming each RDF node is stored as a graph node with a uri property):
MATCH (user)-[:ex_firstname]->(fname),
      (user)-[:`ex_address-city`]->(city)
WHERE city.uri = "ex:Barcelona"
RETURN fname
[Figure: ex:Jordi --ex:address-city--> ex:Barcelona; ex:Jordi --ex:firstname--> "Jordi"; ex:Jordi --ex:age--> 37]
TRIPLESTORE IMPLEMENTATIONS
Native triplestores
• Sesame
• BigData
• Meronymy
• Apache Jena TDB
Graph DB-based
• AllegroGraph
• Oracle Spatial and Graph (formerly Oracle Semantic Technologies)
Relational DB-based
• Apache Jena SDB
• IBM DB2
APACHE JENA
Born in HP Labs in 2000, became a top-level Apache project in April 2012
The Jena Framework includes:
• A Java API for working with RDF models
• A SPARQL query processor
• An efficient disk-based native triplestore
• A rule-based inference engine that can be used with RDF-based ontologies
• A server for accepting SPARQL queries over HTTP (a SPARQL endpoint)
APACHE JENA: RDF API
The Statement interface represents a triple, while the Model interface represents the whole RDF graph.
Given a Statement, one can invoke
• getSubject(), which returns a Resource
• getPredicate(), which returns a Property
• getObject(), which returns an RDFNode (which can be a Resource or a Literal)
To create our basic example RDF graph:
import org.apache.jena.rdf.model.*; // Jena 3 and later; Jena 2.x used com.hp.hpl.jena.rdf.model

Model model = ModelFactory.createDefaultModel();
Resource j = model.createResource("http://example.org/Jordi");
Resource bcn = model.createResource("http://example.org/Barcelona");
// createProperty takes the namespace URI and the local name.
Property addrCity = model.createProperty("http://example.org/", "address-city");
// This automatically creates a Statement in the associated model.
j.addProperty(addrCity, bcn);
APACHE JENA: ARQ API
Jena also provides an API called ARQ for programmatically executing SPARQL queries.
To execute a given query on our example graph:
import org.apache.jena.query.*; // Jena 3 and later; Jena 2.x used com.hp.hpl.jena.query
import org.apache.jena.rdf.model.RDFNode;

String queryString = "...";
Query query = QueryFactory.create(queryString);
// Associate a query execution context with our model.
QueryExecution qe = QueryExecutionFactory.create(query, model);
try {
    ResultSet rs = qe.execSelect();
    // ResultSet acts like an Iterator.
    while (rs.hasNext()) {
        QuerySolution qs = rs.nextSolution();
        RDFNode r = qs.get("fname"); // You can get a variable by name.
        // Do what you want with it.
    }
} finally {
    // Always close the query execution when done, even if an exception is thrown.
    qe.close();
}
APACHE JENA: TDB
Jena’s native triplestore implementation is called TDB and consists of:
• The node table
  - stores resources, predicates (relationships), and literals
  - maps nodes to internal node ids, and vice versa
  - node ids are 8 bytes (64 bits) long
• The triple indexes
  - stores 3 indexes into the node table
• The prefixes table
  - maps prefixes to URIs
TDB also supports ACID transactions using write-ahead logging.
TDB can also be used without transactions, but the application is then responsible for coordinating access itself: at any moment there should be either multiple readers or a single writer.
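The node table idea (every node gets a fixed-size 64-bit id, and triples are stored as three ids) can be sketched in a few lines. This is not TDB's actual implementation or file layout; the NodeTableSketch class and its idFor, nodeFor, and encode methods are hypothetical names used only to show the dictionary-encoding principle in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch of dictionary encoding; TDB itself keeps
// these structures on disk and is considerably more sophisticated.
class NodeTableSketch {
    private final Map<String, Long> nodeToId = new HashMap<>();
    private final List<String> idToNode = new ArrayList<>();

    // Returns the existing id for a node, or allocates the next free one.
    long idFor(String node) {
        Long id = nodeToId.get(node);
        if (id == null) {
            id = (long) idToNode.size();
            nodeToId.put(node, id);
            idToNode.add(node);
        }
        return id;
    }

    // Reverse lookup: from an id back to the node it stands for.
    String nodeFor(long id) {
        return idToNode.get((int) id);
    }

    // Encodes a triple as three 64-bit ids, the form a triple index would store.
    long[] encode(String s, String p, String o) {
        return new long[] { idFor(s), idFor(p), idFor(o) };
    }

    public static void main(String[] args) {
        NodeTableSketch table = new NodeTableSketch();
        long[] t1 = table.encode("ex:Jordi", "ex:address-city", "ex:Barcelona");
        long[] t2 = table.encode("ex:Jordi", "ex:firstname", "\"Jordi\"");
        System.out.println(t1[0] + " " + t1[1] + " " + t1[2]);
        System.out.println(t2[0] + " " + t2[1] + " " + t2[2]);
        System.out.println(table.nodeFor(t1[2]));
    }
}
```

Storing fixed-size ids instead of full URIs keeps index entries small and uniformly sized, and a shared node such as ex:Jordi is stored only once no matter how many triples mention it.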
RDF/SPARQL IN ACTION: DBPEDIA.ORG
DBPedia describes itself as a “crowdsourced community effort to extract structured information from Wikipedia”
• 1.89 billion triples localized in 111 languages
• English dataset contains 3.77 million topics
Imagine if you could ask Wikipedia…
• Which towns in Cataluña have a population between 10,000 and 50,000 people?
• What are the birthdays of all blues guitarists who were born in Chicago?
• (sample query from the DBPedia.org wiki) Show me all soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants
DBPedia also provides a SPARQL endpoint, so other websites can query its data and get results that are continuously updated
DBPedia also contains geo-coordinates obtained from other sources (e.g. Geonames, Eurostat, CIA World Fact Book), which opens up the possibility of location-based applications on mobile devices
CONCLUSIONS
The Semantic Web: Web 3.0?
RDF and SPARQL are key technologies in the W3C’s vision of the web of tomorrow
Companies like Google, Tesco, and Best Buy already produce RDF content!
Add some SPARQL to your projects!
[Image source: w3.org]
BIBLIOGRAPHY
Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May). The Semantic Web. http://www.scientificamerican.com/article.cfm?id=the-semantic-web
W3 Consortium. (2004, February 10). RDF Primer. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/
W3 Consortium. (2013, March 21). SPARQL 1.1 Query Language. http://www.w3.org/TR/sparql11-query/
Sequeda, J. (2013, January 31). Introduction to Triplestores. http://semanticweb.com/introduction-to-triplestores_b34996
Apache Jena. http://jena.apache.org/
DBPedia. http://dbpedia.org/

Triplestore and SPARQL

  • 2. OUTLINE The Semantic Web RDF SPARQL Triplestores Apache Jena DBPedia Conclusions Demo1: Apache Jena API Demo2: DBPedia
  • 3. THE SEMANTIC WEB Most of the data in the web consists of unstructured or semi-structured data  HTML documents  Multimedia: images, video streams, audio files  Meant to read and processed by humans What if we can structure and add metadata to this “Web of Documents”, and make them understandable by machines?  Metadata → meaning, or semantics  Machines can perform new tasks that used to require human intervention This is the motivation behind the Semantic Web!  The term “Semantic Web” was initially coined by Tim Berners-Lee: “a web of data that can be processed directly and indirectly by machines.”
  • 4. THE SEMANTIC WEB “The Semantic Web is a web of data…[it] provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries.” [w3.org] For the Semantic Web to happen, we would need 1. A way to structure and link data in a standardized way 2. A way to describe the relationships of these data in a common way 3. A way to query that linked data 4. A way to infer something from that linked data (by applying a set of rules) but we will only focus on #1 and #3
  • 5. RDF: A WAY TO STRUCTURE AND LINK DATA RDF = Resource Description Framework, a standard way for applications to represent information that can then be shared and processed A resource can be anything that is identifiable: a user, a coffee cup, a picture of your cat, a bank statement RDF provides a way to model data by breaking it down into three components: The subject The object The predicate (aka the property).
  • 6. RDF AS A GRAPH Consider the following statement: Jordi lives in Barcelona  Subject: Jordi  Object: Barcelona  Predicate: lives-in (or, to be more precise, address-city) RDFs are typically represented as a labeled directed graph:  The arrow points from the subject to the object Jordi Barcelo na address- city
  • 7. RDFS AND URIS Resources must be identifiable, and RDF uses Uniform Resource Identifier (URI) references. E.g. Jordi = http://example.org/Jordi URIs <> URLs!!! RDF graphs are typically shown with the URIs for the subject, object, and predicate: The RDF graph can also be rewritten in text as: <http://example.org/Jordi> <http://example.org/address-city> <http://example.org/Barcelona> . As you may have guessed, RDF is more machine-friendly than human-friendly! http://...Jord i http://.../Barcel ona http://.../address -city
  • 8. RDF: RESOURCES AND LITERALS The object of a triple in RDF can either be a resource (identified by URIs) or a literal (values such as strings and numbers): We can represent the RDF graph above as text as: <http://example.org/Jordi> <http://example.org/address-city> <http://example.org/Barcelona> . <http://example.org/Jordi> <http://example.org/firstname> “Jordi” . <http://example.org/Jordi> <http://example.org/age> “37” . This textual representation is also known as Terse RDF Triple Language, or Turtle for short. http://...Jord i http://.../Barcel ona http://...address- city “Jordi” 37 http://...agehttp://...firstna me
  • 9. RDF: PREFIXES Prefixes can be used to simplify representations, either in graphs: prefix ex: http://example.org or in Turtle: @prefix ex:<http://example.org/> . ex:Jordi ex:address-city ex:Barcelona . ex:Jordi ex:firstname “Jordi” . ex:Jordi ex:age “37” . Now that we have a way to structure and link our data, we want to be able to query it for information. ex:Jordi ex:Barcelona ex:address-city “Jordi” 37 ex:ageex:firstname
  • 10. SPARQL: A WAY TO QUERY LINKED DATA SPARQL = SPARQL Protocol and RDF Query Language SPARQL 1.1 became a W3C Recommendation on March 2013! Example: given our RDF graph, show all users who live in Barcelona: PREFIX ex: <http://example.com/> SELECT ?fname FROM <users.rdf> WHERE { ?user ex:address-city ex:Barcelona . ?user ex:firstname ?fname . }
  • 11. SPARQL AND GRAPH PATTERNS The statements in the WHERE clause form a graph pattern, which is matched against subgraphs in the RDF graph to form the solution. PREFIX ex: <http://example.com/> SELECT ?fname FROM <users.rdf> WHERE { ?user ex:address-city ex:Barcelona . ?user ex:firstname ?fname . } ex:Jordi ex:Barcelon a ex:address-city “Jordi ” 37 ex:ageex:firstna me ex:Badalon a ex:Josep ex:address-city
  • 12. SPARQL: THE SELECT OPERATION SPARQL SELECT operations also support: FILTERs, ORDER BYs, LIMITs, and OFFSETs: Show the names of users who live in Barcelona and are less than 40 years old, starting from the 11th to 40th user: PREFIX ex: <http://example.com/> SELECT ?lname ?fname FROM <users.rdf> WHERE { ?user ex:address-city ex:Barcelona . ?user ex:firstname ?fname . ?user ex:lastname ?lname . ?user ex:age ?age FILTER (?age < 40) } ORDER BY ?lname LIMIT 30 OFFSET 10
  • 13. SPARQL: THE SELECT OPERATION SPARQL SELECT operations also support: Alternative matches using UNION, for those cases where resources in the expected result set may match multiple patterns: Show the first names of users who live in Barcelona or in Badalona: PREFIX ex: <http://example.com/> SELECT ?fname FROM <users.rdf> WHERE { ?user ex:firstname ?fname . { { ?user ex:address-city ex:Barcelona . } UNION { ?user ex:address-city ex:Badalona . } } }
  • 14. SPARQL: THE SELECT OPERATION SPARQL SELECT operations also support: OPTIONAL matches, for those cases where not all resources in the expected result set do not have to match a pattern: Show the first names of users who live in Barcelona and their profile pic image, if they have one: PREFIX ex: <http://example.com/> SELECT ?fname ?ppic FROM <users.rdf> WHERE { ?user ex:address-city ex:Barcelona . ?user ex:firstname ?fname . OPTIONAL { ?user ex:ppic ?ppic . } }
  • 15. SPARQL: THE SELECT OPERATION SPARQL SELECT operations also support: Set inclusion (IN/NOT IN) GROUP BY, HAVING, and aggregate functions such as COUNT and AVG (new in SPARQL 1.1) Subqueries (new in SPARQL 1.1)
  • 16. SPARQL: OTHER OPERATIONS Aside from SELECTs for querying, SPARQL also has CONSTRUCT – creates a single RDF graph from the result of a query by combining (i.e. applying set union on) the resulting triples ASK – returns a Boolean that indicates whether the query is resolvable or not DESCRIBE – returns an RDF graph that describes the result (as determined by the query service) INSERT/DELETE – adds or removes triples from the graph (new in SPARQL 1.1) Graph management operations (CREATE, DROP, COPY, MOVE, ADD) (new in SPARQL 1.1)
  • 17. TRIPLESTORES The statements in an RDF graph (subject-predicate-object) are also known as triples, and the specialized database used for storing them are called triplestores. Triplestores vs Graph Databases – What’s the diff? Triplestores are especially designed to store RDF graphs, which are labeled directed graphs On the other hand, graph databases can store any kind of graph (unlabeled, undirected, weighted, etc.) Graph databases don’t have a standard query language (Cypher?) Triplestores must support SPARQL Triplestores are optimized for graph pattern matching, and may lack the full capabilities of graph DBs But graph databases can be used to implement a triplestore (see Sequeda, J. (2013, January 31) Introduction to Triplestores)
  • 18. SPARQL AND CYPHER SPARQL: PREFIX ex: <http://example.com/> SELECT ?fname FROM <users.rdf> WHERE { ?user ex:address-city ex:Barcelona . ?user ex:firstname ?fname . } Cypher: MATCH user–[:ex_firstname]->fname, user-[:ex_address-city]->city WHERE city.uri = “ex:Barcelona” RETURN fname ex:Jordi ex:Barcelon a ex:address-city “Jordi ” 37 ex:ageex:firstna me
  • 19. TRIPLESTORE IMPLEMENTATIONS Native Triplestores Sesame BigData Meronymy Apache Jena TDB Graph DB-based AllegroGraph Oracle Spatial and Graph (formerly Oracle Semantic Technologies) Relational DB-based Apache Jena SDB IBM DB2
  • 20. APACHE JENA Born in HP Labs in 2000, became a top-level Apache project in April 2012 The Jena Framework includes A Java API for working with RDF models A SPARQL query processor An efficient disk-based native triplestore A rule-based inference engine that can be used with RDF-based ontologies A server for accepting SPARQL queries over HTTP (a SPARQL endpoint)
  • 21. APACHE JENA: RDF API The Statement interface represents triples, while the Model interface represents the whole RDF graph Given a Statement, one could invoke  getSubject(), which would return a Resource  getPredicate(), which would return a Property  getObject(), which would return an RDFNode (which can be a Resource or a Literal) To create our example basic RDF graph: Model model = ModelFactory.createDefaultModel(); Resource j = model.createResource(“http://example.org/Jordi”); Resource bcn = model.createResource(“http://example.org/Barcelona”); Property addrCity = model.createProperty(“ex”, “address-city”); // This automatically creates a Statement in the associated model. j.addProperty(addrCity, bcn);
  • 22. APACHE JENA: ARQ API Jena also provides an API called ARQ for programmatically executing SPARQL queries. To execute a given query on our example graph: String queryString = “...”; Query query = QueryFactory.create(queryString); // Associate a query execution context against our model. QueryExecution qe = QueryExecutionFactory.create(query, model); ResultSet rs = qe.execSelect(); // ResultSet acts like an Iterator. for (; rs.hasNext();) { QuerySolution qs = rs.nextSolution(); RDFNode r = qs.get(“fname”); // You can get a variable by name. // Do what you want with it. } // Always good to close resources when done. qe.close();
  • 23. APACHE JENA: TDB
     Jena’s native triplestore implementation is called TDB and consists of
      - The node table
         - stores resources, predicates (relationships), and literals
         - maps nodes to internal node ids, and vice versa
         - node ids are 8 bytes (64 bits) long
      - The triple indexes
         - store 3 indexes into the node table
      - The prefixes table
         - maps prefixes to URIs
     TDB also supports ACID transactions using write-ahead logging. However, no transaction is needed if there is only a single writer (even with multiple concurrent readers)
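The node table and triple indexes described on this slide can be pictured with a small in-memory sketch. Everything here is a simplification under stated assumptions: the class name is hypothetical, sorted sets stand in for TDB's on-disk structures, and the three index orders (SPO, POS, OSP) are assumed for illustration rather than taken from the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Simplified illustration of the layout described above; not Jena's actual code.
class TdbLayoutSketch {
    // Node table: node -> id and id -> node (ids are 64-bit longs).
    private final Map<String, Long> nodeToId = new HashMap<>();
    private final List<String> idToNode = new ArrayList<>();

    // Sorts id tuples lexicographically.
    private static final Comparator<long[]> BY_IDS = (a, b) -> Arrays.compare(a, b);

    // Three indexes over the same triples, each sorted in a different order (assumed: SPO, POS, OSP).
    private final TreeSet<long[]> spo = new TreeSet<>(BY_IDS);
    private final TreeSet<long[]> pos = new TreeSet<>(BY_IDS);
    private final TreeSet<long[]> osp = new TreeSet<>(BY_IDS);

    // Returns the id for a node, allocating a new one the first time the node is seen.
    long intern(String node) {
        Long existing = nodeToId.get(node);
        if (existing != null) {
            return existing;
        }
        long id = idToNode.size();
        idToNode.add(node);
        nodeToId.put(node, id);
        return id;
    }

    // Stores one triple as an id tuple in all three indexes.
    void add(String subject, String predicate, String object) {
        long s = intern(subject);
        long p = intern(predicate);
        long o = intern(object);
        spo.add(new long[] {s, p, o});
        pos.add(new long[] {p, o, s});
        osp.add(new long[] {o, s, p});
    }

    // Answers the pattern "?subject predicate object" with a range scan over the POS index.
    List<String> subjectsWith(String predicate, String object) {
        List<String> subjects = new ArrayList<>();
        Long p = nodeToId.get(predicate);
        Long o = nodeToId.get(object);
        if (p == null || o == null) {
            return subjects; // an unknown node cannot appear in any triple
        }
        long[] from = {p, o, 0L};
        long[] to = {p, o, Long.MAX_VALUE};
        for (long[] key : pos.subSet(from, true, to, true)) {
            subjects.add(idToNode.get((int) key[2])); // translate the id back to a node
        }
        return subjects;
    }

    public static void main(String[] args) {
        TdbLayoutSketch store = new TdbLayoutSketch();
        store.add("ex:Jordi", "ex:address-city", "ex:Barcelona");
        store.add("ex:Jordi", "ex:firstname", "Jordi");
        System.out.println(store.subjectsWith("ex:address-city", "ex:Barcelona")); // prints [ex:Jordi]
    }
}
```

Keeping the same triples in several sort orders is what lets a pattern with any combination of known positions be answered by a contiguous range scan instead of a full pass over the data.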
  • 24. RDF/SPARQL IN ACTION: DBPEDIA.ORG
     DBPedia describes itself as a “crowdsourced community effort to extract structured information from Wikipedia”
      - 1.89 billion triples localized in 111 languages
      - English dataset contains 3.77 million topics
     Imagine if you could ask Wikipedia…
      - Which towns in Cataluña have a population between 10,000 and 50,000 people?
      - What are the birthdays of all blues guitarists who were born in Chicago?
      - (sample query from the DBPedia.org wiki) Show me all soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants
     DBPedia also provides a SPARQL endpoint, so other websites can query its data and get results that are continuously updated
     DBPedia also contains geo-coordinates obtained from other sources (e.g. Geonames, Eurostat, CIA World Fact Book), which opens the possibility for location-based applications on mobile devices
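As a rough illustration of how another website might call the SPARQL endpoint mentioned on this slide, the sketch below builds the HTTP GET address for a query using only the Java standard library. The endpoint address, the class name, and the vocabulary terms (dbo:birthPlace, dbo:birthDate, dbr:Chicago) are assumptions made for the example, and the query is a simplified cousin of the "born in Chicago" question; it does not filter for blues guitarists. No request is actually sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds the request address only; sending it is left to any HTTP client.
class DbpediaQueryUrlSketch {
    // Assumed public endpoint address.
    static final String ENDPOINT = "https://dbpedia.org/sparql";

    // Assumed vocabulary: people born in Chicago together with their birth dates.
    static final String QUERY = String.join("\n",
        "PREFIX dbo: <http://dbpedia.org/ontology/>",
        "PREFIX dbr: <http://dbpedia.org/resource/>",
        "SELECT ?person ?birthday WHERE {",
        "  ?person dbo:birthPlace dbr:Chicago .",
        "  ?person dbo:birthDate ?birthday .",
        "} LIMIT 10");

    // The SPARQL protocol passes the query text in the "query" parameter of a GET request.
    static URI requestUri(String endpoint, String sparql) {
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        return URI.create(endpoint + "?query=" + encoded);
    }

    public static void main(String[] args) {
        System.out.println(requestUri(ENDPOINT, QUERY));
    }
}
```

The resulting URI can be handed to any HTTP client; the response format is typically chosen with an Accept header such as application/sparql-results+json.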
  • 25. CONCLUSIONS
     The Semantic Web – Web 3.0?
      - RDF and SPARQL are key technologies in the W3C’s vision of the web of tomorrow
      - Companies like Google, Tesco, and Best Buy already produce RDF content!
     Add some SPARQL to your projects!
     Source: w3.org
  • 26. BIBLIOGRAPHY
     Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May). The Semantic Web. http://www.scientificamerican.com/article.cfm?id=the-semantic-web
     W3 Consortium. (2004, February 10). RDF Primer. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/
     W3 Consortium. (2013, March 21). SPARQL 1.1 Query Language. http://www.w3.org/TR/sparql11-query/
     Sequeda, J. (2013, January 31). Introduction to Triplestores. http://semanticweb.com/introduction-to-triplestores_b34996
     Apache Jena. http://jena.apache.org/
     DBPedia. http://dbpedia.org/

Editor's Notes

  1. A URI can be used to identify a resource by name or location (or both). If it specifies a location, it’s referred to as a URL. When used as a name, it’s referred to as a URN.
  2. Officially, the W3C proposes RDF/XML as the syntax to use when serializing RDF graphs, but Turtle was found to be easier and more manageable to edit. RDF/XML’s “lack of transparency and readability might have been a factor inhibiting rapid adoption of RDF” [Shadbolt, N; Hall, W; Berners-Lee, T, The Semantic Web Revisited, 2006]. Turtle is related to two other notations for triples, N-Triples and N3, as follows: N-Triples ⊂ Turtle ⊂ N3. N-Triples is more minimalistic, while N3 can be used to express more than just RDF (http://en.wikipedia.org/wiki/Turtle_%28syntax%29)
  3. By default the statements in the WHERE clause are conjunctions (AND)
  4. Graph management operations:
      - Graph CREATE: creates an empty named graph
      - Graph DROP: removes a named graph
      - Graph COPY(g1, g2): overwrites the contents of g2 with the contents of g1 (similar to DROP g2 followed by INSERT ALL (g1, g2)); g1 is not modified by this operation
      - Graph MOVE(g1, g2): overwrites the contents of g2 with the contents of g1, then g1 is DROPped
      - Graph ADD(g1, g2): inserts the triples from g1 into g2; g1 is not modified by this operation
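The difference between COPY, MOVE, and ADD is easiest to see on a toy model. The sketch below treats a store as a map from graph names to sets of triples; the class and method names are hypothetical, and it deliberately ignores the error cases of a real implementation (for example, what happens when the source graph does not exist).

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model of named-graph management; triples are plain strings.
class GraphManagementSketch {
    private final Map<String, Set<String>> graphs = new HashMap<>();

    // CREATE: an empty named graph (no effect if it already exists in this sketch).
    void create(String graph) {
        graphs.putIfAbsent(graph, new LinkedHashSet<>());
    }

    // DROP: removes the named graph and its triples.
    void drop(String graph) {
        graphs.remove(graph);
    }

    boolean exists(String graph) {
        return graphs.containsKey(graph);
    }

    void insert(String graph, String triple) {
        graphs.computeIfAbsent(graph, name -> new LinkedHashSet<>()).add(triple);
    }

    // Returns a copy, so callers cannot modify the store through it.
    Set<String> triplesOf(String graph) {
        return new LinkedHashSet<>(graphs.getOrDefault(graph, Set.of()));
    }

    // ADD(from, to): 'to' keeps its own triples and gains those of 'from'.
    void add(String from, String to) {
        graphs.computeIfAbsent(to, name -> new LinkedHashSet<>()).addAll(triplesOf(from));
    }

    // COPY(from, to): 'to' is overwritten; 'from' is left untouched.
    void copy(String from, String to) {
        if (from.equals(to)) {
            return;
        }
        graphs.put(to, triplesOf(from));
    }

    // MOVE(from, to): like COPY, but 'from' is dropped afterwards.
    void move(String from, String to) {
        if (from.equals(to)) {
            return;
        }
        copy(from, to);
        drop(from);
    }

    public static void main(String[] args) {
        GraphManagementSketch store = new GraphManagementSketch();
        store.insert("g1", "ex:Jordi ex:address-city ex:Barcelona");
        store.insert("g2", "ex:Jordi ex:age 37");
        store.add("g1", "g2");
        System.out.println(store.triplesOf("g2").size()); // prints 2
    }
}
```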
  5. Cypher is a de-facto standard, but is still mostly associated with Neo4J
  6. There are similarities, but Cypher has a lot of other features suitable for graph databases (e.g. find shortest path, find nodes that are n hops away from the start node, etc.) Note that most SPARQL queries expect to scan the graph for the result, while most Cypher queries typically specify a start node. This is not really an issue since specifying a start node is optional in Cypher anyway (http://docs.neo4j.org/chunked/milestone/query-start.html)
  7. Apache Jena provides its own native triplestore implementation as well as an API for leveraging relational stores (PostgreSQL, MySQL, Oracle, Microsoft SQL Server, etc.). Sesame is a Java-based implementation maintained by openrdf.org
  8. Jena entered Apache incubation in November 2010
  9. The RDF API is part of Jena’s Core API jar
  10. The QueryExecutionFactory interface also has methods for binding a Query to a SPARQL (HTTP) endpoint, allowing applications to query remote triplestores
  11. The node table is stored as a sequential access file (for NodeId -> Node mappings) and a B+Tree (for Node -> NodeId). In write-ahead logging, changes to be made to the database are first recorded in logs (in the form of redo and undo logs). In Jena, modifications made in a transaction are written to a journal (a redo log), which is later applied to the main database on disk. Jena TDB has been tested to hold up to 1.7B triples (http://www.w3.org/wiki/LargeTripleStores#Jena_TDB_.281.7B.29)
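The redo-journal idea in this note can be sketched in a few lines of plain Java. This is a toy model under stated assumptions: the class name is hypothetical, in-memory collections stand in for the journal and database files, and only additions are modeled. It is not a description of TDB's actual file formats or recovery procedure.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of a redo journal; in-memory collections stand in for files on disk.
class RedoJournalSketch {
    private final Set<String> mainStore = new LinkedHashSet<>(); // stands in for the main database
    private final List<String> journal = new ArrayList<>();      // stands in for the journal file
    private final List<String> pending = new ArrayList<>();      // changes of the open transaction

    // Changes made inside a transaction are only buffered at first.
    void add(String triple) {
        pending.add(triple);
    }

    // Commit: the changes are appended to the journal before the main store is touched.
    void commit() {
        journal.addAll(pending);
        pending.clear();
    }

    // Abort: buffered changes are discarded and never reach the journal.
    void abort() {
        pending.clear();
    }

    // Later, or during recovery after a crash, the journal is replayed into the main store.
    void replayJournal() {
        mainStore.addAll(journal);
        journal.clear();
    }

    // Committed data is visible whether or not it has been replayed yet.
    boolean isVisible(String triple) {
        return mainStore.contains(triple) || journal.contains(triple);
    }

    int journalSize() {
        return journal.size();
    }

    public static void main(String[] args) {
        RedoJournalSketch db = new RedoJournalSketch();
        db.add("ex:Jordi ex:age 37");
        db.commit();
        db.replayJournal();
        System.out.println(db.isVisible("ex:Jordi ex:age 37")); // prints true
    }
}
```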
  12. DBPedia uses OpenLink Virtuoso as its triplestore
  13. Other SW technologies:
      - RDF Schema and Web Ontology Language (OWL) provide a richer set of semantics (vocabularies) for describing a group of related concepts: genealogies (e.g. isMother, hasChildren), application-specific class hierarchies (through rdf:type, rdfs:subClassOf, etc.)
      - Rule Interchange Format (RIF) facilitates the exchange of rules across different systems (rule engines)
      - Trust and Provenance (how can we establish that an RDF source is trustworthy? Can you prove how derived (inferred) semantics were obtained?)