3. THE SEMANTIC WEB
Most of the data on the web is unstructured or
semi-structured
HTML documents
Multimedia: images, video streams, audio files
Meant to be read and processed by humans
What if we could structure this “Web of Documents”, add
metadata to it, and make it understandable by
machines?
Metadata → meaning, or semantics
Machines can perform new tasks that used to require human
intervention
This is the motivation behind the Semantic Web!
The term “Semantic Web” was initially coined by Tim Berners-Lee: “a
web of data that can be processed directly and indirectly by
machines.”
4. THE SEMANTIC WEB
“The Semantic Web is a web of data…[it] provides a common
framework that allows data to be shared and reused across
application, enterprise, and community boundaries.”
[w3.org]
For the Semantic Web to happen, we would need
1. A way to structure and link data in a standardized way
2. A way to describe the relationships of these data in a
common way
3. A way to query that linked data
4. A way to infer something from that linked data (by
applying a set of rules)
but we will only focus on #1 and #3
5. RDF: A WAY TO STRUCTURE AND
LINK DATA
RDF = Resource Description Framework, a standard way
for applications to represent information that can then
be shared and processed
A resource can be anything that is identifiable: a user, a
coffee cup, a picture of your cat, a bank statement
RDF provides a way to model data by breaking it down
into three components:
The subject
The object
The predicate (aka the property).
6. RDF AS A GRAPH
Consider the following statement: Jordi lives in Barcelona
Subject: Jordi
Object: Barcelona
Predicate: lives-in (or, to be more precise, address-city)
RDF statements are typically represented as a labeled
directed graph:
The arrow points from the subject to the object
[Figure: Jordi --address-city--> Barcelona]
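As a minimal sketch in plain Java (no RDF library), the statement above can be held as a subject-predicate-object triple, and a graph as a list of such triples. The class and method names (SimpleTriple, SimpleGraph, objectsOf) are illustrative assumptions, not part of any RDF API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one RDF statement as three strings.
final class SimpleTriple {
    final String subject;
    final String predicate;
    final String object;

    SimpleTriple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    // Render the edge the way the graph is drawn: subject --predicate--> object.
    String asEdge() {
        return subject + " --" + predicate + "--> " + object;
    }
}

// Hypothetical helper: a graph is just a collection of triples.
final class SimpleGraph {
    private final List<SimpleTriple> triples = new ArrayList<>();

    void add(String s, String p, String o) {
        triples.add(new SimpleTriple(s, p, o));
    }

    // Follow every edge labeled `predicate` that leaves `subject`.
    List<String> objectsOf(String subject, String predicate) {
        List<String> result = new ArrayList<>();
        for (SimpleTriple t : triples) {
            if (t.subject.equals(subject) && t.predicate.equals(predicate)) {
                result.add(t.object);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SimpleGraph g = new SimpleGraph();
        g.add("Jordi", "address-city", "Barcelona");
        System.out.println(g.objectsOf("Jordi", "address-city"));
    }
}
```

Asking the graph for the address-city of Jordi follows the single arrow in the diagram and returns its target.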
7. RDF AND URIS
Resources must be identifiable, and RDF uses Uniform
Resource Identifier (URI) references.
E.g. Jordi = http://example.org/Jordi
URIs are not the same as URLs! A URI may only name a resource, without giving a location for it.
RDF graphs are typically shown with the URIs for the subject, object,
and predicate:
The RDF graph can also be rewritten in text as:
<http://example.org/Jordi> <http://example.org/address-city> <http://example.org/Barcelona> .
As you may have guessed, RDF is more machine-friendly than
human-friendly!
[Figure: http://...Jordi --http://.../address-city--> http://.../Barcelona]
8. RDF: RESOURCES AND
LITERALS
The object of a triple in RDF can either be a resource
(identified by URIs) or a literal (values such as strings and
numbers):
We can represent this RDF graph as text:
<http://example.org/Jordi> <http://example.org/address-city> <http://example.org/Barcelona> .
<http://example.org/Jordi> <http://example.org/firstname> "Jordi" .
<http://example.org/Jordi> <http://example.org/age> 37 .
This textual representation is also known as Terse RDF Triple
Language, or Turtle for short.
[Figure: http://...Jordi --http://...address-city--> http://.../Barcelona; http://...Jordi --http://...firstname--> "Jordi"; http://...Jordi --http://...age--> 37]
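The difference between resources and literals shows up when a statement is written out as text: resources go inside angle brackets, literals inside quotes. The following is a rough sketch under simplifying assumptions (plain string literals only, no escaping, no datatypes); RdfTerm and StatementWriter are hypothetical names, not library classes.

```java
// Hypothetical term type: either a resource (URI) or a plain string literal.
final class RdfTerm {
    final String value;
    final boolean isLiteral;

    private RdfTerm(String value, boolean isLiteral) {
        this.value = value;
        this.isLiteral = isLiteral;
    }

    static RdfTerm resource(String uri) {
        return new RdfTerm(uri, false);
    }

    static RdfTerm literal(String text) {
        return new RdfTerm(text, true);
    }

    // Resources are written as <uri>, literals as "text" (escaping is omitted here).
    String serialize() {
        return isLiteral ? "\"" + value + "\"" : "<" + value + ">";
    }
}

final class StatementWriter {
    // One line per statement: subject, predicate, object, then " ."
    static String write(RdfTerm s, RdfTerm p, RdfTerm o) {
        if (s.isLiteral || p.isLiteral) {
            // Only the object position may hold a literal.
            throw new IllegalArgumentException("subject and predicate must be resources");
        }
        return s.serialize() + " " + p.serialize() + " " + o.serialize() + " .";
    }

    public static void main(String[] args) {
        String ex = "http://example.org/";
        RdfTerm jordi = RdfTerm.resource(ex + "Jordi");
        System.out.println(write(jordi, RdfTerm.resource(ex + "address-city"), RdfTerm.resource(ex + "Barcelona")));
        System.out.println(write(jordi, RdfTerm.resource(ex + "firstname"), RdfTerm.literal("Jordi")));
    }
}
```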
9. RDF: PREFIXES
Prefixes can be used to simplify representations, either in
graphs:
prefix ex: http://example.org/
or in Turtle:
@prefix ex: <http://example.org/> .
ex:Jordi ex:address-city ex:Barcelona .
ex:Jordi ex:firstname "Jordi" .
ex:Jordi ex:age 37 .
Now that we have a way to structure and link our data, we
want to be able to query it for information.
[Figure: ex:Jordi --ex:address-city--> ex:Barcelona; ex:Jordi --ex:firstname--> "Jordi"; ex:Jordi --ex:age--> 37]
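A prefix is only shorthand: a prefixed name such as ex:Jordi stands for the namespace URI with the local name appended. A small sketch of that expansion in plain Java follows; SimplePrefixes and its methods are hypothetical names used for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that maps prefixes to namespace URIs and back.
final class SimplePrefixes {
    private final Map<String, String> prefixes = new HashMap<>();

    void declare(String prefix, String namespace) {
        prefixes.put(prefix, namespace);
    }

    // "ex:Jordi" becomes "http://example.org/Jordi" once "ex" is declared.
    String expand(String prefixedName) {
        int colon = prefixedName.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("not a prefixed name: " + prefixedName);
        }
        String namespace = prefixes.get(prefixedName.substring(0, colon));
        if (namespace == null) {
            throw new IllegalArgumentException("unknown prefix in: " + prefixedName);
        }
        return namespace + prefixedName.substring(colon + 1);
    }

    // The reverse direction; the URI is returned unchanged if no namespace matches.
    String shorten(String uri) {
        for (Map.Entry<String, String> e : prefixes.entrySet()) {
            if (uri.startsWith(e.getValue())) {
                return e.getKey() + ":" + uri.substring(e.getValue().length());
            }
        }
        return uri;
    }

    public static void main(String[] args) {
        SimplePrefixes p = new SimplePrefixes();
        p.declare("ex", "http://example.org/");
        System.out.println(p.expand("ex:Jordi"));
        System.out.println(p.shorten("http://example.org/Barcelona"));
    }
}
```

Two prefixed names refer to the same resource only when their expanded URIs are identical, so the namespace declared for a prefix matters as much as the local name.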
10. SPARQL: A WAY TO QUERY
LINKED DATA
SPARQL = SPARQL Protocol and RDF Query Language
SPARQL 1.1 became a W3C Recommendation in March
2013!
Example: given our RDF graph, show the first names of all
users who live in Barcelona:
PREFIX ex: <http://example.org/>
SELECT ?fname
FROM <users.rdf>
WHERE {
  ?user ex:address-city ex:Barcelona .
  ?user ex:firstname ?fname .
}
11. SPARQL AND GRAPH
PATTERNS
The statements in the WHERE clause form a graph
pattern, which is matched against subgraphs in the RDF
graph to form the solution.
PREFIX ex: <http://example.org/>
SELECT ?fname
FROM <users.rdf>
WHERE {
  ?user ex:address-city ex:Barcelona .
  ?user ex:firstname ?fname .
}
[Figure: ex:Jordi --ex:address-city--> ex:Barcelona; ex:Jordi --ex:firstname--> "Jordi"; ex:Jordi --ex:age--> 37; ex:Josep --ex:address-city--> ex:Badalona]
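To make the idea of matching a graph pattern concrete, here is a deliberately naive sketch in plain Java. It is not how a SPARQL engine is implemented; it only illustrates that each triple pattern is compared against every triple, and that a variable (a term starting with ?) must keep the same value across all patterns. PatternMatcher and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PatternMatcher {
    // A term starting with '?' is a variable; anything else must match exactly.
    private static boolean isVar(String term) {
        return term.startsWith("?");
    }

    // Extend `binding` so the pattern term equals the data value; null on conflict.
    private static Map<String, String> bind(Map<String, String> binding, String term, String value) {
        if (binding == null) {
            return null;
        }
        if (!isVar(term)) {
            return term.equals(value) ? binding : null;
        }
        String bound = binding.get(term);
        if (bound == null) {
            Map<String, String> extended = new HashMap<>(binding);
            extended.put(term, value);
            return extended;
        }
        return bound.equals(value) ? binding : null;
    }

    // Apply the patterns one by one; each keeps or extends the partial solutions.
    static List<Map<String, String>> match(List<String[]> graph, List<String[]> patterns) {
        List<Map<String, String>> solutions = new ArrayList<>();
        solutions.add(new HashMap<>());
        for (String[] pattern : patterns) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> partial : solutions) {
                for (String[] triple : graph) {
                    Map<String, String> b = bind(partial, pattern[0], triple[0]);
                    b = bind(b, pattern[1], triple[1]);
                    b = bind(b, pattern[2], triple[2]);
                    if (b != null) {
                        next.add(b);
                    }
                }
            }
            solutions = next;
        }
        return solutions;
    }

    public static void main(String[] args) {
        List<String[]> graph = List.of(
            new String[] {"ex:Jordi", "ex:address-city", "ex:Barcelona"},
            new String[] {"ex:Jordi", "ex:firstname", "\"Jordi\""},
            new String[] {"ex:Josep", "ex:address-city", "ex:Badalona"});
        List<String[]> where = List.of(
            new String[] {"?user", "ex:address-city", "ex:Barcelona"},
            new String[] {"?user", "ex:firstname", "?fname"});
        for (Map<String, String> solution : match(graph, where)) {
            System.out.println(solution.get("?fname"));
        }
    }
}
```

Only the subgraph around ex:Jordi satisfies both patterns with the same value for ?user, so ex:Josep, who lives in Badalona, does not appear in the solution.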
12. SPARQL: THE SELECT
OPERATION
SPARQL SELECT operations also support:
FILTERs, ORDER BYs, LIMITs, and OFFSETs:
Show the names of users who live in Barcelona and are under
40 years old, from the 11th to the 40th user:
PREFIX ex: <http://example.org/>
SELECT ?lname ?fname
FROM <users.rdf>
WHERE {
  ?user ex:address-city ex:Barcelona .
  ?user ex:firstname ?fname .
  ?user ex:lastname ?lname .
  ?user ex:age ?age .
  FILTER (?age < 40)
}
ORDER BY ?lname
LIMIT 30
OFFSET 10
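The modifiers in this query behave like a small pipeline over the matched solutions: filter, then sort, then skip the offset, then cap at the limit. The sketch below mirrors that order with Java streams; UserRow and page are hypothetical names, and the sample rows in main are made up for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class SolutionModifiers {
    // Hypothetical row type standing in for one solution of the WHERE clause.
    static final class UserRow {
        final String lastName;
        final String firstName;
        final int age;

        UserRow(String lastName, String firstName, int age) {
            this.lastName = lastName;
            this.firstName = firstName;
            this.age = age;
        }
    }

    static List<UserRow> page(List<UserRow> matches, int maxAgeExclusive, int offset, int limit) {
        return matches.stream()
            .filter(u -> u.age < maxAgeExclusive)                     // FILTER (?age < 40)
            .sorted(Comparator.comparing((UserRow u) -> u.lastName))  // ORDER BY ?lname
            .skip(offset)                                             // OFFSET 10
            .limit(limit)                                             // LIMIT 30
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<UserRow> rows = List.of(
            new UserRow("Vila", "Anna", 35),
            new UserRow("Bosch", "Pere", 52),
            new UserRow("Puig", "Marta", 28));
        for (UserRow u : page(rows, 40, 0, 30)) {
            System.out.println(u.lastName + " " + u.firstName);
        }
    }
}
```

With OFFSET 10 and LIMIT 30, the first ten sorted solutions are skipped and at most the next thirty are returned, which is the 11th through the 40th user.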
13. SPARQL: THE SELECT
OPERATION
SPARQL SELECT operations also support:
Alternative matches using UNION, for those cases
where resources in the expected result set may match
multiple patterns:
Show the first names of users who live in Barcelona or
in Badalona:
PREFIX ex: <http://example.org/>
SELECT ?fname
FROM <users.rdf>
WHERE {
  ?user ex:firstname ?fname .
  {
    { ?user ex:address-city ex:Barcelona . }
    UNION
    { ?user ex:address-city ex:Badalona . }
  }
}
14. SPARQL: THE SELECT
OPERATION
SPARQL SELECT operations also support:
OPTIONAL matches, for those cases where resources in the
expected result set do not all have to match a
pattern:
Show the first names of users who live in Barcelona and
their profile pic image, if they have one:
PREFIX ex: <http://example.org/>
SELECT ?fname ?ppic
FROM <users.rdf>
WHERE {
  ?user ex:address-city ex:Barcelona .
  ?user ex:firstname ?fname .
  OPTIONAL {
    ?user ex:ppic ?ppic .
  }
}
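OPTIONAL behaves like a left join: every solution of the required patterns is kept, and the optional variable is filled in only when a match exists. A rough sketch in plain Java, assuming at most one picture per user (a real query would yield one row per matching picture); OptionalJoin and withOptionalPic are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OptionalJoin {
    // required: solutions of the mandatory patterns, each binding "?user".
    // pics: user -> picture, for the users that have an ex:ppic triple.
    static List<Map<String, String>> withOptionalPic(
            List<Map<String, String>> required, Map<String, String> pics) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : required) {
            Map<String, String> extended = new HashMap<>(row);
            String pic = pics.get(row.get("?user"));
            if (pic != null) {
                extended.put("?ppic", pic); // optional part matched
            }
            // The row is kept either way; without a match ?ppic stays unbound.
            out.add(extended);
        }
        return out;
    }

    public static void main(String[] args) {
        // ex:Marta and the picture URI are made up for illustration.
        List<Map<String, String>> required = List.of(
            Map.of("?user", "ex:Jordi", "?fname", "Jordi"),
            Map.of("?user", "ex:Marta", "?fname", "Marta"));
        Map<String, String> pics = Map.of("ex:Jordi", "http://example.org/pics/jordi.png");
        for (Map<String, String> row : withOptionalPic(required, pics)) {
            System.out.println(row.get("?fname") + " " + row.containsKey("?ppic"));
        }
    }
}
```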
15. SPARQL: THE SELECT
OPERATION
SPARQL SELECT operations also support:
Set inclusion (IN/NOT IN)
GROUP BY, HAVING, and aggregate functions such
as COUNT and AVG (new in SPARQL 1.1)
Subqueries (new in SPARQL 1.1)
16. SPARQL: OTHER OPERATIONS
Aside from SELECTs for querying, SPARQL also has
CONSTRUCT – creates a single RDF graph from the
result of a query by combining (i.e. applying set union
on) the resulting triples
ASK – returns a Boolean that indicates whether the
query pattern has at least one match
DESCRIBE – returns an RDF graph that describes the
result (as determined by the query service)
INSERT/DELETE – adds or removes triples from the
graph (new in SPARQL 1.1)
Graph management operations (CREATE, DROP, COPY,
MOVE, ADD) (new in SPARQL 1.1)
17. TRIPLESTORES
The statements in an RDF graph (subject-predicate-object) are also
known as triples, and the specialized databases used for storing
them are called triplestores.
Triplestores vs Graph Databases – What’s the diff?
Triplestores are especially designed to store RDF graphs, which
are labeled directed graphs
On the other hand, graph databases can store any kind of graph
(unlabeled, undirected, weighted, etc.)
Graph databases don’t have a standard query language (Cypher?)
Triplestores must support SPARQL
Triplestores are optimized for graph pattern matching, and may
lack the full capabilities of graph DBs
But graph databases can be used to implement a triplestore
(see Sequeda, J. (2013, January 31) Introduction to
Triplestores)
18. SPARQL AND CYPHER
SPARQL:
PREFIX ex: <http://example.org/>
SELECT ?fname
FROM <users.rdf>
WHERE {
  ?user ex:address-city ex:Barcelona .
  ?user ex:firstname ?fname .
}
Cypher:
MATCH (user)-[:ex_firstname]->(fname),
      (user)-[:`ex_address-city`]->(city)
WHERE city.uri = "ex:Barcelona"
RETURN fname
[Figure: ex:Jordi --ex:address-city--> ex:Barcelona; ex:Jordi --ex:firstname--> "Jordi"; ex:Jordi --ex:age--> 37]
20. APACHE JENA
Born in HP Labs in 2000, became a top-level Apache
project in April 2012
The Jena Framework includes
A Java API for working with RDF models
A SPARQL query processor
An efficient disk-based native triplestore
A rule-based inference engine that can be used with
RDF-based ontologies
A server for accepting SPARQL queries over HTTP (a
SPARQL endpoint)
21. APACHE JENA: RDF API
The Statement interface represents triples, while the Model
interface represents the whole RDF graph
Given a Statement, one could invoke
getSubject(), which would return a Resource
getPredicate(), which would return a Property
getObject(), which would return an RDFNode (which can be a
Resource or a Literal)
To create our example basic RDF graph:
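import org.apache.jena.rdf.model.*; // Jena 3+; Jena 2.x used com.hp.hpl.jena.rdf.model

Model model = ModelFactory.createDefaultModel();
Resource j = model.createResource("http://example.org/Jordi");
Resource bcn = model.createResource("http://example.org/Barcelona");
// createProperty takes the namespace URI and the local name, not a prefix.
Property addrCity = model.createProperty("http://example.org/", "address-city");
// This automatically creates a Statement in the associated model.
j.addProperty(addrCity, bcn);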
22. APACHE JENA: ARQ API
Jena also provides an API called ARQ for
programmatically executing SPARQL queries.
To execute a given query on our example graph:
import org.apache.jena.query.*; // Jena 3+; Jena 2.x used com.hp.hpl.jena.query
import org.apache.jena.rdf.model.RDFNode;

String queryString = "...";
Query query = QueryFactory.create(queryString);
// Associate a query execution context with our model.
QueryExecution qe = QueryExecutionFactory.create(query, model);
try {
    ResultSet rs = qe.execSelect();
    // ResultSet acts like an Iterator.
    while (rs.hasNext()) {
        QuerySolution qs = rs.nextSolution();
        RDFNode r = qs.get("fname"); // You can get a variable by name.
        // Do what you want with it.
    }
} finally {
    // Always close the execution when done, even if the query fails.
    qe.close();
}
23. APACHE JENA: TDB
Jena’s native triplestore implementation is called TDB and
consists of
The node table
stores resources, predicates (relationships), and literals
maps nodes to internal node ids, and vice versa
node ids are 8 bytes (64 bits) long
The triple indexes
hold 3 indexes over the triples (SPO, POS, OSP), whose entries are node ids from the node table
The prefixes table
maps prefixes to URIs
TDB also supports ACID transactions using write-ahead
logging.
But no transaction is needed when there is only a single
writer (even with multiple concurrent readers)
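The split between a node table and triple indexes can be illustrated with an in-memory stand-in. This is only a sketch of the idea, not TDB's on-disk layout or API: nodes are interned to 64-bit ids, and triples are stored as id tuples so that lookups compare numbers instead of strings. NodeTableSketch and its methods are hypothetical names, and only one index ordering is kept.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NodeTableSketch {
    // Node table: node -> id and id -> node.
    private final Map<String, Long> nodeToId = new HashMap<>();
    private final List<String> idToNode = new ArrayList<>();
    // A single index in subject-predicate-object order, holding ids only.
    private final List<long[]> spo = new ArrayList<>();

    // Return the existing id for a node, or assign the next free one.
    long idFor(String node) {
        Long id = nodeToId.get(node);
        if (id == null) {
            id = (long) idToNode.size();
            nodeToId.put(node, id);
            idToNode.add(node);
        }
        return id;
    }

    String nodeFor(long id) {
        return idToNode.get((int) id);
    }

    void addTriple(String s, String p, String o) {
        spo.add(new long[] {idFor(s), idFor(p), idFor(o)});
    }

    // Find objects by comparing ids, then map the result ids back to nodes.
    List<String> objects(String s, String p) {
        List<String> result = new ArrayList<>();
        Long sid = nodeToId.get(s);
        Long pid = nodeToId.get(p);
        if (sid == null || pid == null) {
            return result; // unknown node: nothing can match
        }
        for (long[] t : spo) {
            if (t[0] == sid && t[1] == pid) {
                result.add(nodeFor(t[2]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NodeTableSketch store = new NodeTableSketch();
        store.addTriple("ex:Jordi", "ex:address-city", "ex:Barcelona");
        store.addTriple("ex:Josep", "ex:address-city", "ex:Badalona");
        System.out.println(store.objects("ex:Jordi", "ex:address-city"));
    }
}
```

Note that ex:address-city is stored once in the node table even though it appears in two triples; the index rows only repeat its id.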
24. RDF/SPARQL IN ACTION:
DBPEDIA.ORG
DBPedia describes itself as a “crowdsourced community
effort to extract structured information from Wikipedia”
1.89 billion triples localized in 111 languages
English dataset contains 3.77 million topics
Imagine if you could ask Wikipedia…
Which towns in Cataluña have a population between 10,000 and 50,000
people?
What are the birthdays of all blues guitarists who were born in Chicago?
(sample query from DBPedia.org wiki) Show me all soccer players who
played as goalkeeper for a club that has a stadium with more than 40,000
seats and who are born in a country with more than 10 million inhabitants
DBPedia also provides a SPARQL endpoint, so other websites
can query its data and get results that are continuously
updated
DBPedia also contains geo-coordinates obtained from other
sources (e.g. Geonames, Eurostat, CIA World Fact Book) –
this opens the possibility for location-based applications
from mobile devices
25. CONCLUSIONS
The Semantic Web – Web 3.0?
RDF and SPARQL are key
technologies in the W3C’s vision
of the web of tomorrow
Companies like Google, Tesco,
and Best Buy already produce
RDF content!
Add some SPARQL to your
projects!
Source:
w3.org
26. BIBLIOGRAPHY
Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May). The Semantic Web.
http://www.scientificamerican.com/article.cfm?id=the-semantic-web
W3 Consortium. (2004, February 10). RDF Primer.
http://www.w3.org/TR/2004/REC-rdf-primer-20040210/
W3 Consortium. (2013, March 21). SPARQL 1.1 Query Language
http://www.w3.org/TR/sparql11-query/
Sequeda, J. (2013, January 31). Introduction to Triplestores.
http://semanticweb.com/introduction-to-triplestores_b34996
Apache Jena
http://jena.apache.org/
DBPedia
http://dbpedia.org/
Editor's Notes
A URI can be used to identify a resource by name or location (or both). If it specifies a location, it’s referred to as a URL. When used as a name, it’s referred to as a URN.
Officially the W3C proposes RDF/XML as the syntax to use when serializing RDF graphs, but Turtle was found to be easier and more manageable to edit. RDF/XML’s “lack of transparency and readability might have been a factor inhibiting rapid adoption of RDF” [Shadbolt, N; Hall, W; Berners-Lee, T, The Semantic Web Revisited, 2006]
Turtle is related to two other notations for triples, N-Triples and N3, following this relation:
N-Triples ⊂ Turtle ⊂ N3
N-Triples is more minimalistic, while N3 can be used to express more than just RDF (http://en.wikipedia.org/wiki/Turtle_%28syntax%29)
By default the statements in the WHERE clause are conjunctions (AND)
Graph CREATE: creates an empty named graph
Graph DROP: removes a named graph
Graph COPY(g1, g2): overwrites the contents of g2 with the contents of g1 (similar to DROP g2 followed by INSERT ALL (g1, g2) – g1 is not modified by this operation)
Graph MOVE(g1, g2): overwrites the contents of g2 with the contents of g1, then g1 is DROPped
Graph ADD(g1, g2): inserts triples from g1 into g2 – g1 is not modified by this operation
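The differences between these graph management operations are easiest to see side by side. The sketch below models named graphs as sets of triples in plain Java; it follows the descriptions in these notes only, omits error handling for missing graphs, and uses hypothetical names (GraphStoreSketch, triplesOf).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class GraphStoreSketch {
    // Named graphs: graph name -> set of triples (one string per triple, for brevity).
    private final Map<String, Set<String>> graphs = new HashMap<>();

    // CREATE: make an empty named graph if it does not exist yet.
    void create(String name) {
        graphs.putIfAbsent(name, new HashSet<>());
    }

    // DROP: remove the named graph and everything in it.
    void drop(String name) {
        graphs.remove(name);
    }

    void insert(String name, String triple) {
        graphs.computeIfAbsent(name, k -> new HashSet<>()).add(triple);
    }

    // ADD: union g1 into g2; g1 is not modified.
    void add(String g1, String g2) {
        graphs.computeIfAbsent(g2, k -> new HashSet<>()).addAll(triplesOf(g1));
    }

    // COPY: g2 is overwritten with the contents of g1; g1 is not modified.
    void copy(String g1, String g2) {
        if (g1.equals(g2)) {
            return;
        }
        graphs.put(g2, new HashSet<>(triplesOf(g1)));
    }

    // MOVE: like COPY, after which g1 is dropped.
    void move(String g1, String g2) {
        if (g1.equals(g2)) {
            return;
        }
        copy(g1, g2);
        drop(g1);
    }

    // A missing graph is treated as empty in this sketch.
    Set<String> triplesOf(String name) {
        return graphs.getOrDefault(name, Set.of());
    }

    public static void main(String[] args) {
        GraphStoreSketch store = new GraphStoreSketch();
        store.insert("g1", "ex:Jordi ex:address-city ex:Barcelona");
        store.insert("g2", "ex:Josep ex:address-city ex:Badalona");
        store.add("g1", "g2");
        System.out.println(store.triplesOf("g2").size());
    }
}
```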
Cypher is a de-facto standard, but is still mostly associated with Neo4J
There are similarities, but Cypher has a lot of other features suitable for graph databases (e.g. find shortest path, find nodes that are n hops away from the start node, etc.)
Note that most SPARQL queries expect to scan the graph for the result, while most Cypher queries typically specify a start node. This is not really an issue since specifying a start node is optional in Cypher anyway (http://docs.neo4j.org/chunked/milestone/query-start.html)
Apache Jena provides its own native triplestore implementation as well as an API for leveraging relational stores (PostgreSQL, MySQL, Oracle, Microsoft SQL Server, etc)
Sesame is a Java-based implementation maintained by openrdf.org
Jena entered Apache incubation in November 2010
The RDF API is part of Jena’s Core API jar
The QueryExecutionFactory interface also has methods for binding a Query to a SPARQL (HTTP) endpoint, allowing applications to query remote triplestores
The node table is stored as a sequential access file (for NodeId -> Node mappings) and a B+Tree (for Node -> NodeId)
In write-ahead logging, changes to be made to the database are recorded in logs (in the form of redo and undo logs). In Jena, modifications made in a txn are written to a journal (a redo log), which is later committed to disk.
Jena TDB has been tested to hold up to 1.7B triples (http://www.w3.org/wiki/LargeTripleStores#Jena_TDB_.281.7B.29)
DBPedia uses OpenLink Virtuoso as its triplestore
Other SW technologies:
RDF Schema and Web Ontology Language (OWL) provide a richer set of semantics (vocabularies) for describing a group of related concepts: genealogies (e.g. isMother, hasChildren), application-specific class hierarchies (through rdf:type, rdfs:subClassOf, etc.)
Rule Interchange Format (RIF) to facilitate the exchange of rules across different systems (rule engines)
Trust and Provenance (how can we establish that an RDF source is trustworthy? Can you prove how derived (inferred) semantics were obtained?)