First public presentation of CrawlerLD (we had not chosen the name yet at that point =) ), given at the 16th International Conference on Enterprise Information Systems (ICEIS), where it won the best paper award in the area Software Agents and Internet Computing
1. A metadata focused crawler for Linked Data
Raphael do Vale A. Gomes1, Marco A. Casanova1,
Giseli Rabello Lopes1 and Luiz André P. Paes Leme2
2. Outline
Introduction
Background
Use case
A metadata focused crawler
Tests and results
Conclusions and future work
Acknowledgments
ICEIS 2014 - April, 27-30, 2014,
Lisbon, Portugal
3. Introduction
Linked Data principles
Use URIs as names for things
Use HTTP URIs so that people can look up those names
When someone looks up a URI, provide useful information,
using the standards (RDF*, SPARQL)
Include links to other URIs, so that they can discover more
things
Source: http://www.w3.org/DesignIssues/LinkedData.html
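The second and third principles can be illustrated with a small sketch of how a client might look up a URI and ask for RDF rather than HTML. The helper name `build_lookup` is hypothetical, the request is only prepared (never sent), and whether a given server honors the `text/turtle` media type is an assumption.

```python
from urllib.request import Request


def build_lookup(uri: str, rdf_format: str = "text/turtle") -> Request:
    """Prepare (but do not send) an HTTP GET that asks for RDF instead of HTML."""
    return Request(uri, headers={"Accept": rdf_format})


# dbpedia:Music is the example term used later in these slides.
request = build_lookup("http://dbpedia.org/resource/Music")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```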
4. Introduction
How can we recommend linked data sources to a
beginner user?
Data sources may not use popular ontologies
There might be more than one ontology for the same
domain
The user may not know all (if any) of the ontologies
5. Introduction
Our solution:
Create a recommender system that receives a small set of
generic URI resources and returns a complete report of
related resources (URIs, Datasets and Ontologies)
Why generic? Because our user is a beginner exploring Linked Data. They do not have to know about specific datasets or
ontologies; they only need to know how to get started.
The recommender system would benefit from a Linked
Data crawler, based on metadata
6. Introduction
Metadata focused crawler
INPUT:
The user summarizes the desired domain with a small set of related
terms (URI resources)
OUTPUT:
The tool returns a list of vocabulary terms, as well as provenance
data indicating how the output was generated
Using the output, the user evaluates which vocabularies are most
relevant for the triplification or linkage process
This step could be manual or use another tool (e.g., a recommender
system)
7. Background
Important properties
rdfs:subClassOf, owl:sameAs, rdfs:seeAlso and
rdf:type
SPARQL Queries
Similar to SQL
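To make the comparison with SQL concrete, here is a minimal sketch of triple-pattern matching over an in-memory list of triples. The `ex:` and `ex2:` terms and the `match` helper are made up for illustration; a real SPARQL engine does far more than this.

```python
# Hypothetical triples using the properties listed above.
TRIPLES = [
    ("ex:LoveSong", "rdfs:subClassOf", "ex:Music"),
    ("ex:Music", "owl:sameAs", "ex2:Music"),
    ("ex:song42", "rdf:type", "ex:LoveSong"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None plays the role of a variable."""
    return [
        (ts, tp, to)
        for ts, tp, to in triples
        if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o)
    ]


# Roughly: SELECT ?item WHERE { ?item rdfs:subClassOf ex:Music }
subclasses = [s for s, _, _ in match(TRIPLES, p="rdfs:subClassOf", o="ex:Music")]
print(subclasses)
```

As in SQL, the pattern acts like a WHERE clause and the selected variable like a projected column.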
8. Use case
Scenario
The user wants to publish a relational database storing music data
as Linked Data
9. Use case
Input
The user defines an initial set T of terms to describe the
application domain
Example: dbpedia:Music (from DBpedia) -> Metadata Focused Crawler
10. Use case
Process
The crawler focuses on finding new terms
Subclasses of the class, or related terms (owl:sameAs or rdfs:seeAlso)
Also counts the number of instances of the class found in each dataset
11. Use case
Output - The crawler will return:
1. List of the terms found, indicating their provenance
2. For each term found, an estimation of the number of instances in each tripleset probed
Example: wordnet:synset-music-noun-1 -> owl:sameAs -> opencyc:Music -> rdfs:subClassOf -> opencyc:LoveSong -> instance -> 500 instances.
...
12. A metadata focused crawler
Our solution:
Executes several SPARQL queries over the entire LOD Cloud
(Linked Open Data Cloud)
For each dataset, applies several queries to try to
discover relationships between the dataset and the resource
being crawled
A breadth-first algorithm is used to discover more data in cycles
13. A metadata focused crawler
Crawling terms
Terms selected for crawling
Initial crawling terms
The initial set of terms selected by the user
Crawling properties
The list of properties that will be used to crawl
Crawling frontier
14. A metadata focused crawler
Crawling queries
Each crawling query is applied to each dataset found
Each crawling property is crawled using one query
For each crawling term, all such queries are applied to all
datasets
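The combination described above can be sketched as a simple enumeration of every (term, property, dataset) probe. The function name `plan_probes` and the dataset names are hypothetical; the slides do not say how the tool schedules its queries.

```python
from itertools import product


def plan_probes(terms, properties, datasets):
    """Enumerate every (term, property, dataset) combination to be queried."""
    return list(product(terms, properties, datasets))


probes = plan_probes(
    terms=["dbpedia:Music"],
    properties=["rdfs:subClassOf", "owl:sameAs", "rdfs:seeAlso"],
    datasets=["dataset-A", "dataset-B"],  # hypothetical dataset names
)
print(len(probes))  # 1 term x 3 properties x 2 datasets
```

The number of queries is the product of the three list sizes, which is why limits on the crawl matter once the set of terms starts growing.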
15. A metadata focused crawler
Crawling queries
SPARQL Endpoint or RDF dump – inverted query
SELECT distinct ?item
WHERE { ?item p <t> }
Instance count
Similar to other queries, but only the result size is saved
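As a rough illustration, the sketch below builds the text of the inverted query for a crawling property p and a crawling term t, and treats the instance count as the size of a result list. The function names are hypothetical and the example URIs are assumed for illustration only; the slides do not show how the tool assembles or executes its queries.

```python
def inverted_query(prop_uri: str, term_uri: str) -> str:
    """Build the inverted query: every ?item linked to the term through prop."""
    return f"SELECT DISTINCT ?item WHERE {{ ?item <{prop_uri}> <{term_uri}> }}"


def count_instances(result_rows: list) -> int:
    """Instance count: same kind of query, but only the result size is kept."""
    return len(result_rows)


RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
query = inverted_query(RDFS_SUBCLASS_OF, "http://dbpedia.org/ontology/MusicalWork")
print(query)
```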
16. A metadata focused crawler
Crawling stages
Challenge: based on generic terms, how can we
discover more data?
Answer: using strong relationships (owl:sameAs,
rdfs:subClassOf, rdfs:seeAlso and rdf:type)
[Figure: Schema.org, DBpedia, WordNet, Music Ontology, BBC Music; label: More specific]
17. A metadata focused crawler
Crawling stages
Each new resource found is saved for the next level of
crawling
Crawling frontier
All terms selected to be processed in the next cycle
Circular references are prevented
Parameters to prevent large processing times
Number of stages
Maximum number of terms probed
Maximum number of terms probed, for each term in the crawling
frontier
Maximum number of terms probed in each tripleset, for each term in
the crawling frontier
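The staged crawl can be sketched as a breadth-first loop over a frontier. This is a minimal sketch, not the tool's implementation: `crawl`, `find_related` and the toy link table are hypothetical, and only two of the four limits (number of stages, total number of terms) are modeled.

```python
def crawl(initial_terms, find_related, max_stages=2, max_terms=100):
    """Breadth-first crawl: each stage processes the frontier built by the last one."""
    seen = set(initial_terms)   # remembering visited terms prevents circular references
    frontier = list(initial_terms)
    provenance = {}             # found term -> (term it came from, property used)
    for _ in range(max_stages):
        next_frontier = []
        for term in frontier:
            for prop, found in find_related(term):
                if found in seen:
                    continue
                if len(seen) >= max_terms:
                    return seen, provenance
                seen.add(found)
                provenance[found] = (term, prop)
                next_frontier.append(found)
        frontier = next_frontier
        if not frontier:
            break
    return seen, provenance


# Toy stand-in for querying the datasets (assumed data).
TOY_LINKS = {
    "wordnet:synset-music-noun-1": [("owl:sameAs", "opencyc:Music")],
    "opencyc:Music": [
        ("rdfs:subClassOf", "opencyc:LoveSong"),
        ("owl:sameAs", "wordnet:synset-music-noun-1"),  # cycle back to the start
    ],
}


def find_related(term):
    return TOY_LINKS.get(term, [])


terms, provenance = crawl(["wordnet:synset-music-noun-1"], find_related)
print(sorted(terms))
```

The `provenance` mapping records how each term was reached, which is the information the output report needs.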
18. A metadata focused crawler
Crawling stages
Example
wordnet:synset-music-noun-1 -> owl:sameAs ->
OpenCyc:Music -> rdfs:subClassOf ->
OpenCyc:LoveSong -> instance -> 500 instances.
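A provenance chain like the one in this example could be rendered from parent links, as in the sketch below. The `provenance_path` helper and the dictionary layout are assumptions made for illustration, and the chain is assumed to contain no cycles.

```python
def provenance_path(term, provenance, instance_count):
    """Walk parent links back to the initial term and render the chain as text."""
    steps = [term]
    while term in provenance:
        parent, prop = provenance[term]
        steps = [parent, prop] + steps
        term = parent
    return " -> ".join(steps) + f" -> instance -> {instance_count} instances."


# found term -> (term it was reached from, property used); assumed layout
provenance = {
    "OpenCyc:Music": ("wordnet:synset-music-noun-1", "owl:sameAs"),
    "OpenCyc:LoveSong": ("OpenCyc:Music", "rdfs:subClassOf"),
}
path = provenance_path("OpenCyc:LoveSong", provenance, 500)
print(path)
```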
19. Tests and results
Domain:
Music
Term | Instance | Subclass | SameAs | SeeAlso
mo:MusicArtist | 103,541 | 2 | -- | --
mo:MusicalWork | 16,833 | 1 | -- | --
dbpedia:MusicalWork | 145,656 | 5 from dbpedia and 21,413 from yago | 2 | 12
dbpedia:Song | 10,987 | 1 | 1 | 14 (half in Japanese)
dbpedia:Album | 100,090 | 3 plus over 17,222 from yago | 3 and other languages | --
dbpedia:MusicalArtist | 49,973 | 2 plus 2,178 from yago | 2 | 1
dbpedia:Single | 44,623 | 3,414 | -- | 9
20. Tests and results
Music domain
Tool | Precision | Recall
Metadata Focused Crawler | 95% | 91%
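For reference, precision and recall can be computed from the set of terms the crawler returned and the set of terms judged relevant. The sketch below uses made-up term identifiers, not the evaluation data behind the figures above; the slides do not say how relevance was judged.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = {"t1", "t2", "t3", "t4"}       # hypothetical crawler output
relevant = {"t1", "t2", "t3", "t5", "t6"}  # hypothetical gold standard
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```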
21. Lessons learned
Parameter setting
Without limits, the crawl may grow exponentially
Choosing initial crawling terms
Music ontology is not interlinked with more popular data
sources
Linked Data principles not followed
Multiple ontologies describing the domain of
interest
The larger the number of data sources in the domain, the
more useful the results will be
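The remark about exponential growth can be made concrete with a worst-case count. The sketch assumes every term probed yields the same number of previously unseen terms, which is a simplification and not a measurement from the tests; `worst_case_terms` is a hypothetical helper.

```python
def worst_case_terms(initial_terms: int, new_per_term: int, stages: int) -> int:
    """Upper bound on terms seen if every term yields `new_per_term` unseen terms."""
    return initial_terms * sum(new_per_term ** level for level in range(stages + 1))


for stages in (1, 2, 3):
    print(stages, worst_case_terms(1, 10, stages))
```

Under this assumption, one initial term and ten new terms per probe give 11, 111 and 1,111 terms after one, two and three stages, which is why the number of stages and the per-term limits are worth setting conservatively.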
22. Conclusions and future work
Improvements
Discovering relationships between resources of two
triplesets described by a third one
Crawling with SPARQL queries
Identifying resources in different languages
Performing simple deductions
23. Conclusions and future work
Improving input
Summarization techniques for automatic input generation
Accepting natural language keywords and converting
them to URI resources
Improving system performance
Caching
Better queries that provide results with fewer requests per
endpoint
Web interface
Open source
Recommender system
24. Acknowledgments
This work was partly supported by:
grants 160326/2012-5, 303332/2013-1
and 57128/2009-9
grants E-26/170028/2008 and E-26/103.070/2011
25. A metadata focused crawler for the Linked Data
Raphael do Vale A. Gomes1, Marco A. Casanova1,
Giseli Rabello Lopes1 and Luiz André P. Paes Leme2
Contact: rgomes@inf.puc-rio.br