Introduction
Problem and
Motivation
Approach
Resource
Instance and
Type Extraction
Resource
Sampling
Approaches
Constructing
profiles:
Dataset-topic
graph
Topic Ranking
Approaches
Experimental
Setup
Baselines
Evaluation
Results
Efficiency of
Dataset Profiling
Scalability of
Dataset Profiling
Conclusions
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles
Besnik Fetahu¹, Stefan Dietze¹, Bernardo Pereira Nunes², Marco Antonio Casanova², Davide Taibi³, Wolfgang Nejdl¹
¹ L3S Research Center, Leibniz Universität Hannover
² Department of Informatics - PUC-Rio
³ Institute for Educational Technologies, CNR
May 29, 2014
1 Introduction
2 Problem and Motivation
3 Approach
Resource Instance and Type Extraction
Resource Sampling Approaches
Constructing profiles: Dataset-topic graph
Topic Ranking Approaches
4 Experimental Setup
Baselines
5 Evaluation Results
Efficiency of Dataset Profiling
Scalability of Dataset Profiling
6 Conclusions
Introduction
• Increasing amount of Web Data
• Data heterogeneity: representation, language, quality and
domains
• Sparsely connected datasets
• Lack of descriptive metadata about datasets
• Exhaustive techniques for data analysis
• Efficiency heavily dependent on information need
• Ease of access and representation of datasets
Why dataset profiling?
• Growing number of datasets: 227 datasets
• Data represented as triples: 31 billion triples
• Multi-lingual content: 18 languages
• Broad set of topics covered
• Inter-dataset links
Domain           #Datasets   #Triples
Media            25          1,841,852,061
Geographic       31          6,145,532,484
Government       49          13,315,009,400
Publications     87          2,950,720,693
Cross-domain     41          4,184,635,715
Life sciences    41          3,036,336,004
User-generated   20          134,127,413
Why dataset profiling?
Find datasets covering the domain of “Renewable Energy”?
• Sparsity: Which datasets cover the topic?
  • Only 38 out of 228 datasets provide topic coverage information.
• Scalability: Use a SPARQL FILTER clause?
  • A regex(*) filter has to check every triple to find those that contain a specific keyword.
• Ambiguity: What are all the possible forms of renewable energy?
  • solar energy, wind energy, geothermal, ...
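The keyword-based alternative can be made concrete with a minimal sketch. The helper name `keyword_filter_query`, the query shape and the keyword list are assumptions for illustration and do not come from the slides. The sketch only builds query strings and shows that every surface form of the topic needs its own regex scan over all literals.

```python
def keyword_filter_query(keyword: str, limit: int = 100) -> str:
    """Build a SPARQL query matching literals that contain `keyword` (hypothetical helper)."""
    # Escape backslashes and double quotes so the keyword stays inside the string literal.
    # Regex metacharacters are NOT escaped here; the keyword is used as a regex pattern.
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT DISTINCT ?s ?p ?o WHERE {\n"
        "  ?s ?p ?o .\n"
        '  FILTER(isLiteral(?o) && regex(str(?o), "%s", "i"))\n'
        "}\n"
        "LIMIT %d" % (escaped, limit)
    )


# Illustrative surface forms only; a real list would be much longer.
surface_forms = ["renewable energy", "solar energy", "wind energy", "geothermal"]
queries = [keyword_filter_query(form) for form in surface_forms]
print(queries[1])
```

Each query forces the endpoint to test the regular expression against every literal, which is the cost that structured topic profiles are meant to avoid.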
Profiling Overview
1 Metadata extraction
2 Resource sampling
3 Entity/topic extraction
4 Profile graphs
5 Profiles representation
Dataset Profiling Example
Resource Instance and Type Extraction
• Simple SPARQL SELECT queries
• Avg. indexing time per dataset: 7 min for 10% of resources vs. 4 hrs for 100%
• Approximately 300 million resource instances
[Figure: log-scale indexing time per dataset when indexing 100% vs. 10% of resources; datasets shown include uriburner, bluk-bnb, bio2rdf-kegg-pathway, l3s-dblp and hellenic-fire-brigade.]
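As a rough sketch of what "simple SPARQL SELECT queries" could look like for this step, the snippet below builds one query for the resource types and one paged query for the instances of a type. The function names, the paging scheme and the example type URI are assumptions; the slides do not show the actual queries.

```python
import math


def type_query() -> str:
    """Query listing the distinct resource types of a dataset (assumed shape)."""
    return "SELECT DISTINCT ?type WHERE { ?resource a ?type }"


def instance_query(type_uri: str, limit: int, offset: int = 0) -> str:
    """Paged query listing the instances of one resource type (assumed shape)."""
    return (
        "SELECT DISTINCT ?resource WHERE { ?resource a <%s> } "
        "LIMIT %d OFFSET %d" % (type_uri, limit, offset)
    )


def sample_size(total_instances: int, fraction: float) -> int:
    """Number of instances to index for a given sample fraction, e.g. 0.10."""
    return math.ceil(total_instances * fraction)


# Hypothetical type URI and instance count, used only to show the calls.
n = sample_size(20145, 0.10)
print(instance_query("http://example.org/Paper", limit=n))
```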
Resource Sampling Approaches
Entity and Topic Extraction
Resource Sampling
• random: randomly select a resource instance for analysis
• weighted: weigh a resource by the number of datatype
properties used to describe it: $w_k = |f(r_k)| / \max_j |f(r_j)|$
• centrality: weigh a resource by the number of types
used to describe it: $c_k = |C_k| / |C|$
Topic Extraction
• Treat resources as documents by combining all their textual literals
• Perform named entity disambiguation (NED)¹ and extract the corresponding DBpedia entities
• Extract topics as DBpedia categories from entities via
dcterms:subject
¹ DBpedia Spotlight
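The two weighting schemes can be sketched as follows. The input dictionaries and the function names are hypothetical. $|C|$ is assumed to be the number of distinct types in the dataset, and the key-based weighted draw at the end is just one possible way to turn the weights into a sample, not necessarily the one used in the talk.

```python
import random


def weighted_scores(datatype_property_counts):
    """w_k = |f(r_k)| / max_j |f(r_j)|, from per-resource datatype property counts."""
    max_count = max(datatype_property_counts.values())
    return {r: c / max_count for r, c in datatype_property_counts.items()}


def centrality_scores(resource_types):
    """c_k = |C_k| / |C|, with |C| assumed to be all distinct types in the dataset."""
    all_types = set().union(*resource_types.values())
    return {r: len(types) / len(all_types) for r, types in resource_types.items()}


def sample_resources(weights, fraction, seed=0):
    """Weighted sampling without replacement via random keys u ** (1 / w)."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(weights)))
    # Resources with zero weight are never drawn in this sketch.
    keys = {r: rng.random() ** (1.0 / w) for r, w in weights.items() if w > 0}
    return sorted(keys, key=keys.get, reverse=True)[:k]


# Toy data: four resources with made-up property counts and types.
counts = {"r1": 4, "r2": 2, "r3": 1, "r4": 4}
types = {"r1": {"A", "B"}, "r2": {"A"}, "r3": {"C"}, "r4": {"A", "B", "C"}}
w = weighted_scores(counts)
c = centrality_scores(types)
picked = sample_resources(w, fraction=0.5)
```

Random sampling corresponds to giving every resource the same weight.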
Constructing profiles: Dataset-topic graph
1 Profile graph nodes: datasets, resources, topics
2 Weighted graph edges: $\langle D, t \rangle$
3 Edge weights are dataset-specific: $\langle D_i, t \rangle \neq \langle D_j, t \rangle$
4 Compute $\langle D_i, t \rangle$ by assessing the importance of t given the resources of $D_i$ as prior knowledge
5 The given prior knowledge biases the importance of t in the profile graph towards $D_i$ ²
6 Incrementally add datasets to the profile graph by simply computing the weights $\langle D_k, t \rangle$
² Scott White and Padhraic Smyth. 2003. Algorithms for estimating relative importance in networks. In 9th ACM SIGKDD (KDD ’03).
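To illustrate how prior knowledge can bias topic importance towards one dataset, here is a minimal sketch of PageRank with priors in the general form described by White and Smyth: rank(v) = (1 - beta) * sum over in-neighbours u of p(v|u) * rank(u) + beta * prior(v). The toy graph, the node names, the value of beta and the choice to put the prior on a dataset's resources are assumptions for illustration, not the authors' implementation.

```python
def undirected(edge_list, weight=1.0):
    """Build a symmetric adjacency dict {node: {neighbour: weight}}."""
    graph = {}
    for u, v in edge_list:
        graph.setdefault(u, {})[v] = weight
        graph.setdefault(v, {})[u] = weight
    return graph


def pagerank_with_priors(graph, prior_nodes, beta=0.3, iterations=100):
    """Iterate the prior-biased PageRank update; the prior is uniform over prior_nodes."""
    nodes = set(graph)
    prior = {n: (1.0 / len(prior_nodes) if n in prior_nodes else 0.0) for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        # With probability beta the walk jumps back to a prior node.
        nxt = {n: beta * prior[n] for n in nodes}
        for u in nodes:
            total = sum(graph[u].values())
            for v, w in graph[u].items():
                nxt[v] += (1 - beta) * rank[u] * w / total
        rank = nxt
    return rank


# Toy profile graph: datasets D1, D2; resources r1..r3; topics tA, tB.
edges = [("D1", "r1"), ("D1", "r2"), ("D2", "r3"),
         ("r1", "tA"), ("r2", "tA"), ("r2", "tB"), ("r3", "tB")]
graph = undirected(edges)
scores_d1 = pagerank_with_priors(graph, {"r1", "r2"})  # prior on D1's resources
scores_d2 = pagerank_with_priors(graph, {"r3"})        # prior on D2's resources
```

In this toy graph the same topic receives a different score depending on which dataset's resources serve as the prior, and adding a new dataset only requires one more run with that dataset's resources as the prior.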
Topic Ranking Approaches
Topic filtering
Topic pre-filtering:
$$\mathrm{NTR}(t, D) = \frac{\Phi(\cdot, D)}{\Phi(t, D)} + \frac{\Phi(\cdot, \cdot)}{\Phi(t, \cdot)}$$
• Filter noisy topics
• $\Phi(t, D)$: number of entities associated with topic t in dataset D; a dot stands for all topics or all datasets
• Closely related to the tf-idf weighting scheme
Topic Ranking
• PageRank with Priors (PRankP)
• HITS with Priors (HITSP)
• K-Step Markov (KStepM)
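The NTR pre-filtering score can be computed directly from entity counts, as in the sketch below. The `counts` dictionary and its numbers are made up, and the marginals Φ(·, D), Φ(t, ·) and Φ(·, ·) are assumed to be plain sums over the per-topic, per-dataset counts, which is a simplification if an entity belongs to several topics.

```python
def ntr(counts, topic, dataset):
    """NTR(t, D) = Phi(., D) / Phi(t, D) + Phi(., .) / Phi(t, .)."""
    phi_t_d = counts[(topic, dataset)]
    # Marginals obtained by summing over the wildcard argument (assumption).
    phi_any_d = sum(n for (t, d), n in counts.items() if d == dataset)
    phi_t_any = sum(n for (t, d), n in counts.items() if t == topic)
    phi_any_any = sum(counts.values())
    return phi_any_d / phi_t_d + phi_any_any / phi_t_any


# Hypothetical entity counts per (topic, dataset) pair.
counts = {
    ("energy", "D1"): 8, ("sports", "D1"): 2,
    ("energy", "D2"): 2, ("sports", "D2"): 8,
}
ntr_energy = ntr(counts, "energy", "D1")  # 10/8 + 20/10
ntr_sports = ntr(counts, "sports", "D1")  # 10/2 + 20/10
```

Like tf-idf, the score combines a within-dataset ratio with a ratio over the whole collection.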
Experimental Setup
Datasets and Ground-truth
• 129 datasets from lod-cloud³
• 6 ground-truth datasets with manually assigned topic
indicators for their resources
Dataset                  Properties                                                          #Resources
yovisto                  skos:subject, dbp:{subject, class, discipline, kategorie, tagline}  62879
oxpoints                 dcterms:subject, dc:subject                                         37258
socialsemweb-thesaurus   skos:subject, tag:associatedTag, dcterms:subject                    2243
semantic-web-dog-food    dcterms:subject, dc:subject                                         20145
lak-dataset              dcterms:subject, dc:subject                                         1691
Evaluation Metrics
• NDCG@k (k = 1, ..., 1000)
• Compare the ranking induced by the graphical models
against the ideal ranking
³ At the time of experimentation, only 129 dataset endpoints were responsive.
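For reference, NDCG@k can be sketched with the common log2 discount. The relevance lists are invented, and the ideal ranking is assumed to be the same relevance values sorted in descending order; the exact gain and discount used in the evaluation are not stated on the slides.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the induced ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of the topics in the order a ranking approach returned them.
perfect = ndcg_at_k([3, 2, 1], k=3)
reversed_order = ndcg_at_k([1, 2, 3], k=3)
```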
Baselines
• tf-idf: Treat resources as documents. For each dataset, extract
the top {50, 100, 150, 200} terms.
• LDA: Treat datasets as documents⁴. Extract the top weighted
topic terms: for every dataset, the top {50, 100, 150, 200} terms,
with the number of topics set to {10, 20, 30, 40, 50}.
⁴ In this case it does not matter whether datasets are considered at the
resource level or aggregated.
Efficiency of Dataset Profiling
[Figure: Profiling accuracy for all topic ranking approaches. x-axis: NDCG rank (1 to 1000); y-axis: NDCG ranking score; series: K-Step Markov + NTR, PageRank with priors + NTR, HITS with priors + NTR, LDA, tf-idf.]
[Figure: K-Step Markov profiling accuracy (Centrality Sampling). x-axis: Sample Size (5 to 100); y-axis: NDCG ranking score; series: KStepM + NTR, KStepM.]
Scalability of Dataset Profiling
[Figure: Time Performance vs. Profiling Accuracy. x-axis: Sample Size (0 to 100); y-axes: log-scale time performance and NDCG ranking score; series: time and ranking for HITS with priors, K-Step Markov and PageRank with priors.]
• Samples of 5% and 10% of resources already provide
stable profiling accuracy
• Avg. 7 min to index 10% of resources per dataset vs.
4 hrs for 100%
• 2 min to rank dataset profiles with 10% of resources
vs. 45 min with 100%
• NED runtime 10% vs. 100%?
Motivation Example Revisited!
Conclusions and Future Work
• Structured dataset profiles
• Scalable approach through sampling
• Efficient profiling through topic filtering and ranking
• Incremental generation of dataset profiles
• Dataset profiles as a set of links (entity and topic links)
• Provenance information of links (e.g. resources from
which an entity is extracted)
• Profiles for dataset recommendation, search, etc.
Resources
• Profiles Endpoint:
http://data-observatory.org/lod-profiles/sparql
• Profiles Webpage:
http://data-observatory.org/lod-profiles/
Thank you! Questions?
#eswc2014Fetahu

A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles

  • 1.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles Besnik Fetahu1 , Stefan Dietze1 , Bernardo Pereira Nunes2 , Marco Antonio Casanova2 , Davide Taibi3 , Wolfgang Nejdl1 1L3S Research Center, Leibniz Universit¨at Hannover 2Department of Informatics - PUC-Rio 3Institute for Educational Technologies, CNR May 29, 2014
  • 2.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions 1 Introduction 2 Problem and Motivation 3 Approach Resource Instance and Type Extraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches 4 Experimental Setup Baselines 5 Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling 6 Conclusions
  • 3.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions 1 Introduction 2 Problem and Motivation 3 Approach Resource Instance and Type Extraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches 4 Experimental Setup Baselines 5 Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling 6 Conclusions
  • 4.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Introduction • Increasing amount of Web Data
  • 5.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Introduction • Increasing amount of Web Data • Data heterogeneity: representation, language, quality and domains
  • 6.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Introduction • Increasing amount of Web Data • Data heterogeneity: representation, language, quality and domains • Sparsely connected datasets
  • 7.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Introduction • Increasing amount of Web Data • Data heterogeneity: representation, language, quality and domains • Sparsely connected datasets • Lack of descriptive metadata about datasets
  • 8.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Introduction • Increasing amount of Web Data • Data heterogeneity: representation, language, quality and domains • Sparsely connected datasets • Lack of descriptive metadata about datasets • Exhaustive techniques for data analysis
  • 9.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Introduction • Increasing amount of Web Data • Data heterogeneity: representation, language, quality and domains • Sparsely connected datasets • Lack of descriptive metadata about datasets • Exhaustive techniques for data analysis • Efficiency heavily dependent on information need
  • 10.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Introduction • Increasing amount of Web Data • Data heterogeneity: representation, language, quality and domains • Sparsely connected datasets • Lack of descriptive metadata about datasets • Exhaustive techniques for data analysis • Efficiency heavily dependent on information need • Ease of access and representation of datasets
  • 11.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions 1 Introduction 2 Problem and Motivation 3 Approach Resource Instance and Type Extraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches 4 Experimental Setup Baselines 5 Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling 6 Conclusions
  • 12.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Why dataset profiling? • Growing number of datasets: 227 datasets • Data represented as triples: 31 billion triples • Multi-lingual content: 18 languages • Broad set of topics covered • Inter-dataset links Domain # Data. Triples Media 25 1,841,852,061 Geographic 31 6,145,532,484 Government 49 13,315,009,400 Publications 87 2,950,720,693 Cross-domain 41 4,184,635,715 Life sciences 41 3,036,336,004 User-generated 20 134,127,413
  • 13.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Why dataset profiling? Find datasets covering the domain of “Renewable Energy”? • Sparsity: Datasets that cover the topic? • 38 out of 228 datasets contain topic coverage information.
  • 14.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Why dataset profiling? Find datasets covering the domain of “Renewable Energy”? • Sparsity: Datasets that cover the topic? • 38 out of 228 datasets contain topic coverage information. • Scalability: Use SPARQL filter clause? • regex(*) filter clause needs to check all triples that contain a specific keyword.
  • 15.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions Why dataset profiling? Find datasets covering the domain of “Renewable Energy”? • Sparsity: Datasets that cover the topic? • 38 out of 228 datasets contain topic coverage information. • Scalability: Use SPARQL filter clause? • regex(*) filter clause needs to check all triples that contain a specific keyword. • Disambiguity: What are all the possible forms of renewable energy? • solar energy, wind energy, geothermal. . .
  • 16.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions 1 Introduction 2 Problem and Motivation 3 Approach Resource Instance and Type Extraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches 4 Experimental Setup Baselines 5 Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling 6 Conclusions
  • 17.
    Profiling Overview
    1 Metadata extraction
    2 Resource sampling
    3 Entity/topic extraction
    4 Profile graphs
    5 Profiles representation
  • 22.
    Dataset Profiling Example
    [Figure: dataset profiling example]
  • 26.
    Resource Instance and Type Extraction
    • Simple SPARQL SELECT queries
    • Average indexing time per dataset: 7 min for 10% of the resources vs. 4 hrs for 100%
    • Approximately 300 million resource instances
    [Figure: log-scale indexing time per dataset (uriburner, bluk-bnb, bio2rdf-kegg-pathway, and others), comparing 100% vs. 10% of resources]
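The "simple SPARQL SELECT queries" mentioned above are not shown on the slide. As a rough sketch only, the Python helpers below build two plausible queries: one that pages through resource instances and their types, and one that fetches the literals of a single resource. The function names, the paging scheme, and the exact query shapes are assumptions made for illustration, not the queries the authors ran.

```python
def instance_type_query(limit: int, offset: int = 0) -> str:
    """Build a paged SELECT query listing resource instances with their types.

    Hypothetical query shape: the slides only say 'simple SPARQL SELECT queries'.
    """
    return (
        "SELECT DISTINCT ?resource ?type WHERE {\n"
        "  ?resource a ?type .\n"
        "}\n"
        "ORDER BY ?resource ?type\n"
        f"LIMIT {limit} OFFSET {offset}"
    )


def literal_query(resource_uri: str) -> str:
    """Build a SELECT query returning the literal values that describe one resource."""
    return (
        "SELECT ?property ?value WHERE {\n"
        f"  <{resource_uri}> ?property ?value .\n"
        "  FILTER(isLiteral(?value))\n"
        "}"
    )


if __name__ == "__main__":
    # Third page of 1000 results, then the literals of an example resource.
    print(instance_type_query(limit=1000, offset=2000))
    print(literal_query("http://example.org/resource/1"))
```

Paging with LIMIT and OFFSET keeps each request small, which matters when an endpoint exposes millions of resources.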
  • 27.
    Resource Sampling Approaches
    Entity and Topic Extraction
    Resource Sampling
    • random: randomly select a resource instance for analysis
    • weighted: weigh a resource by the number of datatype properties used to describe it, $w_k = |f(r_k)| / \max_j |f(r_j)|$
    • centrality: weigh a resource by the number of types used to describe it, $c_k = |C_k| / |C|$
    Topic Extraction
    • Treat resources as documents by combining all their textual literals
    • Perform NED [1] and extract the corresponding DBpedia entities
    • Extract topics as DBpedia categories from the entities via dcterms:subject
    [1] DBpedia Spotlight
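The three sampling strategies can be sketched in a few lines of Python. The two score functions follow the slide's formulas directly. How the scores are turned into an actual sample is not stated on the slide, so `weighted_sample` below uses one common choice (weighted sampling without replacement via random keys) purely as an assumption. All function names and the toy data are hypothetical.

```python
import math
import random


def weighted_scores(datatype_props):
    """w_k = |f(r_k)| / max_j |f(r_j)|, with f(r) the datatype properties of r."""
    largest = max((len(p) for p in datatype_props.values()), default=0) or 1
    return {r: len(p) / largest for r, p in datatype_props.items()}


def centrality_scores(resource_types):
    """c_k = |C_k| / |C|, with C_k the types of r_k and C all types in the dataset."""
    all_types = set().union(*resource_types.values())
    total = len(all_types) or 1
    return {r: len(t) / total for r, t in resource_types.items()}


def random_sample(resources, n, rng):
    """Uniform sampling: every resource is equally likely to be picked."""
    return rng.sample(sorted(resources), n)


def weighted_sample(scores, n, rng):
    """Sample n resources without replacement, favouring high scores.

    Assumption: each resource gets the key log(u) / score with u uniform in
    (0, 1], and the n largest keys win. Zero-score resources are picked last.
    """
    keys = {}
    for resource, score in scores.items():
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        keys[resource] = math.log(u) / score if score > 0 else float("-inf")
    return sorted(keys, key=keys.get, reverse=True)[:n]


if __name__ == "__main__":
    props = {"r1": {"p1", "p2"}, "r2": {"p1", "p2", "p3", "p4"}, "r3": set()}
    types = {"r1": {"A"}, "r2": {"A", "B"}, "r3": {"C", "D"}}
    rng = random.Random(42)
    print(weighted_scores(props))
    print(centrality_scores(types))
    print(weighted_sample(weighted_scores(props), 2, rng))
```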
  • 30.
    Constructing profiles: Dataset-topic graph
    1 Profile graph nodes: datasets, resources, topics
    2 Weighted graph edges: ∆(D, t)
    3 Edge weights: ∆(D_i, t) = ∆(D_j, t)
    4 Compute ∆(D_i, t) by assessing the importance of t given the resources of D_i as prior knowledge
    5 The given prior knowledge biases the importance of t in the profile graph towards D_i [2]
    6 Incrementally add datasets to the profile graph by simply computing the weights ∆(D_k, t)
    [2] Scott White and Padhraic Smyth. 2003. Algorithms for estimating relative importance in networks. In 9th ACM SIGKDD (KDD ’03).
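To make the idea of "importance of t given the resources of D_i as prior knowledge" concrete, here is a minimal Python sketch in the spirit of PageRank with priors from White and Smyth. It is an illustration under stated assumptions, not the authors' implementation: the graph is a small undirected toy graph, the back-probability `beta = 0.3` is an arbitrary choice, and the names `pagerank_with_priors` and `toy_graph` are hypothetical.

```python
def pagerank_with_priors(graph, prior_nodes, beta=0.3, iterations=50):
    """Rank nodes by a random walk that restarts at prior_nodes with probability beta.

    graph: {node: set of neighbours}; every node must appear as a key.
    prior_nodes: the resources of one dataset D_i, used as prior knowledge.
    """
    prior = {v: (1.0 / len(prior_nodes) if v in prior_nodes else 0.0) for v in graph}
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {v: beta * prior[v] for v in graph}  # restart mass
        for u, neighbours in graph.items():
            if not neighbours:
                continue
            share = (1.0 - beta) * rank[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share  # walk mass spread evenly over neighbours
        rank = nxt
    return rank


# Toy profile graph: datasets D1, D2; resources r1..r3; topics tA, tB.
toy_graph = {
    "D1": {"r1", "r2"}, "D2": {"r3"},
    "r1": {"D1", "tA"}, "r2": {"D1", "tA", "tB"}, "r3": {"D2", "tB"},
    "tA": {"r1", "r2"}, "tB": {"r2", "r3"},
}

if __name__ == "__main__":
    for dataset, resources in {"D1": {"r1", "r2"}, "D2": {"r3"}}.items():
        rank = pagerank_with_priors(toy_graph, resources)
        print(dataset, {t: round(rank[t], 3) for t in ("tA", "tB")})
```

With the resources of D1 as prior, topic tA outranks tB, and with the resources of D2 the order flips. Adding a new dataset only requires adding its nodes and running the walk with that dataset's resources as the prior, which matches the incremental step in item 6.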
  • 31.
    Topic Ranking Approaches
    Topic filtering
    Topic pre-filtering: $NTR(t, D) = \frac{\Phi(\cdot, D)}{\Phi(t, D)} + \frac{\Phi(\cdot, \cdot)}{\Phi(t, \cdot)}$
    • Filters noisy topics
    • $\Phi(t, D)$: number of entities associated with topic t in dataset D; a dot stands for all topics or all datasets
    • Closely related to the tf-idf weighting scheme
    Topic Ranking
    • PageRank with Priors (PRankP)
    • HITS with Priors (HITSP)
    • K-Step Markov (KStepM)
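A small worked example may help with the NTR score. The Python sketch below reads the formula as two ratios: entities in the dataset over entities for the topic in that dataset, plus all entities over entities for the topic across datasets. That reading, the function name `ntr_scores`, and the toy counts are assumptions. The slide does not give a filtering threshold, so none is applied here.

```python
from collections import Counter


def ntr_scores(counts):
    """Compute NTR(t, D) for every (topic, dataset) pair.

    counts: {dataset: {topic: number of entities linked to that topic}},
    with every stored count greater than zero.
    """
    per_dataset = {d: sum(c.values()) for d, c in counts.items()}  # Phi(., D)
    grand_total = sum(per_dataset.values())                        # Phi(., .)
    per_topic = Counter()                                          # Phi(t, .)
    for c in counts.values():
        per_topic.update(c)
    return {
        (t, d): per_dataset[d] / n + grand_total / per_topic[t]
        for d, c in counts.items()
        for t, n in c.items()
    }


if __name__ == "__main__":
    toy_counts = {
        "D1": {"Energy": 8, "Music": 2},
        "D2": {"Energy": 2, "Music": 8},
    }
    scores = ntr_scores(toy_counts)
    print(scores[("Energy", "D1")])  # 10/8 + 20/10 = 3.25
    print(scores[("Music", "D1")])   # 10/2 + 20/10 = 7.0
```

Under this reading, a topic backed by few entities in a dataset receives a larger score, which is the kind of topic a pre-filter would be expected to treat as noise.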
  • 33.
    Experimental Setup
    Datasets and Ground-truth
    • 129 datasets from lod-cloud [3]
    • 6 ground-truth datasets with manually assigned topic indicators for their resources
    Dataset | Properties | #Resources
    yovisto | skos:subject, dbp:{subject, class, discipline, kategorie, tagline} | 62879
    oxpoints | dcterms:subject, dc:subject | 37258
    socialsemweb-thesaurus | skos:subject, tag:associatedTag, dcterms:subject | 2243
    semantic-web-dog-food | dcterms:subject, dc:subject | 20145
    lak-dataset | dcterms:subject, dc:subject | 1691
    Evaluation Metrics
    • NDCG@k (k = 1, ..., 1000)
    • Compare the ranking induced by the graphical models against the ideal ranking
    [3] At the time of experimentation only 129 dataset endpoints were responsive.
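For readers unfamiliar with NDCG@k, the following Python sketch shows one standard way to compute it. The slide does not say which DCG variant was used, so the linear gain with a log2 position discount is an assumption, as are the function names. The ideal ranking is derived by sorting the same relevance list, which presumes that all relevant topics appear somewhere in the ranking.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain, log2 discount)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg(ranked_relevances, k):
    """NDCG@k: DCG of the produced ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Relevance of each ranked topic according to the ground truth, in rank order.
    produced = [1, 3, 0, 2]
    for k in (1, 2, 4):
        print(k, round(ndcg(produced, k), 3))
```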
  • 34.
    Baselines
    • tf-idf: Consider resources as documents. For each dataset, extract the top {50, 100, 150, 200} terms.
    • LDA: Consider datasets as documents [4]. Extract the top weighted topic terms: for every dataset, extract the top {50, 100, 150, 200} terms with the number of topics set to {10, 20, 30, 40, 50}.
    [4] In this case it does not matter if datasets are considered at the resource level or aggregated.
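The tf-idf baseline can be illustrated with a short, dependency-free Python sketch. It treats each resource as a document, as the slide says. The remaining details are assumptions chosen for brevity: whitespace tokenization, length-normalized term frequency, a plain logarithmic idf, and summing a term's tf-idf over the resources of a dataset. The function name `top_tfidf_terms` is hypothetical.

```python
import math
from collections import Counter


def top_tfidf_terms(datasets, k):
    """Return the k highest-scoring terms per dataset.

    datasets: {dataset name: [text of resource 1, text of resource 2, ...]}
    """
    # One bag of words per resource, tagged with its dataset.
    docs = [(d, Counter(text.lower().split()))
            for d, texts in datasets.items() for text in texts]
    n_docs = len(docs)
    doc_freq = Counter()
    for _, tf in docs:
        doc_freq.update(tf.keys())  # count each term once per resource
    scores = {d: Counter() for d in datasets}
    for d, tf in docs:
        length = sum(tf.values())
        for term, count in tf.items():
            idf = math.log(n_docs / doc_freq[term])
            scores[d][term] += (count / length) * idf
    return {d: [t for t, _ in s.most_common(k)] for d, s in scores.items()}


if __name__ == "__main__":
    toy = {
        "energy": ["solar energy", "wind energy", "energy data"],
        "music": ["jazz album", "rock album", "album data"],
    }
    print(top_tfidf_terms(toy, k=2))
```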
  • 35.
    Introduction Problem and Motivation Approach Resource Instance and TypeExtraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions 1 Introduction 2 Problem and Motivation 3 Approach Resource Instance and Type Extraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches 4 Experimental Setup Baselines 5 Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling 6 Conclusions
  • 36.
    Efficiency of Dataset Profiling
    [Figure: Profiling accuracy for all topic ranking approaches. x-axis: NDCG rank (1 to 1000); y-axis: NDCG ranking score. Series: K-Step Markov + NTR, PageRank with priors + NTR, HITS with priors + NTR, LDA, tf-idf]
    [Figure: K-Step Markov profiling accuracy (Centrality Sampling). x-axis: Sample Size (5 to 100). Series: KStepM + NTR, KStepM]
  • 37.
    Scalability of Dataset Profiling
    [Figure: Time Performance vs. Profiling Accuracy. x-axis: Sample Size (0 to 100); y-axes: log-scale time performance and NDCG ranking score. Series: time and ranking for HITS with priors, K-Step Markov, and PageRank with priors]
    • Sample sizes of 5% and 10% already provide stable profiling accuracy
    • Avg. 7 mins per dataset for indexing 10% of the resources vs. 4 hrs per dataset for 100%
    • 2 mins for ranking dataset profiles with 10% of the resources vs. 45 mins with 100%
    • NED runtime 10% vs. 100%?
  • 38.
    Motivation Example Revisited!
  • 39.
    Conclusions and Future Work
    • Structured dataset profiles
    • Scalable approach through sampling
    • Efficient profiling through topic filtering and ranking
    • Incremental generation of dataset profiles
    • Dataset profiles as a set of links (entity and topic links)
    • Provenance information of links (e.g. resources from which an entity is extracted)
    • Profiles for dataset recommendation, search, etc.
    Resources
    • Profiles Endpoint: http://data-observatory.org/lod-profiles/sparql
    • Profiles Webpage: http://data-observatory.org/lod-profiles/
  • 40.
    Thank you! Questions? #eswc2014Fetahu