A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles

The increasing adoption of Linked Data principles has led to an abundance of datasets on the Web. However, take-up and reuse are hindered by the lack of descriptive information about the nature of the data, such as their topic coverage, dynamics or evolution. To address this issue, we propose an approach for creating linked dataset profiles. A profile consists of structured dataset metadata describing topics and their relevance. Profiles are generated through the configuration of techniques for resource sampling from datasets, topic extraction from reference datasets and their ranking based on graphical models. To enable a good trade-off between scalability and accuracy of generated profiles, appropriate parameters are determined experimentally. Our evaluation considers topic profiles for all accessible datasets from the Linked Open Data cloud. The results show that our approach generates accurate profiles even with comparatively small sample sizes (10%) and outperforms established topic modelling approaches.

A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles Presentation Transcript

  • 1. Introduction Problem and Motivation Approach Resource Instance and Type Extraction Resource Sampling Approaches Constructing profiles: Dataset-topic graph Topic Ranking Approaches Experimental Setup Baselines Evaluation Results Efficiency of Dataset Profiling Scalability of Dataset Profiling Conclusions
    A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles
    Besnik Fetahu¹, Stefan Dietze¹, Bernardo Pereira Nunes², Marco Antonio Casanova², Davide Taibi³, Wolfgang Nejdl¹
    ¹ L3S Research Center, Leibniz Universität Hannover
    ² Department of Informatics, PUC-Rio
    ³ Institute for Educational Technologies, CNR
    May 29, 2014
  • 2. 1 Introduction
    2 Problem and Motivation
    3 Approach
      Resource Instance and Type Extraction
      Resource Sampling Approaches
      Constructing profiles: Dataset-topic graph
      Topic Ranking Approaches
    4 Experimental Setup
      Baselines
    5 Evaluation Results
      Efficiency of Dataset Profiling
      Scalability of Dataset Profiling
    6 Conclusions
  • 10. Introduction
    • Increasing amount of Web Data
    • Data heterogeneity: representation, language, quality and domains
    • Sparsely connected datasets
    • Lack of descriptive metadata about datasets
    • Exhaustive techniques for data analysis
    • Efficiency heavily dependent on information need
    • Ease of access and representation of datasets
  • 12. Why dataset profiling?
    • Growing number of datasets: 227 datasets
    • Data represented as triples: 31 billion triples
    • Multi-lingual content: 18 languages
    • Broad set of topics covered
    • Inter-dataset links

    Domain          # Datasets  Triples
    Media           25          1,841,852,061
    Geographic      31          6,145,532,484
    Government      49          13,315,009,400
    Publications    87          2,950,720,693
    Cross-domain    41          4,184,635,715
    Life sciences   41          3,036,336,004
    User-generated  20          134,127,413
  • 15. Why dataset profiling?
    Find datasets covering the domain of “Renewable Energy”?
    • Sparsity: Datasets that cover the topic?
      • 38 out of 228 datasets contain topic coverage information.
    • Scalability: Use SPARQL filter clause?
      • A regex(*) filter clause has to check every triple for the specific keyword.
    • Disambiguation: What are all the possible forms of renewable energy?
      • solar energy, wind energy, geothermal, ...
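The scalability concern with a SPARQL regex filter can be illustrated with a small sketch. The helper name `build_keyword_query` and the query shape are assumptions made for illustration, not taken from the slides. Because the triple pattern `?s ?p ?o` is unbounded, an endpoint has to evaluate the filter against every literal, and one such query is needed per surface form ("solar energy", "wind energy", and so on).

```python
def build_keyword_query(keyword: str, limit: int = 100) -> str:
    """Build a SPARQL query that scans all literals for a keyword (hypothetical helper)."""
    # Escape backslashes and double quotes so the keyword stays inside the string literal.
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT DISTINCT ?s WHERE {\n"
        "  ?s ?p ?o .\n"
        f'  FILTER(isLiteral(?o) && regex(str(?o), "{escaped}", "i"))\n'
        "}\n"
        f"LIMIT {limit}"
    )

query = build_keyword_query("renewable energy")
print(query)
```

The keyword is passed to `regex` as a pattern, so regular-expression metacharacters in it would need additional escaping; this sketch only protects the string literal.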
  • 21. Profiling Overview
    1 Metadata extraction
    2 Resource sampling
    3 Entity/topic extraction
    4 Profile graphs
    5 Profiles representation
  • 22. Dataset Profiling Example
  • 26. Resource Instance and Type Extraction
    • Simple SPARQL SELECT queries
    • Avg. indexing time: 7 min for 10% of resources vs. 4 hrs for 100%
    • Approximately 300 million resource instances
    [Figure: log-scale indexing time per dataset, 100% vs. 10% of resources]
  • 29. Resource Sampling Approaches, Entity and Topic Extraction
    Resource Sampling
    • random: randomly select a resource instance for analysis
    • weighted: weigh a resource by the number of datatype properties used to describe it: $w_k = |f(r_k)| / \max\{|f(r_j)|\}$
    • centrality: weigh a resource by the number of types used to describe it: $c_k = |C_k| / |C|$
    Topic Extraction
    • Resources as documents by combining all textual literals
    • Perform NED¹ (named entity disambiguation) and extract corresponding DBpedia entities
    • Extract topics as DBpedia categories from entities via dcterms:subject
    ¹ DBpedia Spotlight
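The two sampling weights can be sketched in a few lines. The function names, the toy resources, and in particular the way the weights are turned into a sample (`biased_sample`, which uses Efraimidis-Spirakis keys) are assumptions for illustration; the slide only defines the weights themselves.

```python
import math
import random

def weighted_scores(datatype_props):
    # w_k = |f(r_k)| / max_j |f(r_j)|; f(r) is taken to be the set of datatype properties of r.
    max_count = max(len(props) for props in datatype_props.values()) or 1
    return {r: len(props) / max_count for r, props in datatype_props.items()}

def centrality_scores(resource_types, all_types):
    # c_k = |C_k| / |C|; C_k = types of r_k, C = all types in the dataset.
    return {r: len(types) / len(all_types) for r, types in resource_types.items()}

def biased_sample(scores, fraction, seed=0):
    # Assumed sampling step: draw a fraction of resources without replacement.
    rng = random.Random(seed)
    k = max(1, math.ceil(fraction * len(scores)))
    # Efraimidis-Spirakis keys u ** (1 / w): higher scores tend to get larger keys.
    keys = {r: rng.random() ** (1.0 / max(s, 1e-9)) for r, s in scores.items()}
    return sorted(keys, key=keys.get, reverse=True)[:k]

# Toy dataset (hypothetical resources and vocabulary terms).
datatype_props = {
    "r1": {"rdfs:label", "dcterms:title", "dcterms:abstract"},
    "r2": {"rdfs:label"},
    "r3": set(),
}
resource_types = {
    "r1": {"foaf:Document"},
    "r2": {"foaf:Document", "bibo:Article"},
    "r3": {"foaf:Person"},
}
all_types = {"foaf:Document", "bibo:Article", "foaf:Person"}

w = weighted_scores(datatype_props)
c = centrality_scores(resource_types, all_types)
sampled = biased_sample(w, fraction=0.5)
print(w, c, sampled)
```

Random sampling corresponds to giving every resource the same score.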
  • 30. Constructing profiles: Dataset-topic graph
    1 Profile graph nodes: datasets, resources, topics
    2 Weighted graph edges: $\Delta(D, t)$
    3 Edge weights: $\Delta(D_i, t) = \Delta(D_j, t)$
    4 Compute $\Delta(D_i, t)$ by assessing the importance of t given the resources of $D_i$ as prior knowledge
    5 The given prior knowledge biases the importance of t in the profile graph towards $D_i$ ²
    6 Incrementally add datasets in the profile graph, by simply computing the weights $\Delta(D_k, t)$
    ² Scott White and Padhraic Smyth. 2003. Algorithms for estimating relative importance in networks. In 9th ACM SIGKDD (KDD ’03).
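One way to read steps 4 and 5 is as a random walk that keeps restarting at the resources of one dataset. The sketch below is a plain power-iteration version of PageRank with priors on an undirected toy graph; the graph, the restart probability `beta`, and the node names are assumptions for illustration and not the authors' configuration.

```python
def pagerank_with_priors(edges, priors, beta=0.3, iterations=50):
    """Random walk with restart; `priors` must name nodes that appear in `edges`."""
    # Build undirected adjacency lists.
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    total = sum(priors.values())
    prior = {n: priors.get(n, 0.0) / total for n in neighbours}
    rank = dict(prior)
    for _ in range(iterations):
        # With probability beta the walk jumps back to the prior nodes.
        nxt = {n: beta * prior[n] for n in neighbours}
        for u, nbrs in neighbours.items():
            share = (1.0 - beta) * rank[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        rank = nxt
    return rank

# Toy profile graph: datasets D1 and D2, resources r1..r3, topics t_*.
edges = [
    ("D1", "r1"), ("D1", "r2"), ("D2", "r3"),
    ("r1", "t_energy"), ("r2", "t_energy"), ("r2", "t_wind"),
    ("r3", "t_music"), ("r3", "t_energy"),
]
# Prior knowledge: the resources of D1.
rank_d1 = pagerank_with_priors(edges, priors={"r1": 1.0, "r2": 1.0})
topics = sorted((n for n in rank_d1 if n.startswith("t_")), key=rank_d1.get, reverse=True)
print(topics)
```

Under this reading, the score of a topic node after the walk plays the role of the weight for the dataset whose resources were used as priors, and adding another dataset only requires one more run with that dataset's resources as priors.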
  • 31. Topic Ranking Approaches
    Topic filtering
    Topic pre-filtering: $NTR(t, D) = \frac{\Phi(\cdot, D)}{\Phi(t, D)} + \frac{\Phi(\cdot, \cdot)}{\Phi(t, \cdot)}$
    • Filter noisy topics
    • $\Phi(\cdot, \cdot)$: number of entities associated with topic t
    • Closely related to the tf-idf weighting scheme
    Topic Ranking
    • PageRank with Priors (PRankP)
    • HITS with Priors (HITSP)
    • K-Step Markov (KStepM)
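A small sketch of the NTR score, reading it as the sum of two ratios, Φ(·, D)/Φ(t, D) + Φ(·, ·)/Φ(t, ·). The interpretation of the dots is an assumption: Φ(t, D) counts the entities of dataset D associated with topic t, and a dot means the count is taken over all topics or all datasets. The input layout and names are hypothetical.

```python
from collections import Counter

def ntr_scores(entities_by_dataset):
    """entities_by_dataset: {dataset: {entity: set_of_topics}} (assumed layout)."""
    phi_td = Counter()  # Phi(t, D): entities of D associated with t
    phi_d = Counter()   # Phi(., D): entities of D
    phi_t = Counter()   # Phi(t, .): entities associated with t in any dataset
    for dataset, entities in entities_by_dataset.items():
        phi_d[dataset] = len(entities)
        for topic_set in entities.values():
            for topic in topic_set:
                phi_td[(topic, dataset)] += 1
                phi_t[topic] += 1
    phi_all = sum(phi_d.values())  # Phi(., .): all entities
    return {
        (topic, dataset): phi_d[dataset] / count + phi_all / phi_t[topic]
        for (topic, dataset), count in phi_td.items()
    }

toy = {
    "D1": {"e1": {"energy"}, "e2": {"energy", "wind"}},
    "D2": {"e3": {"music"}, "e4": {"energy"}},
}
scores = ntr_scores(toy)
print(scores)
```

Under this reading, a topic attached to few entities receives a larger value than a topic attached to many. The threshold used for filtering is not given on the slide.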
  • 33. Experimental Setup
    Datasets and Ground-truth
    • 129 datasets from lod-cloud³
    • 6 ground-truth datasets with manually assigned topic indicators for their resources

    Dataset                 Properties                                                          #Resources
    yovisto                 skos:subject, dbp:{subject, class, discipline, kategorie, tagline}  62879
    oxpoints                dcterms:subject, dc:subject                                         37258
    socialsemweb-thesaurus  skos:subject, tag:associatedTag, dcterms:subject                    2243
    semantic-web-dog-food   dcterms:subject, dc:subject                                         20145
    lak-dataset             dcterms:subject, dc:subject                                         1691

    Evaluation Metrics
    • NDCG@k (k=1, ..., 1000)
    • Compare the ranking induced by the graphical models against the ideal ranking
    ³ At the time of experimentation only 129 dataset endpoints were responsive.
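For reference, a minimal sketch of NDCG@k. It uses linear gain and a log2 discount, and derives the ideal ranking by sorting the relevance values of the ranked list itself; these are common choices and are assumptions here, since the slide does not state the exact formulation used.

```python
import math

def dcg_at_k(relevances, k):
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    # Ideal ranking: the same relevance values in descending order.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the topics in the order a ranking approach returned them.
perfect = ndcg_at_k([3, 2, 1], k=3)
shuffled = ndcg_at_k([1, 3, 2], k=3)
print(perfect, shuffled)
```

A ranking that already matches the ideal order scores 1.0, and any other order of the same items scores lower.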
  • 34. Baselines
    • tf-idf: Consider resources as documents. Extract for each dataset the top {50, 100, 150, 200} terms.
    • LDA: Consider datasets as documents⁴. Extract the top weighted topic terms. For every dataset, extract the top {50, 100, 150, 200} terms, with the number of topics in {10, 20, 30, 40, 50}.
    ⁴ In this case it does not matter if datasets are considered at the resource level or aggregated.
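The tf-idf baseline can be sketched as follows: each resource is one document, and the terms of a dataset are ranked by their summed tf-idf score. Summing over resources, the tokenisation, and the function name are assumptions for illustration; the slide only states that resources are treated as documents and the top terms are extracted.

```python
import math
from collections import Counter

def top_tfidf_terms(documents, top_n):
    """documents: list of token lists, one per resource. Returns the top_n terms."""
    n_docs = len(documents)
    # Document frequency: in how many resources each term occurs.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = Counter()
    for doc in documents:
        for term, count in Counter(doc).items():
            # Term frequency times inverse document frequency, summed over resources.
            scores[term] += (count / len(doc)) * math.log(n_docs / df[term])
    return [term for term, _ in scores.most_common(top_n)]

resources = [
    ["wind", "energy", "the"],
    ["solar", "energy", "the"],
    ["the", "music"],
]
print(top_tfidf_terms(resources, top_n=4))
```

Terms that occur in every resource get an idf of zero and drop to the bottom of the ranking.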
  • 36. Efficiency of Dataset Profiling
    [Figure: Profiling accuracy for all topic ranking approaches. x-axis: NDCG rank (1 to 1000); y-axis: NDCG ranking score. Series: K-Step Markov + NTR, PageRank with priors + NTR, HITS with priors + NTR, LDA, tf-idf.]
    [Figure: K-Step Markov profiling accuracy (Centrality Sampling). x-axis: Sample Size (5 to 100). Series: KStepM + NTR, KStepM.]
  • 37. Scalability of Dataset Profiling
    [Figure: Time Performance vs. Profiling Accuracy. x-axis: Sample Size; y-axes: log-scale time performance and NDCG ranking score. Series: time and ranking for HITS with priors, K-Step Markov and PageRank with priors.]
    • Sample sizes of 5% and 10% already provide stable profiling accuracy
    • Avg. 7 mins per dataset for indexing 10% of resources vs. 4 hrs for 100%
    • 2 mins for ranking dataset profiles with 10% of resources vs. 45 mins for 100%
    • NED runtime 10% vs. 100%?
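The reported timings translate into speed-up factors with simple arithmetic. The variable names below are illustrative; the figures are the averages quoted on the slide.

```python
# Average indexing time per dataset, in minutes.
indexing_full_min = 4 * 60   # 100% of resources: 4 hrs
indexing_sample_min = 7      # 10% of resources: 7 mins

# Ranking time for dataset profiles, in minutes.
ranking_full_min = 45        # 100% of resources
ranking_sample_min = 2       # 10% of resources

indexing_speedup = indexing_full_min / indexing_sample_min
ranking_speedup = ranking_full_min / ranking_sample_min
print(f"indexing: {indexing_speedup:.1f}x, ranking: {ranking_speedup:.1f}x")
# prints "indexing: 34.3x, ranking: 22.5x"
```

Both factors are larger than the 10x reduction in sample size alone would suggest.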
  • 38. Motivation Example Revisited!
  • 39. Conclusions and Future Work
    • Structured dataset profiles
    • Scalable approach through sampling
    • Efficient profiling through topic filtering and ranking
    • Incremental generation of dataset profiles
    • Dataset profiles as a set of links (entity and topic links)
    • Provenance information of links (e.g. resources from which an entity is extracted)
    • Profiles for dataset recommendation, search, etc.
    Resources
    • Profiles Endpoint: http://data-observatory.org/lod-profiles/sparql
    • Profiles Webpage: http://data-observatory.org/lod-profiles/
  • 40. Thank you! Questions? #eswc2014Fetahu