Retrieval, Crawling and Fusion of
Entity-centric Data on the Web
Stefan Dietze
L3S Research Center, Hannover, Germany

Keynote at 2nd KEYSTONE conference (IKC2016) 9 September, 2016

  1. Retrieval, Crawling and Fusion of Entity-centric Data on the Web
     Stefan Dietze, L3S Research Center, Hannover, Germany
     Keynote at 2nd International Keystone Conference (IKC2016), 09/09/16
  2. Research areas
     • Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation
     • Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility
     Some projects
     L3S Research Center, see also: http://www.l3s.de
  3. Acknowledgements: team
     • Pavlos Fafalios (L3S)
     • Besnik Fetahu (L3S)
     • Ujwal Gadiraju (L3S)
     • Eelco Herder (L3S)
     • Ivana Marenzi (L3S)
     • Ran Yu (L3S)
     • Pracheta Sahoo (L3S, IIT India)
     • Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
     • Mathieu d'Aquin (The Open University, UK)
     • Mohamed Ben Ellefi (LIRMM, France)
     • Davide Taibi (CNR, Italy)
     • Konstantin Todorov (LIRMM, France)
     • ...
  4. Structured (linked) data on the Web: state of affairs
     Figure: SPARQL endpoint availability over time [Buil-Aranda et al 2013]
     Accessibility of datasets?
     • Less than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013]
     • "THE" SPARQL protocol? No, but many variants & subsets
     Semantics, links, quality?
     • ...data accuracy (e.g. DBpedia)? [Paulheim2013]
     • ...vocabulary reuse? [D'AquinWebSci13]
     • ...schema compliance (RDFS, schemas)? [HoganJWS2012]
     References
     • [D'AquinWebSci13] D'Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
     • [Paulheim2013] Paulheim, H., Bizer, C., Type Inference on Noisy RDF Data, The Semantic Web – ISWC 2013, Lecture Notes in Computer Science, Volume 8218, 2013, pp. 510-525.
     • [HoganJWS2012] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., An empirical survey of Linked Data conformance, Journal of Web Semantics 14, 2012.
     • [Buil-Aranda2013] Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y., SPARQL Web-Querying Infrastructure: Ready for Action?, International Semantic Web Conference 2013 (ISWC2013).
  5. Data quality and consistency
     Yuan, W., Demidova, E., Dietze, S., Zhu, X., Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study, International Semantic Web Conference 2014 (ISWC2014).
  6. Challenge for search/retrieval – heterogeneity of datasets & entities
     Discovery of suitable (1) datasets & (2) entities matching:
     • Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality?
     • Topics/scope? Datasets/entities useful & trustworthy for topic XY?
     • Types? Datasets/entities about statistics, organisations, videos, slides, publications etc.?
  7. Overview
     I – Challenges
     II – Enabling discovery & search in Linked Data & Knowledge Graphs
        • Dataset recommendation
        • Dataset profiling
        • Entity retrieval
     III – Beyond Linked Data – exploiting embedded Web semantics
        • Web markup as emerging data source
        • Case studies
        • Data fusion for entity reconciliation (and retrieval)
     IV – Wrap-up
     Dealing with diversity and heterogeneity
  8. Overview
     I – Challenges
     II – Enabling discovery & search in Linked Data & Knowledge Graphs
        • Dataset recommendation
        • Dataset profiling
        • Entity retrieval
     III – Beyond Linked Data – exploiting embedded Web semantics
        • Web markup as emerging data source
        • Case studies
        • Data fusion for entity reconciliation (and retrieval)
     IV – Wrap-up
     Dealing with diversity and heterogeneity
     Other emerging forms of structured data on the Web?
  9. Dataset recommendation I
     Goal: finding candidate datasets, e.g. for entity retrieval or interlinking tasks (e.g. enrichment)
     Approach
     • Given dataset s, ranking datasets from D according to probability score (di, t) to contain linking candidates (entities)
     • Features:
        o Approach 1: vocabulary overlap
        o Approach 2: existing links (SNA)
     • Linking candidates likely if datasets share common (a) schema elements, or (b) links (friend of a friend)
     Conclusions
     • Roughly 50% MAP for both approaches
     • Simplistic approach (!)
     Figure: source dataset S with Linkset1 and Linkset2; example ranking: 1 DBLP, 2 ACM, 3 OAI, 4 CiteSeer, 5 IBM, 6 Roma, 7 IEEE, 8 Ulm, 9 Pisa
     Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova, M.A., Dietze, S., Two approaches to the dataset interlinking recommendation problem, 15th International Conference on Web Information System Engineering (WISE 2014), Thessaloniki, Greece.
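The vocabulary-overlap idea from Approach 1 can be illustrated with a minimal sketch. It assumes each dataset is summarised by the set of schema terms it uses and scores candidates by Jaccard overlap; the function names, the Jaccard measure and the toy profiles are illustrative assumptions, not the scoring used in the cited WISE 2014 paper.

```python
def jaccard(a: set, b: set) -> float:
    """Share of vocabulary terms that two datasets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(source_terms: set, candidates: dict) -> list:
    """Rank candidate datasets by vocabulary overlap with the source dataset."""
    scored = [(name, jaccard(source_terms, terms)) for name, terms in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vocabulary profiles (illustrative only, not taken from the study).
source = {"foaf:Person", "dc:title", "bibo:Article"}
candidates = {
    "dataset_a": {"foaf:Person", "dc:title", "bibo:Article", "dc:creator"},
    "dataset_b": {"geo:lat", "geo:long"},
    "dataset_c": {"dc:title", "skos:Concept"},
}
ranking = rank_candidates(source, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Datasets that share more schema elements with the source rank higher, which mirrors the slide's assumption that shared vocabulary hints at linking candidates.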
  10. Dataset recommendation II
     Figure labels: Preprocessing, Datasets ranking, Datasets filtering
     Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
     L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC EBIQUITY-CORE: Semantic textual similarity systems", in Proc. of the *SEM, Association for Computational Linguistics, 2013.
  11. Dataset recommendation II: results
     Data & ground truth
     • Experiments on (responsive) datasets from LOD Cloud (http://datahub.io)
     • Concept profiles from http://lov.okfn.org
     • Ground truth: existing links from VOID profiles of datasets (issue: not always representative for actual linksets)
     Results
     • MAP for different similarity thresholds from step 2: max. 54%
     • Recall 100% below indicated similarity (clustering) thresholds
     Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
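Since both recommendation approaches are reported in terms of MAP, a short sketch of how mean average precision is usually computed may help. The rankings and relevance sets below are hypothetical and only serve to make the arithmetic concrete.

```python
def average_precision(ranked: list, relevant: set) -> float:
    """Mean of the precision values at each rank where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs: list) -> float:
    """Average the per-query (here: per source dataset) average precision."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical recommendation lists with their ground-truth linked datasets.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6"}),              # AP = 1/2
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

A MAP around 50%, as reported on the slides, means that relevant datasets appear on average about halfway up the precision scale across all source datasets.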
  12. Dataset search through dataset cataloging & profiling
     Dataset Catalog/Registry: http://data.linkededucation.org/linkedup/catalog/
     • LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning Solutions)
     • LinkedUp Catalog: largest collection of LD of educationally relevant resources (approx. 50 datasets)
     • Original datasets published with key content providers, automatically extracted metadata
  13. LinkedUp Catalog: dataset index & registry, federated search
     http://data.linkededucation.org/linkedup/catalog/
     • "Federated queries" through schema mappings [WebSci13]
     • Dataset accessibility
     • Linking & topic profiling
     Figure: Schema/Types
  14. LinkedUp Catalog: dataset index & registry, federated search
     http://data.linkededucation.org/linkedup/catalog/
     • "Federated queries" through schema mappings [WebSci13]
     • Dataset accessibility
     • Linking & topic profiling [ESWC14]
     Figure: Dataset topic profiles
  15. Scalable profiling of datasets
     Figure labels: Dataset Catalog/Registry; Yovisto Video (yov:Video); db:Biology, db:Cell biology, db:Cell (Biology)
     Markup example: <yo:Video …> <dc:title>Lecture 29 – Stem Cells</dc:title> … </yo:Video…>
     • Extraction of representative (DBpedia) categories ("topic profile") for arbitrary datasets?
     • Technically trivial through established NER/NED approaches, but scalability issues (recall: LOD Cloud 1000+ datasets with <100 billion RDF statements)
     • Efficient approach: sampling & ranking for balance between scalability and precision/recall
     Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, 2014.
  16. Efficient dataset profiling
     1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling)
     2. Entity & topic extraction (NER via DBpedia Spotlight, category mapping & expansion)
     3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov)
     • Result: weighted dataset-topic profile graph
     Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, 2014.
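To make the ranking step more tangible, here is a minimal sketch of PageRank with priors, computed by power iteration over a tiny hypothetical graph that links a dataset to extracted entities and those entities to DBpedia-style categories. The graph, the restart parameter `beta`, and the handling of dangling nodes are assumptions for illustration; the cited ESWC 2014 paper should be consulted for the actual model and settings.

```python
def pagerank_with_priors(edges, priors, beta=0.15, iterations=50):
    """Power iteration where the random walk restarts at the prior nodes.

    edges: list of (source, target) pairs; priors: node -> prior weight.
    beta is the probability of jumping back to the prior distribution.
    """
    nodes = sorted(set(priors) | {node for edge in edges for node in edge})
    total = sum(priors.values())
    prior = {node: priors.get(node, 0.0) / total for node in nodes}
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = dict(prior)
    for _ in range(iterations):
        updated = {node: beta * prior[node] for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = (1 - beta) * rank[node] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # Dangling node: hand its mass back to the prior distribution.
                for other in nodes:
                    updated[other] += (1 - beta) * rank[node] * prior[other]
        rank = updated
    return rank


# Hypothetical profile graph: dataset -> sampled entities -> categories.
edges = [
    ("dataset", "e:Stem_cell"),
    ("dataset", "e:Mitosis"),
    ("e:Stem_cell", "cat:Cell_biology"),
    ("e:Mitosis", "cat:Cell_biology"),
    ("e:Mitosis", "cat:Genetics"),
]
scores = pagerank_with_priors(edges, priors={"dataset": 1.0})
topics = sorted((n for n in scores if n.startswith("cat:")), key=scores.get, reverse=True)
print(topics)
```

Placing all prior weight on the dataset node biases the walk towards categories that are reachable from that dataset's entities, which is the intuition behind a weighted dataset-topic profile.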
  17. Search & exploration of datasets through topic profiles
     • Applied to entire LOD cloud/graph
     • Visual exploration of extracted RDF dataset profiles (datasets, topics, relationships)
     • Evaluation results: K-Step Markov (10% sampling size) outperforms baselines (LDA, tf/idf on entire datasets)
     http://data-observatory.org/lod-profiles/
  18. Search: entity retrieval on large structured datasets?
     • How to efficiently retrieve (related) entities/resources for a given entity-seeking (keyword) query?
     • State of the art: BM25F on inverted entity index (Blanco et al, ISWC2011)
     • Challenges/observations:
        o Explicit entity links (owl:sameAs etc.) are sparse yet important to facilitate state-of-the-art methods
        o Query type affinity?
     Figure: entities related to <Tim Berners Lee> in a large dataset/crawl, e.g. LinkedUp dataset graph, BTC2014, Dynamic LD Observatory (DyLDO)?
  19. Entity retrieval: approach
     (I) Offline processing (clustering to address link sparsity)
        1. Feature vectors (lexical and structural features)
        2. Bucketing: per type (LSH algorithm)
        3. Clustering: X-means & Spectral clustering per bucket
     (II) Online processing (retrieval)
        1. Retrieval & expansion: a) BM25F results, b) expansion from clusters (related entities)
        2. Re-ranking (context terms & query type affinity)
     Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, 2015.
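The online phase (retrieve, expand from clusters, re-rank) can be sketched roughly as follows. The sketch assumes that a BM25F result list and the offline cluster assignments already exist; the scoring function is a deliberately simple stand-in (label term overlap plus a type match bonus) and not the re-ranking model of the ISWC 2015 paper. All identifiers and data are hypothetical.

```python
def expand_with_clusters(seed_results, cluster_of, members_of, per_seed=2):
    """Append up to per_seed co-clustered entities for every seed result."""
    expanded = list(seed_results)
    seen = set(seed_results)
    for entity in seed_results:
        added = 0
        for neighbour in members_of.get(cluster_of.get(entity), []):
            if neighbour not in seen and added < per_seed:
                expanded.append(neighbour)
                seen.add(neighbour)
                added += 1
    return expanded


def rerank(entities, labels, types, query_terms, query_type):
    """Order entities by label/query term overlap plus a bonus for a matching type."""
    def score(entity):
        label_terms = set(labels[entity].lower().split())
        overlap = len(label_terms & query_terms) / len(query_terms)
        type_bonus = 1.0 if types.get(entity) == query_type else 0.0
        return overlap + type_bonus
    return sorted(entities, key=score, reverse=True)


# Hypothetical inputs: BM25F seeds plus precomputed cluster assignments.
seeds = ["ex:Tim_Berners-Lee", "ex:Berners_Street"]
cluster_of = {"ex:Tim_Berners-Lee": 0, "ex:Berners_Street": 1}
members_of = {0: ["ex:Tim_Berners-Lee", "ex:TimBL"], 1: ["ex:Berners_Street"]}
labels = {
    "ex:Tim_Berners-Lee": "Tim Berners Lee",
    "ex:TimBL": "Tim Berners Lee (W3C)",
    "ex:Berners_Street": "Berners Street",
}
types = {"ex:Tim_Berners-Lee": "Person", "ex:TimBL": "Person", "ex:Berners_Street": "Place"}

candidates = expand_with_clusters(seeds, cluster_of, members_of)
results = rerank(candidates, labels, types, {"tim", "berners", "lee"}, "Person")
print(results)
```

The expansion step surfaces `ex:TimBL`, which has no explicit link to the seed entity, illustrating how clustering can compensate for sparse owl:sameAs links.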
  20. Entity retrieval: evaluation
     Dataset
     • BTC2014 (4 billion entities)
     • 92 SemSearch queries
     Methods
     • Our approaches: XM: X-means, SP: Spectral
     • Baselines: B: BM25F, S1: Tonon et al [SIGIR12]
     Conclusions
     • XM & SP outperform baselines
     • Clustering to remedy link sparsity
     • Relevance to query more important than relevance to BM25F results
     Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, 2015.
  21. Overview
     I – Challenges
     II – Enabling discovery & search in Linked Data & Knowledge Graphs
        • Dataset recommendation
        • Dataset profiling
        • Entity retrieval
     III – Beyond Linked Data – exploiting embedded Web semantics
        • Web markup as emerging data source
        • Case studies
        • Data fusion for entity reconciliation (and retrieval)
     IV – Wrap-up
     Dealing with diversity and dynamics
     Other emerging forms of structured data on the Web?
  22. Semantics (structured data) on the Web?
     • Linked Data: approx. 1000 datasets & 100 billion statements
     vs
     • The Web: approx. 46.000.000.000.000 (46 trillion) Web pages indexed by Google
     Different order of magnitude wrt scale & dynamics
     • Other "semantics" (structured facts) on the Web?
  23. Embedded semantics: Web page markup & schema.org
     • Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
     • Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates)
     • Adoption on the Web: 26% (2014 Google study of 12 bn Web pages)
     • "Web Data Commons" (Meusel & Paulheim [ISWC2014])
        o Markup from Common Crawl (2.2 billion pages): 17 billion RDF quads
        o Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)
     • Same order of magnitude as "the Web"
     Markup example:
        <div itemscope itemtype="http://schema.org/Movie">
          <h1 itemprop="name">Forrest Gump</h1>
          <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
          <span itemprop="genre">Drama</span>
          ...
        </div>
     RDF statements:
        node1 actor _node-x
        node1 actor Robin Wright
        node1 genre Comedy
        node2 actor T. Hanks
        node2 distributed by Paramount Pic.
        node3 actor Tom Cruise
        node3 distributed by Paramount Pic.
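How markup turns into statements can be shown with a small sketch built on Python's standard `html.parser`. It only handles flat, non-nested Microdata items with text values and invents node identifiers (`node1`, ...) in the style of the slide; it is an illustration under those assumptions, not a conforming Microdata-to-RDF extractor such as the ones used for Web Data Commons.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collects (item, property, value) triples from flat Microdata markup."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.item_count = 0
        self.current_item = None
        self.current_prop = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            # Every itemscope starts a new (blank) node; nesting is not handled.
            self.item_count += 1
            self.current_item = f"node{self.item_count}"
            if "itemtype" in attributes:
                self.triples.append((self.current_item, "type", attributes["itemtype"]))
        if "itemprop" in attributes and self.current_item:
            self.current_prop = attributes["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.current_prop and text:
            # The first text after an itemprop attribute is taken as its value.
            self.triples.append((self.current_item, self.current_prop, text))
            self.current_prop = None


html = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>
"""
parser = MicrodataParser()
parser.feed(html)
parser.close()
for triple in parser.triples:
    print(triple)
```

Because each page yields its own blank nodes, the same real-world movie ends up as many unlinked descriptions across pages, which is the coreference problem the later slides address.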
  24. Example: embedded Web markup data about "products"
     • schema:Product instances in Web Data Commons
     • Facts: 1.414.937.431 (= 302.246.120 instances, i.e. products)
     • Providers (distinct Pay Level Domains, PLDs): 93.705
     • Power law distribution of terms across PLDs
     • Top 10 PLDs
     • Top provider? (company)
     PLD                            # Resources
     www.crateandbarrel.com         33.517.936
     www.bentgate.com               17.215.499
     www.aliexpress.com              9.621.943
     www.ebay.com.au                 8.861.308
     us.fotolia.com                  7.939.982
     www.ebay.co.uk                  6.556.820
     www.competitivecyclist.com      6.214.500
     www.maxstudio.com               6.075.626
     Figure label: approx. 35 million resources
  25. Example: Web markup of bibliographic resources
     Figure: # entities and # statements per PLD (ranked), count on log scale
     Study on sample Web crawl (WDC)
     • Metadata about scholarly articles (e.g. s:ScholarlyArticle): 6.793.764 quads, 1.184.623 entities, 429 distinct predicates (in WDC and for 1 type alone)
     • Top 5 domains: Springer, MDPI, BMJ, diabetesjournals.org, mendeley.com, Biodiversitylibrary.org
     Domains, topics, disciplines?
     • Life Sciences and Computer Science predominant
     • Top-10 article titles
     • Most important publishers/journals, libraries represented
     Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S., Analysing Structured Scholarly Data embedded in Web Pages, Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD2016), co-located with the 25th International World Wide Web Conference, Montreal, Canada, April 11, 2016.
  26. Example: entity markup of learning resources on the Web
     • "Learning Resources Metadata Initiative (LRMI)": schema.org vocabulary for annotation of learning resources (informal, formal, etc.)
     • Approx. 5000 PLDs in "Common Crawl"
     • LRMI adoption on the Web (WDC) [LILE16]:
        o 2014: 30.599.024 quads, 4.182.541 resources
        o 2013: 10.636.873 quads, 1.461.093 resources
     Figure: power law distribution across providers (4805 providers/PLDs)
     Taibi, D., Dietze, S., Towards embedded markup of learning resources on the Web: a quantitative Analysis of LRMI Terms Usage, in Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2 2016, Montreal, Canada, April 11, 2016.
  27. Entity retrieval on Web markup: state of the art
     • Glimmer (http://glimmer.research.yahoo.com)
     • Entity retrieval on WDC dataset [Blanco, Mika & Vigna, ISWC2011]
     • BM25F retrieval model on WDC index
  28. Entity retrieval on Web markup: challenges
     Characteristics and examples:
     • Coreferences: 18.000 results for <"Iphone 6", type, s:Product> (8,6 quads on average) in Common Crawl
     • Redundancy: <s, schema:name, "Iphone 6"> occurring 1000 times in Common Crawl
     • Lack of links: largely unlinked entity descriptions
     • Errors (typos & schema violations, see Meusel et al [ESWC2015]):
        o Wrong namespaces, such as http://schma.org
        o Undefined types & predicates: 9,7%, less common than in LOD
        o Confusion of datatype and object properties, e.g. <s1, s:publisher, "Springer">: 24,35% object property issues vs 8% in LOD
        o Data property range violations, e.g. literals vs numbers: 12,6% vs 4,6% in LOD
     • Using markup as (highly distributed) knowledge graph?
     Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., A Survey on Challenges for Entity Retrieval in Markup Data, 15th International Semantic Web Conference (ISWC2016), Kobe, Japan, 2016.
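One of the listed error classes, misspelled namespaces such as http://schma.org, lends itself to a simple cleaning heuristic. The sketch below assumes a small whitelist of known namespaces and uses fuzzy string matching from Python's `difflib`; the whitelist, the similarity cutoff and the function names are illustrative assumptions and not a method taken from the cited survey.

```python
import difflib

# Hypothetical whitelist of namespaces considered valid for this example.
KNOWN_NAMESPACES = [
    "http://schema.org/",
    "http://purl.org/dc/terms/",
    "http://xmlns.com/foaf/0.1/",
]


def split_namespace(iri: str) -> tuple:
    """Split an IRI into namespace and local name at the last '/' or '#'."""
    cut = max(iri.rfind("/"), iri.rfind("#")) + 1
    return iri[:cut], iri[cut:]


def repair_namespace(iri: str, cutoff: float = 0.85) -> str:
    """Map a near-miss namespace to the closest known one, else leave the IRI as-is."""
    namespace, local_name = split_namespace(iri)
    if namespace in KNOWN_NAMESPACES:
        return iri
    match = difflib.get_close_matches(namespace, KNOWN_NAMESPACES, n=1, cutoff=cutoff)
    return match[0] + local_name if match else iri


print(repair_namespace("http://schma.org/Product"))
print(repair_namespace("http://example.com/Thing"))
```

A high cutoff keeps the repair conservative: obvious typos are normalised, while unrelated namespaces are left untouched rather than being forced onto a known vocabulary.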
  29. Entity retrieval & reconciliation on markup
     • Obtaining consolidated entity description/facts (or graph) for a given resource/entity from Web markup?
     • Aiding tasks such as document annotation, augmentation or semantic enrichment of existing data- or knowledge bases
     Example (figure):
        Query: iPhone 6, type:(Product)
        Web (crawl) (e.g. Common Crawl/WDC, focused crawl):
           <e1, s:name, "Iphone 6">
           <e2, s:brand, "Apple Inc.">
           <e3, s:brand, "Apple">
           <e4, s:weight, 127>
           <e5, s:releaseDate, "1.12.1972">
        Entity description:
           brand         Apple Inc.
           weight        129
           date          30.09.2015
           manufacturer  Foxconn
           Storage       16 GB
     Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Entity summarisation on structured web markup. In The Semantic Web: ESWC 2016 Satellite Events. Springer, 2016.
     Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Fact Selection for data fusion on structured web markup. ICDE2017, IEEE International Conference on Data Engineering, in progress.
  30. A supervised approach for data fusion on markup
     • Fact/entity retrieval: BM25 entity retrieval model on markup index (Common Crawl)
     • Fact selection/data fusion: ML classifier (SVM), using 3 feature categories (relevance, authority, clustering)
     • Experiments on Common Crawl: products, movies, books (approx. 3 billion facts)
     Pipeline (figure):
        Query: iPhone 6, type:(Product)
        1. Retrieval: Web page markup from Web (crawl), approx. 125.000 facts for "iPhone6"
        Candidate facts:
           node1 brand _node-x
           node1 brand Apple Inc.
           node1 weight 129
           node2 weight 172
           node2 manufacturer Foxconn
           node3 releasedate 01.12.1972
           node3 manufacturer Foxconn
        2. Fact selection (supervised SVM classifier)
        Entity description:
           brand         Apple Inc.
           weight        129
           date          30.09.2015
           manufacturer  Foxconn
           Storage       16 GB
        New queries: Foxconn, type:(Organization); Cupertino, type:(City); Apple Inc., type:(Organization)
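To illustrate what the fact selection step has to achieve, here is a minimal sketch of a much simpler stand-in: majority voting over lightly normalised candidate values per predicate. This is explicitly not the supervised SVM approach from the slide; it only shows the shape of the input (candidate facts) and output (one consolidated entity description). The normalisation rule and the sample facts are assumptions.

```python
from collections import Counter, defaultdict


def normalise(value) -> str:
    """Crude normalisation: lower-case, drop full stops, collapse whitespace."""
    return " ".join(str(value).lower().replace(".", "").split())


def fuse_by_vote(candidate_facts) -> dict:
    """Pick, per predicate, the value that most candidate facts agree on."""
    votes = defaultdict(Counter)
    surface_form = {}
    for _node, predicate, value in candidate_facts:
        key = normalise(value)
        votes[predicate][key] += 1
        # Remember the first spelling seen for each normalised value.
        surface_form.setdefault((predicate, key), value)
    return {
        predicate: surface_form[(predicate, counter.most_common(1)[0][0])]
        for predicate, counter in votes.items()
    }


# Hypothetical candidate facts in the style of the slide's example.
candidate_facts = [
    ("node1", "brand", "Apple Inc."),
    ("node2", "brand", "apple inc"),
    ("node3", "brand", "Apple"),
    ("node1", "weight", "129"),
    ("node2", "weight", "129"),
    ("node3", "weight", "172"),
    ("node2", "manufacturer", "Foxconn"),
]
description = fuse_by_vote(candidate_facts)
print(description)
```

Frequency alone is easily fooled by copied or spammy markup, which is one motivation for combining relevance, authority and clustering features in a learned classifier as described on the slide.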
  31. Evaluation & results (1/2)
     Evaluation setup
     • Comparison with baselines:
        o BM25: top-k distinct facts via BM25
        o CBFS: clustering/heuristics-based approach
     • Expert-labeled ground truth
     Results
     • Supervised learning approach (SumSVM, SumDIV) outperforms baselines
     • Strong variance of results across query sets (for baselines, not our approach)
     • Strongest performance considering all feature sets
     Figure: precision results
  32. Evaluation & results (2/2): markup for KB augmentation?
     • Comparison of obtained facts with existing knowledge bases (DBpedia)
        o "existing": fact already in DBpedia
        o "new": fact not existing in DBpedia (e.g. a book's releaseDate in Wiki/DBpedia)
        o "new-p": property not existing in DBpedia (e.g. a book's release countries)
     • 60-70% new facts for books & movies
     • 100% new facts for queried products (apparently not existing in DBpedia)
     • Vast potential for KB augmentation (!)
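The three categories can be expressed as a small decision rule. The sketch assumes the knowledge base is available as a set of known properties and a set of (property, value) facts for the entity in question; the property names and values are made up for illustration and exact string equality stands in for whatever matching the study actually used.

```python
def classify_fact(predicate, value, kb_facts, kb_properties) -> str:
    """Label a fused fact as 'existing', 'new' or 'new-p' relative to a KB."""
    if predicate not in kb_properties:
        return "new-p"      # the KB has no such property at all
    if (predicate, value) in kb_facts:
        return "existing"   # the KB already states this fact
    return "new"            # known property, but this fact is missing


# Hypothetical KB view of one book (not real DBpedia content).
kb_properties = {"author", "releaseDate", "publisher"}
kb_facts = {("author", "Jane Doe"), ("publisher", "Example Press")}

fused_facts = [
    ("author", "Jane Doe"),
    ("releaseDate", "2001-05-01"),
    ("releaseCountry", "Germany"),
]
labels = [classify_fact(p, v, kb_facts, kb_properties) for p, v in fused_facts]
print(labels)
```

Counting the share of "new" and "new-p" labels over all fused facts gives the kind of augmentation potential reported on the slide.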
  33. Conclusions & outlook
     Figure label: Linked Data & knowledge graphs
     • Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc.
     • Dealing with diversity & heterogeneity
        o Profiling & recommendation: dataset search & recommendation
        o Entity retrieval & clustering: entity search
  34. Conclusions & outlook
     Figure labels: Embedded data/markup; Unstructured (Web) data/docs; Linked Data & knowledge graphs
     Figure example (embedded markup):
        Entity
           node1 name Molecular structure of nucleic acids
           node1 author James D. Watson
           node1 publisher Nature
           node1 datePublished 1956
           node1 datePublished 1953
        Entity
           node2 name Francis Crick
           node2 name Cricks
           node2 born 1916
     • Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc.
     • Dealing with diversity & heterogeneity
        o Profiling & recommendation: dataset search & recommendation
        o Entity retrieval & clustering: entity search
     • New forms of (structured) Web data: Web markup (schema.org et al)
        o Convergence of structured and unstructured Web
        o Scale and dynamics (!)
        o Potential to augment existing knowledge graphs
        o Potential training data for NED, entity interlinking and similar entity-centric problems
  35. Thank you!
     http://stefandietze.net
     @stefandietze
