
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web



Keynote at LDOW2017 (Linked Data on the Web 2017), at WWW2017, Perth, Australia, 03 April 2017



  1. 1. Beyond Linked Data – Exploiting Entity-Centric Knowledge on the Web. Stefan Dietze, L3S Research Center, Hannover, Germany. Linked Data on the Web (LDOW2017), WWW2017, 05/04/17
  2. 2. Research @ L3S
     • Research areas: Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation
     • Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility, ...
     • Some projects; see also: http://www.l3s.de
  3. 3. Acknowledgements: team
     • Pavlos Fafalios (L3S)
     • Besnik Fetahu (L3S)
     • Elena Demidova (L3S)
     • Ujwal Gadiraju (L3S)
     • Eelco Herder (L3S)
     • Ivana Marenzi (L3S)
     • Nicolas Tempelmeier (L3S)
     • Ran Yu (L3S)
     • Nilamadhaba Mohapatra (L3S, IIT India)
     • Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
     • Mathieu d'Aquin (The Open University, UK)
     • Mohamed Ben Ellefi (LIRMM, France)
     • Davide Taibi (CNR, Italy)
     • Konstantin Todorov (LIRMM, France)
     • ...
  4. 4. Back in September 2016
     • Bernstein, A., Hendler, J., Noy, N., A new look at the semantic web, Communications of the ACM, Vol. 59 No. 9, pp. 35-37, September 2016.
     • Dietze, S., Retrieval, Crawling and Fusion of Entity-centric Data on the Web, in: Calì A., Gorgan D., Ugarte M. (eds), Semantic Keyword-Based Search on Structured Data Sources, KEYSTONE 2016, LNCS Vol. 10151, Springer, 2017.
  5. 5. Overview
     • I – Challenges
     • II – Enabling discovery & search in Linked Data & Knowledge Graphs (dealing with heterogeneity & shortcomings: “Present”)
       • Dataset recommendation
       • Dataset profiling
       • Entity retrieval
     • III – Beyond Linked Data – exploiting embedded Web semantics (other emerging forms of semantics/structured data on the Web: “Future”)
       • Web markup as emerging data source
       • Case studies
       • Data fusion for entity reconciliation (and retrieval)
     • IV – Wrap-up
  6. 6. Data accessibility & quality?
     • Figure: SPARQL endpoint availability over time [Buil-Aranda2013]
     • Accessibility of (linked) datasets?
       • Less than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013]
       • “THE” SPARQL protocol? No: variants, subsets and local restrictions
     • Semantics, links, quality?
       • Data accuracy (e.g. DBpedia)? [Paulheim2013]
       • Schema compliance & evolution [HoganJWS2012]
       • Vocabulary reuse? [D’AquinWebSci13]
     • References
       • [D’AquinWebSci13] D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
       • [Paulheim2013] Paulheim, H., Bizer, C., Type Inference on Noisy RDF Data, Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp. 510-525.
       • [HoganJWS2012] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., An empirical survey of Linked Data conformance, Journal of Web Semantics 14, 2012.
       • [Buil-Aranda2013] Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y., SPARQL Web-Querying Infrastructure: Ready for Action?, International Semantic Web Conference 2013 (ISWC2013).
  7. 7. Vocabulary reuse/linking?
     • Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)
     • Figure labels: po:Programme, yov:Video, ?, bibo:Book
     • D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, May 2013.
  8. 8. Vocabulary reuse/linking?
     • Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)
     • Co-occurrence after mapping (201 frequently occurring types, mapped into 79 types)
     • Figure labels: bibo:Film, bibo:Document, po:Programme, bibo:Book, foaf:Document, yov:Video, typeX
     • D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, May 2013.
  9. 9. “Completeness”?
     • Example: varying completeness of “book” (“movie”) entity descriptions
     • Missing facts: 49.8% (37.1%) in DBpedia, 63.8% (23.3%) in Freebase and 60.9% (40%) in Wikidata (varies heavily across attributes)
     • Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
     • Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
  10. 10. Consistency?
     • Yuan, W., Demidova, E., Dietze, S., Zhu, X., Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study, ISWC2014.
  11. 11. Challenge for search/retrieval – heterogeneity of datasets & entities
     • Discovery of suitable (1) datasets & (2) entities:
       • Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality?
       • Topics/scope? Datasets/entities useful & trustworthy for topic XY?
       • Types? Datasets/entities about statistics, organisations, videos, slides, publications etc.?
  12. 12. Overview
     • I – Challenges
     • II – Enabling discovery & search in Linked Data & Knowledge Graphs (dealing with heterogeneity & shortcomings: “Now”)
       • Dataset recommendation
       • Dataset profiling
       • Entity retrieval
     • III – Beyond Linked Data – exploiting embedded Web semantics (other emerging forms of semantics/structured data on the Web: “Future”)
       • Web markup as emerging data source
       • Case studies
       • Data fusion for entity reconciliation (and retrieval)
     • IV – Wrap-up
  13. 13. Dataset recommendation I
     • Goal: finding candidate datasets, e.g. for entity retrieval or interlinking tasks (e.g. enrichment)
     • Approach
       • Given dataset s, rank datasets from D according to a probability score (di, t) of containing linking candidates (entities)
       • Features: approach 1: vocabulary overlap; approach 2: existing links (SNA)
       • Linking candidates are likely if datasets share common (a) schema elements, or (b) links (friend of a friend)
     • Conclusions
       • Roughly 50% MAP for both approaches
       • Simplistic approach (!)
     • Figure: source dataset S with Linkset1 and Linkset2; example ranking: 1 DBLP, 2 ACM, 3 OAI, 4 CiteSeer, 5 IBM, 6 Roma, 7 IEEE, 8 Ulm, 9 Pisa
     • Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova, M.A., Dietze, S., Two approaches to the dataset interlinking recommendation problem, 15th International Conference on Web Information System Engineering (WISE 2014), Thessaloniki, Greece.
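The vocabulary-overlap idea from the slide above can be made concrete with a small sketch. This is not the scoring used in the cited WISE 2014 paper: it swaps in plain Jaccard similarity over sets of schema terms as a stand-in for the probability score, and the dataset names and term sets are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_vocabulary_overlap(source_terms, candidates):
    """Rank candidate datasets by schema-term overlap with the source dataset.

    `candidates` maps a (hypothetical) dataset name to its set of schema terms.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    scored = [(name, jaccard(source_terms, terms)) for name, terms in candidates.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy data, invented for this sketch.
source = {"foaf:Person", "dcterms:title", "bibo:Article"}
candidates = {
    "dataset_a": {"foaf:Person", "bibo:Article", "dcterms:creator"},
    "dataset_b": {"geo:lat", "geo:long"},
    "dataset_c": {"dcterms:title", "foaf:Person", "bibo:Article"},
}
ranking = rank_by_vocabulary_overlap(source, candidates)
```

A dataset sharing no schema terms with the source scores zero and sinks to the bottom, which matches the intuition that linking candidates are more likely where vocabularies overlap.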
  14. 14. Dataset recommendation II
     • Figure labels: Preprocessing; Datasets ranking; Datasets filtering
     • Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
     • L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC ebiquity-core: Semantic textual similarity systems", in Proc. of *SEM, Association for Computational Linguistics, 2013.
  15. 15. Dataset recommendation II: results
     • Data & ground truth
       • Experiments on (responsive) datasets from the LOD Cloud (http://datahub.io)
       • Concept profiles from http://lov.okfn.org
       • Ground truth: existing links from VoID profiles of datasets (issue: not always representative of actual linksets)
     • Results
       • MAP for different similarity thresholds from step 2: max. 54% (UMBC@0.7)
       • Recall 100% below indicated similarity (clustering) thresholds
     • Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
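Since both recommendation slides report results as MAP, a short sketch of how mean average precision is usually computed may help. It uses the standard textbook definition; the rankings and relevance sets are invented, and nothing here reproduces the evaluation setup of the cited papers.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this hit
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Two toy recommendation runs (hypothetical dataset identifiers).
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = 1/2
]
map_score = mean_average_precision(runs)
```

Relevant items that never appear in the ranking still count in the denominator, so a recommender that misses true linking candidates is penalised.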
  16. 16. Dataset search through dataset cataloging & profiling
     • Dataset Catalog/Registry: http://data.linkededucation.org/linkedup/catalog/
     • LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning Solutions)
     • LinkedUp Catalog: largest collection of LD of educationally relevant resources (approx. 50 datasets)
     • Original datasets published with key content providers, automatically extracted metadata
  17. 17. LinkedUp Catalog: dataset index & registry, federated search
     • “Federated queries” through schema mappings [WebSci13]
     • Dataset accessibility
     • Linking & topic profiling
     • Shown: Schema/Types (http://data.linkededucation.org/linkedup/catalog/)
  18. 18. LinkedUp Catalog: dataset index & registry, federated search
     • “Federated queries” through schema mappings [WebSci13]
     • Dataset accessibility
     • Linking & topic profiling [ESWC14]
     • Shown: Dataset topic profiles (http://data.linkededucation.org/linkedup/catalog/)
  19. 19. Scalable profiling of datasets
     • Extraction of representative (DBpedia) categories (“topic profile”) for arbitrary datasets?
     • Technically trivial through established NER/NED approaches, but scalability issues (recall: LOD Cloud has 1000+ datasets with <100 billion RDF statements)
     • Efficient approach: sampling & ranking for balance between scalability and precision/recall
     • Figure: Dataset Catalog/Registry entry for a Yovisto video (yov:Video): <yo:Video …> <dc:title>Lecture 29 – Stem Cells</dc:title> … </yo:Video…>, with categories db:Biology, db:Cell biology, db:Cell (Biology)
     • Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, 2014.
  20. 20. Efficient dataset profiling
     1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling)
     2. Entity & topic extraction (NER via DBpedia Spotlight, category mapping & expansion)
     3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov)
     • Result: weighted dataset-topic profile graph
     • Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, 2014.
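The three profiling steps above can be sketched end to end on toy data. Every component here is a deliberately simplified stand-in and an assumption of this sketch: uniform random sampling only, a keyword lexicon instead of DBpedia Spotlight, and normalised frequencies instead of the graph-based rankers (PageRank with Priors, HITS with Priors, K-Step Markov) named on the slide. The topic identifiers are hypothetical.

```python
import random
from collections import Counter


def sample_resources(resources, fraction, seed=0):
    """Step 1 (simplified): uniform random sample of resource descriptions."""
    k = max(1, round(len(resources) * fraction))
    return random.Random(seed).sample(resources, k)


def extract_topics(text, lexicon):
    """Step 2 (stand-in for NER/NED): look up keywords in a tiny lexicon."""
    lowered = text.lower()
    return [topic for keyword, topic in lexicon.items() if keyword in lowered]


def build_profile(resources, lexicon, fraction=0.1, seed=0):
    """Step 3 (stand-in for graph ranking): normalised topic frequencies."""
    counts = Counter()
    for text in sample_resources(resources, fraction, seed):
        counts.update(extract_topics(text, lexicon))
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.most_common()} if total else {}


# Toy resource titles and a hypothetical keyword-to-topic lexicon.
resources = [
    "Lecture 29 - Stem Cells",
    "Cell biology basics",
    "Introduction to algebra",
    "Stem cell therapy",
]
lexicon = {
    "stem cell": "topic:Stem_cell",
    "cell": "topic:Cell_biology",
    "algebra": "topic:Algebra",
}
profile = build_profile(resources, lexicon, fraction=1.0)
```

Lowering `fraction` trades profile accuracy for speed, which is the scalability versus precision/recall balance the slides describe.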
  21. 21. Search & exploration of datasets through topic profiles
     • Applied to entire LOD cloud/graph
     • Visual exploration of extracted RDF dataset profiles (datasets, topics, relationships)
     • Evaluation results: K-Step Markov (10% sampling size) outperforms baselines (LDA, tf/idf on entire datasets)
     • http://data-observatory.org/lod-profiles/
  22. 22. Search: entity retrieval on large LD crawls?
     • How to efficiently retrieve (related) entities/resources for a given entity-seeking (keyword) query?
     • State of the art: BM25F on inverted entity index (Blanco et al., ISWC2011)
     • Challenges/observations:
       • Explicit entity links (owl:sameAs etc.) are sparse yet important to facilitate state-of-the-art methods
       • Query type affinity?
     • Figure: large dataset/crawl, e.g. LinkedUp dataset graph, BTC2014, Dynamic LD Observatory (DyLDO); example query: entities related to <Tim Berners Lee>?
  23. 23. Entity retrieval: approach
     • (I) Offline processing (clustering to address link sparsity)
       1. Feature vectors (lexical and structural features)
       2. Bucketing: per type (LSH algorithm)
       3. Clustering: X-means & spectral clustering per bucket
     • (II) Online processing (retrieval)
       1. Retrieval & expansion: a) BM25F results, b) expansion from clusters (related entities)
       2. Re-ranking (context terms & query type affinity)
     • Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, 2015.
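The online phase above (expand the baseline results from precomputed clusters, then re-rank) can be illustrated with a minimal sketch. It assumes the baseline ranking and the cluster assignments already exist, and it re-ranks by simple query-term overlap with entity labels, which is a stand-in for the context-term and type-affinity features of the cited ISWC 2015 approach. All identifiers and labels are invented.

```python
def expand_with_clusters(base_results, cluster_of, members_of):
    """Append cluster co-members of each baseline hit, preserving order."""
    expanded = list(base_results)
    seen = set(base_results)
    for entity in base_results:
        for neighbour in members_of.get(cluster_of.get(entity), []):
            if neighbour not in seen:
                seen.add(neighbour)
                expanded.append(neighbour)
    return expanded


def rerank(query, entities, labels):
    """Re-rank by number of query terms found in the entity label.

    sorted() is stable, so entities with equal scores keep their prior order.
    """
    query_terms = set(query.lower().split())

    def score(entity):
        return len(query_terms & set(labels[entity].lower().split()))

    return sorted(entities, key=lambda entity: -score(entity))


# Hypothetical baseline output and offline clustering result.
base_results = ["e1"]
cluster_of = {"e1": "c1", "e2": "c1", "e3": "c2"}
members_of = {"c1": ["e1", "e2"], "c2": ["e3"]}
labels = {"e1": "Tim Berners-Lee", "e2": "Tim Berners Lee inventor", "e3": "Vannevar Bush"}

expanded = expand_with_clusters(base_results, cluster_of, members_of)
reranked = rerank("tim berners lee", expanded, labels)
```

Here `e2` carries no explicit link to `e1` but is recovered through the shared cluster, which is how clustering compensates for sparse owl:sameAs links.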
  24. 24. Entity retrieval: evaluation
     • Dataset
       • BTC2014 (4 billion entities)
       • 92 SemSearch queries
     • Methods
       • Our approaches: XM: X-means, SP: Spectral
       • Baselines: B: BM25F, S1: Tonon et al. [SIGIR12]
     • Conclusions
       • XM & SP outperform baselines
       • Clustering to remedy link sparsity (yet extensive offline processing required)
       • Relevance to query more important than relevance to BM25F results
     • Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, 2015.
  25. 25. PROFILES2017 - Profiling & search of Linked Data (https://profiles2017.wordpress.com/)
     • Probably co-located with ISWC2017 (Vienna)
     • Submissions due 21 June
  26. 26. Overview
     • I – Challenges
     • II – Enabling discovery & search in Linked Data & Knowledge Graphs (dealing with heterogeneity & shortcomings: “Present”)
       • Dataset recommendation
       • Dataset profiling
       • Entity retrieval
     • III – Beyond Linked Data – exploiting embedded Web semantics (other emerging forms of structured data on the Web: “Future”?)
       • Web markup as emerging data source
       • Case studies
       • Data fusion for entity reconciliation (and retrieval)
     • IV – Wrap-up
  27. 27. Web semantics & entity-centric Web data
     • Linked Data: approx. 1000+ datasets & 100 billion statements
     • Open Data: XXX datasets
     • Web (of documents): approx. 46.000.000.000.000 (46 trillion) Web pages indexed by Google
     • Other forms of Web semantics and entity-centric knowledge?
       • Dynamics?
       • Quality?
       • Accessibility?
       • Scale?
  28. 28. Embedded Web page markup & schema.org
     • Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
     • Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates)
     • Adoption on the Web: 26% (2014 Google study of 12 bn Web pages)
     • “Web Data Commons” (Meusel & Paulheim [ISWC2014]), http://webdatacommons.org
       • Markup from Common Crawl (3.2 billion pages): 44 billion RDF quads (2016)
       • Markup in 38% of pages in 2016
     • Same order of magnitude as “the Web” (!)
     • Example markup: <div itemscope itemtype="http://schema.org/Movie"> <h1 itemprop="name">Forrest Gump</h1> <span>Actor: <span itemprop="actor">Tom Hanks</span></span> <span itemprop="genre">Drama</span> ... </div>
     • Example RDF statements extracted from markup:
       • node1 actor _node-x
       • node1 actor Robin Wright
       • node1 genre Comedy
       • node2 actor T. Hanks
       • node2 distributed by Paramount Pic.
       • node3 actor Tom Cruise
       • node3 distributed by Paramount Pic.
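To show how markup like the Forrest Gump example turns into statements, here is a minimal sketch using only Python's standard `html.parser`. It is not a conforming Microdata processor and is unrelated to the Web Data Commons extraction pipeline: it assumes a single flat item with text-valued properties, ignores nested `itemscope` elements, and the node label `node1` is a placeholder.

```python
from html.parser import HTMLParser


class ItempropExtractor(HTMLParser):
    """Collect (subject, property, text) triples from flat Microdata markup."""

    def __init__(self, subject="node1"):
        super().__init__()
        self.subject = subject
        self.item_type = None
        self.current_prop = None
        self.statements = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemtype" in attributes:
            self.item_type = attributes["itemtype"]
        if "itemprop" in attributes:
            self.current_prop = attributes["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        # Only the first non-empty text after an itemprop tag becomes its value.
        if self.current_prop and text:
            self.statements.append((self.subject, self.current_prop, text))
            self.current_prop = None


html = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>
"""
extractor = ItempropExtractor()
extractor.feed(html)
```

After `feed`, `extractor.statements` holds one triple per annotated property and `extractor.item_type` holds the declared schema.org type; all values come out as plain strings, which already hints at the "strings, not things" issue discussed later in the deck.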
  29. 29. Example: embedded Web markup about “products”
     • schema:Product instances in WDC2015
     • Facts: 1.414.937.431 (= 302.246.120 instances, i.e. products)
     • Providers (distinct Pay Level Domains, PLDs): 93.705
     • Power law distribution of terms across PLDs
     • Top provider (company)? approx. 35 million resources
     • Top 10 PLDs (PLD: # resources)
       • www.crateandbarrel.com: 33.517.936
       • www.bentgate.com: 17.215.499
       • www.aliexpress.com: 9.621.943
       • www.ebay.com.au: 8.861.308
       • us.fotolia.com: 7.939.982
       • www.ebay.co.uk: 6.556.820
       • www.competitivecyclist.com: 6.214.500
       • www.maxstudio.com: 6.075.626
  30. 30. Example: markup of bibliographic resources
     • Study on sample Web crawl (WDC2015)
       • Metadata about scholarly articles (e.g. s:ScholarlyArticle): 6.793.764 quads, 1.184.623 entities, 429 distinct predicates (in WDC and for 1 type alone)
       • Top 5 domains: Springer, MDPI, BMJ, mendeley.com, Biodiversitylibrary.org
     • Domains, topics, disciplines?
       • Life Sciences and Computer Science predominant
       • Top-10 article titles
       • Noise
     • Figure: # entities and # statements per PLD (ranked), count on log scale
     • Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S., Analysing Structured Scholarly Data embedded in Web Pages, SAVE-SD2016, co-located with WWW2016.
  31. 31. Example: markup of learning resources on the Web
     • “Learning Resources Metadata Initiative (LRMI)”: schema.org vocabulary for annotation of learning resources
     • Developed through DCMI Task Force on LRMI
     • Approx. 5000 PLDs (incl. subdomains) in CC
     • LRMI adoption (WDC) [WWW17]:
       • 2015: 44,108,511 quads
       • 2014: 30,599,024 quads
       • 2013: 10,636,873 quads
     • Dietze, S., Taibi, D., Yu, R., Barker, P., d’Aquin, M., Analysing and Improving embedded Markup of Learning Resources on the Web, 26th International World Wide Web Conference (WWW2017), Digital Learning track, Perth, April 2017.
  32. 32. Example: markup of learning resources on the Web
     • “Learning Resources Metadata Initiative (LRMI)”: schema.org vocabulary for annotation of learning resources
     • Developed through DCMI Task Force on LRMI
     • Approx. 5000 PLDs (incl. subdomains) in CC
     • LRMI adoption (WDC) [WWW17]:
       • 2015: 44,108,511 quads
       • 2014: 30,599,024 quads
       • 2013: 10,636,873 quads
     • Frequent errors and unintended use (e.g. porn)
     • Example domains: 7xxxtube.com, 1amateurporntube.com, virtualpornstars.com, sunriseseniorliving.com, simplyfinance.co.uk, menslifestyles.com, audiobooks.com, simplypsychology.org, helles-koepfchen.de
     • Dietze, S., Taibi, D., Yu, R., Barker, P., d’Aquin, M., Analysing and Improving embedded Markup of Learning Resources on the Web, 26th International World Wide Web Conference (WWW2017), Digital Learning track, Perth, April 2017.
  33. 33. Entity retrieval on Web markup: state of the art
     • Glimmer (http://glimmer.research.yahoo.com)
     • Entity retrieval on WDC dataset [Blanco, Mika & Vigna, ISWC2011]
     • BM25F retrieval model on WDC index
  34. 34. Web markup: challenges
     • Characteristics and examples
       • Coreferences: 18,000 results for <“Iphone 6”, type, s:Product> (8.6 quads on average) in CommonCrawl
       • Redundancy: <s, schema:name, “Iphone 6”> occurring 1000 times in CC
       • Lack of links: largely unlinked entity descriptions
       • Errors (typos & schema violations, see Meusel et al. [ESWC2015]):
         • Wrong namespaces, such as http://schma.org
         • Undefined types & predicates: 9.7%, less common than in LOD
         • Confusion of datatype and object properties, e.g. <s1, s:publisher, “Springer”>: 24.35% object property issues vs 8% in LOD
         • Data property range violations, e.g. literals vs numbers (12.6% vs 4.6% in LOD)
     • Using markup as knowledge graph, similar to Linked Data?
     • “Strings, not things”
       • Bias towards datatype properties / using any property as such (!)
       • Numbers from LRMI2015 markup corpus:
         • 46 million “transversal” quads (i.e. excluding hierarchical statements such as rdfs:typeOf)
         • 64% are actual datatype properties, yet 97% refer to literals (up from 70% in 2013)
       • Challenges
         • Markup data = flat entity descriptions (=> fairly unconnected graph)
         • Data reuse requires identity resolution
     • Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., A Survey on Challenges for Entity Retrieval in Markup Data, 15th International Semantic Web Conference (ISWC2016), Kobe, Japan, 2016.
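One of the error classes listed above, misspelled namespaces such as http://schma.org, lends itself to a small illustration. The sketch below flags type IRIs whose namespace is a near miss of the schema.org namespace, using string similarity from the standard library. The 0.9 threshold and the three-way labelling are assumptions of this sketch, not the methodology of the cited studies.

```python
from difflib import SequenceMatcher

SCHEMA_NS = "http://schema.org/"


def classify_namespace(iri, threshold=0.9):
    """Label an IRI's namespace as 'ok', 'likely typo' or 'other vocabulary'.

    The namespace is taken as everything up to and including the last slash.
    `threshold` is an arbitrary cut-off chosen for this illustration.
    """
    namespace = iri.rsplit("/", 1)[0] + "/"
    if namespace == SCHEMA_NS:
        return "ok"
    similarity = SequenceMatcher(None, namespace, SCHEMA_NS).ratio()
    return "likely typo" if similarity >= threshold else "other vocabulary"


checks = {
    iri: classify_namespace(iri)
    for iri in [
        "http://schema.org/Product",
        "http://schma.org/Product",
        "http://xmlns.com/foaf/0.1/Person",
    ]
}
```

A check like this only catches near misses of one known namespace; undefined terms and range violations would need the vocabulary definition itself.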
  35. 35. Entity retrieval & reconciliation on markup
     • Obtaining consolidated & verified entity description/facts (or graph) for a given resource/entity from Web markup?
     • Aiding tasks such as document annotation, augmentation or enrichment of existing data- or knowledge bases/graphs
     • Figure:
       • Query: iPhone 6, type:(Product)
       • Source: Web (crawl), e.g. Common Crawl/WDC, focused crawl
       • Markup facts: <e1, s:name, “Iphone 6”>, <e2, s:brand, “Apple Inc.”>, <e3, s:brand, “Apple”>, <e4, s:weight, 127>, <e5, s:releaseDate, “1.12.1972”>
       • Entity description: brand: Apple Inc.; weight: 129; date: 30.09.2015; manufacturer: Foxconn; storage: 16 GB
     • Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
     • Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
  36. 36. FuseM: query-centric data fusion on Web markup
     • Entity matching: BM25 entity retrieval model on markup index (Common Crawl) & similarity-based matching
     • Data fusion: ML classifier (SVM, knn, RandomForest), 3 feature categories (relevance, authority, clustering)
     • Figure: pipeline from Web page markup (Web crawl) via 1. Matching and 2. Fact selection (supervised SVM classifier) to the entity description
       • Query: iPhone 6, type:(Product)
       • Candidate facts (approx. 125.000 facts for “iPhone6”): node1 brand _node-x; node1 brand Apple Inc.; node1 weight 129; node2 weight 172; node2 manufacturer Foxconn; node3 releasedate 01.12.1972; node3 manufacturer Foxconn
       • Entity description: brand: Apple Inc.; weight: 129; date: 30.09.2015; manufacturer: Foxconn; storage: 16 GB
       • New queries: Foxconn, type:(Organization); Cupertino, type:(City); Apple Inc., type:(Organization)
     • Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
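To make the fact-selection step tangible, the following sketch picks one value per property from a pool of candidate facts. FuseM itself uses a supervised classifier over relevance, authority and clustering features; the sketch swaps in plain majority voting, a far simpler baseline, only to show the input and output shape of the step. The candidate facts are toy data loosely modelled on the slide.

```python
from collections import Counter, defaultdict


def fuse_by_majority(candidate_facts):
    """Pick the most frequent value per predicate (naive voting baseline).

    `candidate_facts` is an iterable of (predicate, value) pairs collected
    from all matched entity descriptions. Ties go to the value seen first.
    """
    votes = defaultdict(Counter)
    for predicate, value in candidate_facts:
        votes[predicate][value] += 1
    return {predicate: counter.most_common(1)[0][0] for predicate, counter in votes.items()}


# Toy candidate facts for one query entity (values are illustrative only).
candidate_facts = [
    ("brand", "Apple Inc."), ("brand", "Apple"), ("brand", "Apple Inc."),
    ("weight", "129"), ("weight", "172"), ("weight", "129"),
    ("manufacturer", "Foxconn"),
]
fused = fuse_by_majority(candidate_facts)
```

Voting treats "Apple" and "Apple Inc." as unrelated strings and trusts every source equally, two weaknesses that motivate the clustering and authority features mentioned on the slide.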
  37. 37. FuseM classifier: features
  38. 38. Evaluation & results: data fusion performance
     • Setup
       • Dataset: Products, Movies, Books (approx. 3 billion facts) from Common Crawl / WDC
       • Baselines:
         • BM25: top-k diverse facts via BM25 (Glimmer)
         • CBFS: clustering-based approach [ESWC2015]
         • PreRecCorr: “Fusing data with correlations” [Pochampally et al., ACM SIGMOD 2014]
       • 10-fold cross validation
     • Results
       • FuseM beats baselines in both tasks (strong variance of baselines across tasks)
       • All feature categories contribute
     • Figures: query-centric data fusion (precision); query-independent data fusion (P/R/F1)
  39. 39. Results: example of fused entity description
     • Data fusion result for book “Brideshead Revisited” (20 distinct facts)
     • New facts (compared to DBpedia):
       • 60% - 70% of all facts for books & movies new (across all KBs)
       • 100% new for products (“long tail entities” not existing in KBs yet)
     • Figure label: new facts and attributes
  40. 40. Results: KB augmentation
     • Augmentation of 15 properties of books (& movies) in three KBs: DB: DBpedia, FB: Freebase, WD: Wikidata
     • Augmentation performance: % of filled slots (or “knowledge gaps”) in KB
     • Performance varies heavily (yet some attributes completed to 100%)
     • Figure: KBA result for entities of type “Book”
     • Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
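The augmentation measure named above, the share of filled slots or "knowledge gaps", can be read as a simple ratio. The sketch below is one plausible reading and an assumption of this example: a slot counts as a gap when the KB has no value for the property, and as filled when the fused markup supplies one. Property names and values are placeholders, and the exact definition in the KnowMore paper may differ.

```python
def filled_slot_ratio(kb_entity, fused_facts):
    """Fraction of empty KB slots for which fused markup provides a value.

    `kb_entity` maps property names to values, with None marking a gap.
    Returns 0.0 when the entity has no gaps to fill.
    """
    gaps = [prop for prop, value in kb_entity.items() if value is None]
    if not gaps:
        return 0.0
    filled = [prop for prop in gaps if fused_facts.get(prop) is not None]
    return len(filled) / len(gaps)


# Placeholder entity with three gaps; the fused facts cover two of them.
kb_entity = {"author": "known", "isbn": None, "publisher": None, "numberOfPages": None}
fused_facts = {"isbn": "placeholder-isbn", "publisher": "placeholder-publisher", "author": "other"}
ratio = filled_slot_ratio(kb_entity, fused_facts)
```

Values that the KB already holds are ignored here, so the ratio measures gap filling only and says nothing about whether a filled value is correct.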
  41. 41. Conclusions & outlook
     • Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc.
     • Dealing with diversity & heterogeneity
       • Profiling & recommendation: dataset search & recommendation
       • Entity retrieval & clustering: entity search
     • Figure label: Linked Data & knowledge graphs
  42. 42. Conclusions & outlook
     • Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc.
     • Dealing with diversity & heterogeneity
       • Profiling & recommendation: dataset search & recommendation
       • Entity retrieval & clustering: entity search
     • New forms of (structured) Web data: Web markup (schema.org et al.) & tables
       • Convergence of structured and unstructured Web (e.g. Voldemort KG, Tonon et al., ISWC2016)
       • Scale and dynamics (!)
       • Potential to augment existing knowledge graphs (e.g. Google KG or Microsoft Satori)
       • Potential training data for NED, entity interlinking and other entity-centric tasks (e.g. OKE Challenge)
     • Figure labels: Linked Data & knowledge graphs; Embedded data/markup/tables; Unstructured (Web) data/docs
     • Figure example entities:
       • node1 name Molecular structure of nucleic acids; node1 author James D. Watson; node1 publisher Nature; node1 datePublished 1956; node1 datePublished 1953
       • node2 name Francis Crick; node2 name Cricks; node2 born 1916
  43. 43. Contact & resources
     • @stefandietze
     • http://stefandietze.net
     • More on Web markup: talk on Wednesday, 11:00, WWW2017/Digital Learning track
     • Figure labels: Linked Data & knowledge graphs; Embedded data/markup/tables; Unstructured (Web) data/docs
     • Figure example entities:
       • node1 name Molecular structure of nucleic acids; node1 author James D. Watson; node1 publisher Nature; node1 datePublished 1956; node1 datePublished 1953
       • node2 name Francis Crick; node2 name Cricks; node2 born 1916
