Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Mining and Managing Large-scale Linked Open Data

773 views

Published on

Linked Open Data (LOD) is about publishing and interlinking data of different origin and purpose on the web. The Resource Description Framework (RDF) is used to describe data on the LOD cloud. In contrast to relational databases, RDF does not provide a fixed, pre-defined schema. Rather, RDF allows for flexibly modeling the data schema by attaching RDF types and properties to the entities. Our schema-level index called SchemEX allows for searching in large-scale RDF graph data. The index can be efficiently computed with reasonable accuracy over large-scale data sets with billions of RDF triples, the smallest information unit on the LOD cloud. SchemEX is highly needed as the size of the LOD cloud quickly increases. Due to the evolution of the LOD cloud, one observes frequent changes of the data. We show that also the data schema changes in terms of combinations of RDF types and properties. As changes cannot capture the dynamics of the LOD cloud, current work includes temporal clustering and finding periodicities in entity dynamics over large-scale snapshots of the LOD cloud with about 100 million triples per week for more than three years.

Published in: Data & Analytics
  • Be the first to comment

Mining and Managing Large-scale Linked Open Data

  1. 1. Slide 1Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Ansgar Scherp Mining and Managing Large-scale Linked Open Data GVDB, Nörten-Hardenberg, May 25, 2016 Thanks to: Chifumi Nishioka, Renata Dividino, Thomas Gottron, and many more …
  2. 2. Slide 2Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Team Knowledge Discovery @ Ansgar Scherp Ahmed Saleh Chifumi Nishioka Falk Böschen Mohammad Abdel-Qader Till Blume Anke Koslowski (Secretariat) Henrik Schmidt (Engineer) Lukas Galke Florian Mai &
  3. 3. Slide 3Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Linked Open Data (LOD) Cloud • Publishing and interlinking data on the web • Different quality, purpose, and sources • Using the Resource Description Framework(RDF) World Wide Web LOD Cloud Documents Data Hyperlinks via <a> Typed Links HTML RDF Addresses (URIs) Addresses (URIs)
  4. 4. Slide 4Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Relevance of Linked Data?
  5. 5. Slide 5Prof. Ansgar Scherp – asc@informatik.uni-kiel.de1000+ Datasets, 50+ Billion Triples Media Geographic Publications Web 2.0 eGovernment Cross-Domain Life Sciences Linked Data: May ‘07  August ‘14 Source: http://lod-cloud.net Social Networking
  6. 6. Slide 6Prof. Ansgar Scherp – asc@informatik.uni-kiel.de LOD on One Slide: Example Graph biglynx:matt-briggs foaf:Person rdf:type Fully qualified URI using vocabulary prefixes: @prefix foaf: <http://xmlns.com/foaf/0.1/> . @prefix rdf: <http://w3.org/1999/02/22-rdf-syntax-ns#> . @prefix biglynx: <http://biglynx.co.uk/people/> . Object Predicate Subject RDF Triple
  7. 7. Slide 7Prof. Ansgar Scherp – asc@informatik.uni-kiel.de LOD on One Slide: Example Graph biglynx:matt-briggs foaf:Person rdf:type Fully qualified URI using vocabulary prefixes: @prefix foaf: <http://xmlns.com/foaf/0.1/> . @prefix rdf: <http://w3.org/1999/02/22-rdf-syntax-ns#> . @prefix biglynx: <http://biglynx.co.uk/people/> . biglynx:Director rdf:type … …
  8. 8. Slide 8Prof. Ansgar Scherp – asc@informatik.uni-kiel.de LOD on One Slide: Example Graph biglynx:matt-briggs foaf:Person biglynx:dave-smith biglynx:Director rdf:type foaf:knows rdf:type _1:point wgs84: lat wgs84: long dp:London foaf:based_near …… … … ex:loc “-0.118” “51.509” Types Properties Entity
  9. 9. Slide 9Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Motivation for the SchemEX Index • Single entry point to query the LOD cloud • Search for data sources containing entities like – ‘Persons, who are Politicians and Actors’ – ‘Research data sets’ – ‘Scientific publications’ Query SELECT ?x FROM … WHERE { ?x rdf:type ex:Actor . ?x rdf:type ex:Politician . } Index1 2 2 2
  10. 10. Slide 10Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Input Data for SchemEX • Quads: <subject> <predicate> <object> <context> • Example: <http://biglynx.co.uk/people/matt-briggs> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> <http://biglynx.co.uk/people/matt-briggs.rdf> <http://biglynx.co.uk/people/ matt-briggs.rdf> rdf:type biglynx: matt-briggs foaf: Person LOD Cloud Dataset 𝑋
  11. 11. Slide 11Prof. Ansgar Scherp – asc@informatik.uni-kiel.de SchemEX Idea • Schema-level index SchemEX • Assign RDF entities to graph patterns • Map graph patterns to data sources (context) • Defined over entities, but store the context • Construction of schema-level index • Stream-based for scalability • Stratified bi-simulation for detecting patterns • Little loss of accuracy [KGS+12]
  12. 12. Slide 12Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Building the Index from a Stream • Stream of quads coming from a LD crawler … Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1 FiFo 4 3 2 1 1 6 2 3 4 5 C3 C2 C2 C1 + Reasonable accuracy at cache size of 50k
  13. 13. Slide 13Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Full BTC 2011Data Set: 2.17 Bn Triples Cache size: 50 k Winner BTC’11 + Linear runtime with respect to number of triples + Memory consumption scales with window size
  14. 14. Slide 14Prof. Ansgar Scherp – asc@informatik.uni-kiel.de [GSK+13] Generalization Specialization Result list with examples Inspired by Google
  15. 15. Slide 15Prof. Ansgar Scherp – asc@informatik.uni-kiel.de LODatio Under the Hood SPARQL Snippets Generalize Retrieve Data Sources Query translation Rank Specialize Count Select Select • Hybrid database with off-the-shelf components
  16. 16. Slide 16Prof. Ansgar Scherp – asc@informatik.uni-kiel.de LOD on One Slide: Recap biglynx:matt-briggs foaf:Person biglynx:dave-smith biglynx:Director rdf:type foaf:knows rdf:type _1:point wgs84: lat wgs84: long dp:London foaf:based_near …… … … ex:loc “-0.118” “51.509” Type Set (TS) Property Set (PS) Information theoretic analyses of LOD • How much information is encoded in TS and PS? • … information encoded, once TS or PS is known? • … to which degree are TS and PS redundant? • Example: 20% of PLDs do not need TS (6% for PS) [GKS15]
  17. 17. Slide 17Prof. Ansgar Scherp – asc@informatik.uni-kiel.de • 29 weekly LOD snapshots of ~100 Mio triples • Still running since May 2012 (now 200+ weeks) Käfer et al.’s Temporal Analysis of LOD • Data on the cloud changes a lot [Käfer et al., 2013] T. Käfer, A. Abdelrahman, J. Umbrich, P. O'Byrne, A. Hogan: Observing Linked Data Dynamics. ESWC 2013: 213-227 Changes? • But vocabularies defining RDF types and properties are highly static, e.g., RDF, FOAF LOD cloud ~2012 LOD cloud ~2014
  18. 18. Slide 18Prof. Ansgar Scherp – asc@informatik.uni-kiel.de 𝐻(𝑃𝑆|𝑇𝑆=𝑡𝑠) 𝐻(𝑇𝑆|𝑃𝑆=𝑝𝑠) But:DoChangesOccurinPS and TS? • Analysis: expected conditional entropy over time • 𝐻(𝑃𝑆|𝑇𝑆 = 𝑡𝑠): entropy of 𝑃𝑆 given 𝑇𝑆 is known • Observation: types become less important • Changes in the use of TS and PS ? !
  19. 19. Slide 19Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Changes over Time • Extended characteristic sets: ECS = PS ∪ TS # of ECS Avg.: 83.898 ECS per week # of ECS [DSG+13] • Avg. 73% of ECS re-occur next week (orange) • Avg. 35% of ECS remain unchanged (blue) • Avg. 20% of entity sets of ECS change / week [Neumann and Moerkotte, 2011] Thomas Neumann, Guido Moerkotte: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. ICDE 2011: 984-994 [Neumann and Moerkotte, 2011]
  20. 20. Slide 20Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Temporal Dynamics of the Entities? • Notion of entity motivated by ECS: entity is a set of triples 𝑋 sharing the same subject URI 𝑠 • Example: –1 entity –4 triples w.l.o.g. • Useful to keep LOD caches up-to-date? • Can we predict when LOD sources will change?
  21. 21. Slide 21Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Dynamics Function Θ • Definition of Θ over change rate function 𝑐(𝑋𝑡) Time X 𝑡𝑖 𝑡𝑗 Θ Θ 𝑡 𝑖 𝑋 = Θ(𝑋𝑡 𝑗 ) − Θ(𝑋𝑡 𝑖 ) = 𝑡 𝑖 𝑡 𝑗 𝑐 𝑋𝑡 d𝑡 [DGS+14] 𝑡𝑗 ≈ 𝑘=𝑖+1 𝑗 𝛿(𝑋𝑡 𝑘−1 , 𝑋𝑡 𝑘 ) • Approximation as step function over changes Monotone, non-negative c
  22. 22. Slide 22Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Update Strategies for LOD Sources • Apply strategies from keeping caches of WWW documents up-to-date to maintain LOD caches • Assumptions –LOD is fetched from various sources –Sources are scored and prioritized based on strategy –Data of a source is fetched only when the operation can be entirely executed
  23. 23. Slide 23Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Scheduling Update Strategies a) HTTP Header [Dividino et al., 2014a] b) Age or Last Visited [Dasdan et al., 2009, Cho and Garcia-Molina, 2000] c) PageRank [Page et al., 1999, Boldi et al., 2004, Baeza-Yates et al., 2005] d) LOD Sources Size e) Change Ratio [Douglis et al., 1997, Cho et al., 2002. Tan et al., 2007] f) Change Rate [Olston et al., 2002, Ntoulas et al., 2004, Dividino et al., 2013] g) History Information: Dynamics [Dividino et al., 2014b] We borrow strategies developed for the WWW and metrics for data change analysis in the LOD cloud.
  24. 24. Slide 24Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Ranking Sources which changed (most) Sources that not changed/less changesTimeti tj e) Change Ratio • Captures the change frequency of the data (freshness) • Percentage of data items in the cache that are up-to-date
  25. 25. Slide 25Prof. Ansgar Scherp – asc@informatik.uni-kiel.de f) Change Rate • Data from sources which are less similar which their previous update (snapshot) should be updated first • Comparison of two RDF data sets – 𝑋 : Set of triple statements – 𝛿 : Numeric expression (distance) 𝛿𝐽𝑎𝑐𝑐𝑎𝑟𝑑 𝑋1, 𝑋2 = 1 − 𝑋1 ∩ 𝑋2 𝑋1 ∪ 𝑋2 0,¥[ ) Time𝑡𝑖 𝑡𝑗 𝛿 Example:
  26. 26. Slide 26Prof. Ansgar Scherp – asc@informatik.uni-kiel.de g) History Information: Dynamics • Data from sources which most evolve in a given period of time should be updated first • Uses both history information and change rate Θ(𝑋𝑡 𝑗 ) − Θ(𝑋𝑡 𝑖 ) = 𝑡 𝑖 𝑡 𝑗 𝑐 𝑋𝑡 d𝑡 Time X 𝑡𝑖 𝑡𝑗 Θ c ≈ 𝑘=𝑖+1 𝑗 𝛿(𝑋𝑡 𝑘−1 , 𝑋𝑡 𝑘 )
  27. 27. Slide 27Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Evaluation  Idea: simulation of limitations of available computational resources (network bandwidth, computation time) Time 100% 𝑡𝑖 𝑡𝑖+1
  28. 28. Slide 28Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Evaluation: Single Step Update Time 100% 15% 5%40% 75% 95%60% 𝑡𝑖 𝑡𝑖+1
  29. 29. Slide 29Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Evaluation: Iterative Updates Time . . . 15% 5%40% 75% 95%60% 15% 5%40% 75% 95%60% 100% 𝑡𝑖 𝑡𝑖+1 𝑡𝑖+2
  30. 30. Slide 30Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Dataset • Dynamic Linked Data Observatory • Weekly snapshots, 14 M triples  154 snapshots (approx. 3 years)  590 data sources (PLD) Top 10 largest data sources Average size dbpedia.org 3,406,364.5 edgarwrap.ontologycentral.com 982,631.0 dbtune.org 864,107.6 dbtropes.org 787,299.9 data.linkedct.org 498,986.3 aims.fao.org 416,708.9 www.legislation.gov.uk 399,601.6 kent.zpr.fer.hr 387,034.8 identi.ca 278,316.2 webenemasuno.linkeddata.es 250,557.9
  31. 31. Slide 31Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Metrics:Precision & Recall • Precision: portion of cached data that are actually up-to-date • Recall: portion of data in the LOD cloud that is identical to the cached data Cached data Actual data on the LOD cloud (w.r.t. to the 590 sources considered)
  32. 32. Slide 32Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Results: Single Step Update Time t jti 100% 15% 5%40% 75% 95%60%
  33. 33. Slide 33Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Results: Iterative Updates Time tjti tj . . . 15% 5%40% 75% 95%60% 15% 5% 40% 75% 95%60% 100%
  34. 34. Slide 34Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Results: Iterative Updates Time tjti tj . . . 15% 5%40% 75% 95%60% 15% 5% 40% 75% 95%60% 100%
  35. 35. Slide 35Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Results: Iterative Updates Time tjti tj . . . 15% 5%40% 75% 95%60% 15% 5% 40% 75% 95%60% 100%
  36. 36. Slide 36Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Results: Summary  Best strategies: ones which capture the change behaviour over time  Specially for low relative bandwidth
  37. 37. Slide 37Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Dynamics Function Θ: Revisited Time X 𝑡𝑖 𝑡𝑗 c • Can we predict when LOD sources will change? • Notion of dynamics to compute periodicities! • Dynamics as vector of changes: < 𝛿(𝑋𝑡1 , 𝑋𝑡2 ), … , 𝛿(𝑋𝑡 𝑁−1 , 𝑋𝑡 𝑁 ) >
  38. 38. Slide 38Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Temporal Clustering of Entities • Dynamics as vector: < 𝛿(𝑋𝑡1 , 𝑋𝑡2 ), … , 𝛿(𝑋𝑡 𝑁−1 , 𝑋𝑡 𝑁 ) > Time Change(logscale) [NS15] • Clustering with k-means++ to find patterns • 165 snapshots • 65,044 entities • 7 patterns (after optimizing 𝑘)
  39. 39. Slide 39Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Periodicity of Entity Dynamics • Examples: < 0, 3, 2, 0, 3, 2, 0 >, < 1, 2, 1, 2, 1, 2 > # of entities Most likely periodicity C1 12,982 66 C2 168 23 C3 35 1 C4 12 1 C5 1 1 C6 1,541 56 C7 30 37 CS 50,725 [Elfeky et al., 2005] Mohamed G. Elfeky, Walid G. Aref, Ahmed K. Elmagarmid: Periodicity Detection in Time Series Databases. IEEE Trans. Knowl. Data Eng. 17(7): 875-887 (2005) • Convolution-based algorithm [Elfeky et al. 2005] • Entities of legislation.gov.uk found in several clusters (C1,C3,C4,C5,C6) • No changes (CS): 77.29% • CS: entities from w3.org and ontologydesignpatterns.org
  40. 40. Slide 40Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Application Areas: More than One! • Searching for LOD sources [GSK+13,KGS+12] • Strategies for updating data caches [DGS15] • Programming queries against LOD [SSS12] • Recommending LOD vocabularies [SGS16]  Foundation for Future Data-driven Applications
  41. 41. Slide 41Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Summary: KDD in Social Media & DL How to deal with the vast amount of content related to research and innovation? • H2020 INSO-4 project, duration: 04/2016-03/2019 • Data mining & visualization tools enabling information professionals to deal with large corpora • Website: http://www.moving-project.eu/ New
  42. 42. Slide 42Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Got Interested? Knowledge Discovery at ZBW Contact me! Prof. Dr. Ansgar Scherp • Email: a.scherp@zbw.eu • Twitter: https://twitter.com/ansgarscherp • Slideshare: http://de.slideshare.net/ascherp • KD-Website: http://www.zbw.eu/en/research/knowledge-discovery/ http://www.kd.informatik.uni-kiel.de/en/
  43. 43. Slide 43Prof. Ansgar Scherp – asc@informatik.uni-kiel.de References [DGS15] R. Dividino, T. Gottron, A. Scherp: Strategies for Efficiently Keeping Local Linked Open Data Caches Up-To-Date. International Semantic Web Conference (2) 2015: 356-373 [DGS+14] R. Dividino, T. Gottron, A. Scherp, G. Gröner: From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources. PROFILES@ESWC 2014 [GKS15] T. Gottron, M. Knauf, A. Scherp: Analysis of schema structures in the Linked Open Data graph based on unique subject URIs, pay-level domains, and vocabulary usage. Distributed and Parallel Databases 33(4): 515-553 (2015) [DSG+13] R. Dividino, A. Scherp, G. Gröner, T. Gottron: Change-a-LOD: Does the Schema on the Linked Data Cloud Change or Not? COLD 2013 [GSK+13] T. Gottron, A. Scherp, B. Krayer, A. Peters: LODatio: using a schema-level index to support users in finding relevant sources of linked data. K-CAP 2013: 105-108 [KGS+12] M. Konrath, T. Gottron, S. Staab, A. Scherp: SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data. J. Web Sem. 16: 52-58 (2012) [NS15] C. Nishioka, A Scherp: Temporal Patterns and Periodicity of Entity Dynamics in the Linked Open Data Cloud. K-CAP 2015. [SGS16] J. Schaible, T. Gottron, and A. Scherp: TermPicker Enabling the Reuse of Vocabulary Terms by Exploiting Data from the Linked Open Data Cloud, ESWC, Springer, 2016. [SSS12] S. Scheglmann, A. Scherp, S. Staab: Declarative Representation of Programming Access to Ontologies. ESWC 2012: 659-673
  44. 44. Slide 44Prof. Ansgar Scherp – asc@informatik.uni-kiel.de a) HTTP Header • Data from sources which have been changed since the last update should be updated first HTTP Response HEADER … Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT CONTENT
  45. 45. Slide 45Prof. Ansgar Scherp – asc@informatik.uni-kiel.de b) Age or Last Visited • Time elapsed from last update (the difference between query time and last update time) • It guarantees that every source is updated after a period Ranking Sources that have been at longer time updated Sources that have been recently updated
  46. 46. Slide 46Prof. Ansgar Scherp – asc@informatik.uni-kiel.de c) PageRank and d) Source Size • PageRank captures popularity/ importance of the LOD source • Data from sources with highest PageRank are updated first • LOD source size: data from the biggest/smallest LOD sources should be updated first Ranking Sources with higher PR Sources with lower PR
  47. 47. Slide 47Prof. Ansgar Scherp – asc@informatik.uni-kiel.de Results: Single Step Update Time t jti 100% 15% 5%40% 75% 95%60%

×