What makes a linked data pattern interesting?

A short talk on the problem of mining linked data (RDF) patterns, introducing a few preliminary notions towards the definition of generic linked data mining algorithms.


  1. What makes a linked data pattern interesting? Szymon Klarman, Department of Computer Science, Brunel University London. June 7, 2016, Connected Data London, #ConnectedData2016
  2. Linked Data
     • data/knowledge represented in the W3C standards OWL/RDF(S)
     • flexible, unrestrictive, extendible
     • machine (and human) accessible
     • connected into a global Web of Data
     • (open) and reusable (and when combined, great things might happen!)
     • perfectly functional also in closed environments
  3. RDF(S) = graph structure + logical inference. [Diagram: node b of type Regulation, linked by "has participant A" to node p of type Protein; further edges "has entity id" and "label" point to the literals "UniProt:P34723" and "GRB2 regulates GAB1".]
  4. RDF(S) = graph structure + logical inference. [Diagram: the data graph only: b "has participant A" p; b type Regulation; p type Protein.]
  5. RDF(S) = graph structure + logical inference. [Diagram: the data graph plus a schema: Regulation subclass of Molecular interaction subclass of Biological event; "has participant A" subproperty of "has participant"; domain and range declarations for "has participant" involving Biological event and Chemical; "has participant B" also appears.]
  6. RDF(S) = graph structure + logical inference. [Diagram: the data graph and schema, extended with the inferred triples: b "has participant" p; b type Molecular interaction and type Biological event; p type Chemical.]
  7. RDF(S) = graph structure + logical inference. Querying: ?z "has participant" ?y, with ?z type Biological event and ?y type Chemical. [Diagram: the query shown next to the data graph and schema; it matches b and p only once the inferred triples are taken into account.]
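
The inference steps in slides 3 to 7 can be sketched in a few lines of plain Python. This is a minimal illustration, not the author's implementation: the identifiers (`hasParticipantA`, `BiologicalEvent`, ...) are made-up spellings of the slide labels, the domain/range assignment is an assumption read off the flattened diagrams, and only four RDFS rules (subclass, subproperty, domain, range) are applied.

```python
# Schema triples, assumed from the slide diagrams (hypothetical identifiers).
SCHEMA = {
    ("Regulation", "subClassOf", "MolecularInteraction"),
    ("MolecularInteraction", "subClassOf", "BiologicalEvent"),
    ("hasParticipantA", "subPropertyOf", "hasParticipant"),
    ("hasParticipantB", "subPropertyOf", "hasParticipant"),
    ("hasParticipant", "domain", "BiologicalEvent"),  # assumption
    ("hasParticipant", "range", "Chemical"),          # assumption
}

# Data triples from slide 4.
DATA = {
    ("b", "hasParticipantA", "p"),
    ("b", "type", "Regulation"),
    ("p", "type", "Protein"),
}


def rdfs_closure(data, schema):
    """Apply the subclass, subproperty, domain and range rules until nothing new appears."""
    facts = set(data)
    while True:
        new = set()
        for s, p, o in facts:
            for a, rel, b in schema:
                if rel == "subClassOf" and p == "type" and o == a:
                    new.add((s, "type", b))
                elif rel == "subPropertyOf" and p == a:
                    new.add((s, b, o))
                elif rel == "domain" and p == a:
                    new.add((s, "type", b))
                elif rel == "range" and p == a:
                    new.add((o, "type", b))
        if new <= facts:
            return facts
        facts |= new


closed = rdfs_closure(DATA, SCHEMA)

# The query from slide 7: ?z hasParticipant ?y, ?z a BiologicalEvent, ?y a Chemical.
answers = sorted(
    (s, o)
    for s, p, o in closed
    if p == "hasParticipant"
    and (s, "type", "BiologicalEvent") in closed
    and (o, "type", "Chemical") in closed
)
print(answers)
```

Run against `DATA` alone the same query has no answers, because the raw graph contains neither a `hasParticipant` edge nor the types `BiologicalEvent` and `Chemical`; they only exist after inference.
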
  8. Linked data mining
     Emerging field: Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, held since 2012 (+ Linked Data Mining Challenge).
     Problems:
     • finding novel/surprising/interesting linked data patterns
     • identifying relevant semantic connections
     • predicting facts/links in knowledge graphs
     Most modest yet fundamental task: What’s in that linked data set?
     • The Web of Data will soon contain a lot of significant answers (42!)...
     • ...so we need to know how to ask the right question...
     • ...so we need to understand what’s in these data sets.
     Examples are from the Big Mechanism project (http://52.26.26.74/).
  9. So what’s in that linked data set?
  10. So what’s in that linked data set?
  11. So what’s in that linked data set? Too much, too noisy...
  12. So what’s in that linked data set?
  13. So what’s in that linked data set? No structure...
  14. Ontologies on the Web of Data. Concept & property hierarchies + type assertions make up most of the Web of Data (B. Glimm, A. Hogan, M. Krötzsch, A. Polleres: “OWL: Yet to arrive on the Web of Data?”, 2012). Typical ontologies don’t reflect the actual graph structure of data...
  15. The actual “conceptual data model”. [Diagram: nodes Biological event, Molecular interaction, Chemical / Event, Statement, Article, Journal and Submitter; edge labels: has participant, type, represents, is represented by, is extracted from, published in, has submitter.]
  16. Linked data pattern. [Diagram: an instance graph with nodes GRB2_regulates_GAB1, statement_1, GRB2_MOUSE, GAB1_MOUSE, NaCTeM and PMC123456; edge labels: has participant A, has participant B, represents, is represented by, has submitter, extracted from; type edges to Regulation, Biological event, Protein, Statement, Article and Submitter.]
  17. Linked data pattern. [Diagram: the same graph shape with the individuals replaced by the variables ?z, ?u, ?x, ?y, ?v and ?w; edge labels and types unchanged.]
  18. Linked data pattern. [Diagram: the variable graph from the previous slide.]
     • Linked data pattern ≈ conjunctive query / graph query
     • A query is a set of triples of the form ( ?x type Concept ) or ( ?x Property ?y )
     • Linked data mining ≈ search through the query space
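
The "pattern ≈ conjunctive query" idea from slide 18 can be illustrated with a small matcher over triples. This is a sketch under stated assumptions: the identifiers and edge directions below are one reading of the flattened diagram on slide 16, and the naive backtracking search is chosen for clarity, not efficiency.

```python
# A fragment of the instance graph from slide 16 (edge directions are assumed).
GRAPH = {
    ("GRB2_regulates_GAB1", "type", "Regulation"),
    ("GRB2_regulates_GAB1", "hasParticipantA", "GRB2_MOUSE"),
    ("GRB2_regulates_GAB1", "hasParticipantB", "GAB1_MOUSE"),
    ("GRB2_MOUSE", "type", "Protein"),
    ("GAB1_MOUSE", "type", "Protein"),
    ("statement_1", "type", "Statement"),
    ("statement_1", "represents", "GRB2_regulates_GAB1"),
}

# A pattern is a list of triples; terms starting with "?" are variables.
PATTERN = [
    ("?z", "type", "Regulation"),
    ("?z", "hasParticipantA", "?x"),
    ("?x", "type", "Protein"),
    ("?u", "represents", "?z"),
    ("?u", "type", "Statement"),
]


def match(pattern, triples, binding=None):
    """Yield every variable binding under which all pattern triples occur in `triples`."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its existing binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


matches = list(match(PATTERN, GRAPH))
print(len(matches), matches)
```

Counting the bindings returned by `match` gives the number of answers to a pattern, which is the quantity the frequency criterion is based on.
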
  19. When is a linked data pattern interesting?
     Two evaluation criteria:
     • Frequency: the pattern has relatively many matches in the data set;
     • Semantic content: the pattern contains relatively much information.
     Frequency is the central criterion for the related problem of frequent subgraph mining in the graph & multi-relational data setting. ⇢ linked data is graph data.
     The semantic content criterion originates in logical/semantic theories of information, and is used in inductive logic programming. ⇢ linked data is grounded in logic.
     There is an inherent trade-off between the two criteria.
  20. Frequency
     The most frequent linked data patterns out there will always be:
     • X is something... ( ?x type owl:Thing )
     • Something is somehow related to something else... ( ?x owl:topObjectProperty ?y )
     X is an event of type...?
  21. Semantic content. [Diagram: regulation, molecular interaction, biological event and owl:Thing, ordered from most to least specific.] The more possibilities you exclude, the more you say.
  22. Semantic content
     The linked data pattern with the most semantic content is the entire RDF graph...
     Pattern Q1 has more semantic content than pattern Q2 (over ontology O) if Q1 (with O) logically entails Q2.
     [Diagram: Q1 = ?z "has participant A" ?y, ?z type Regulation, ?y type Protein; Q2 = ?z "has participant" ?y, ?z type Biological event, ?y type Chemical.]
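
The entailment test on slide 22 can be approximated by treating the variables of Q1 as constants, saturating Q1 with the ontology, and checking that every triple of Q2 is then present. The sketch below is a simplification under several assumptions: the schema and identifiers are hypothetical spellings of the slide labels, only four RDFS rules are applied, and variables are compared by name, whereas a full query containment check would search for a homomorphism from Q2 into the saturated Q1.

```python
# Ontology O, assumed from the earlier slides (hypothetical identifiers).
SCHEMA = {
    ("Regulation", "subClassOf", "MolecularInteraction"),
    ("MolecularInteraction", "subClassOf", "BiologicalEvent"),
    ("hasParticipantA", "subPropertyOf", "hasParticipant"),
    ("hasParticipant", "domain", "BiologicalEvent"),  # assumption
    ("hasParticipant", "range", "Chemical"),          # assumption
}


def saturate(atoms, schema):
    """Close a set of triples under subclass, subproperty, domain and range rules."""
    facts = set(atoms)
    changed = True
    while changed:
        derived = set()
        for s, p, o in facts:
            for a, rel, b in schema:
                if rel == "subClassOf" and p == "type" and o == a:
                    derived.add((s, "type", b))
                elif rel == "subPropertyOf" and p == a:
                    derived.add((s, b, o))
                elif rel == "domain" and p == a:
                    derived.add((s, "type", b))
                elif rel == "range" and p == a:
                    derived.add((o, "type", b))
        changed = not derived <= facts
        facts |= derived
    return facts


def entails(q1, q2, schema):
    """Simplified check: every triple of q2 appears in q1 saturated with the schema."""
    return set(q2) <= saturate(q1, schema)


Q1 = {
    ("?z", "hasParticipantA", "?y"),
    ("?z", "type", "Regulation"),
    ("?y", "type", "Protein"),
}
Q2 = {
    ("?z", "hasParticipant", "?y"),
    ("?z", "type", "BiologicalEvent"),
    ("?y", "type", "Chemical"),
}

print(entails(Q1, Q2, SCHEMA), entails(Q2, Q1, SCHEMA))
```

Under these assumptions Q1 entails Q2 but not the other way round, which matches the slide's reading that Q1 carries more semantic content.
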
  23. Trade-off
     FREQ(Q) = #answers / #possible answers
     CONT(Q) = 1 - Prob(Q is true a priori)
     VALUE(Q) = weighted sum of FREQ(Q) and CONT(Q)
     [Plot: Value, Freq and Cont curves.]
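
Reading the slide as FREQ(Q) = #answers / #possible answers and CONT(Q) = 1 - Prob(Q is true a priori), the value of a pattern can be sketched as below. The slide does not give the weights or say how the prior probability is estimated, so the single `weight` parameter (with the two weights summing to 1) and the numbers in the examples are assumptions made purely for illustration.

```python
def freq(n_answers, n_possible):
    """Share of possible answers that the pattern actually matches."""
    if n_possible <= 0:
        raise ValueError("n_possible must be positive")
    return n_answers / n_possible


def cont(prior_probability):
    """Semantic content: the less likely a pattern is a priori, the more it says."""
    if not 0.0 <= prior_probability <= 1.0:
        raise ValueError("prior_probability must lie in [0, 1]")
    return 1.0 - prior_probability


def value(n_answers, n_possible, prior_probability, weight=0.5):
    """Weighted sum of frequency and content; `weight` is a hypothetical parameter."""
    return weight * freq(n_answers, n_possible) + (1.0 - weight) * cont(prior_probability)


# Illustrative numbers only, not taken from the talk.
trivial = value(100, 100, 1.0)     # matches everything, says nothing
overfit = value(0, 100, 0.0)       # says a lot, matches nothing
balanced = value(50, 100, 0.25)    # a middle ground
print(trivial, overfit, balanced)
```

With equal weights, both extremes score lower than the balanced pattern, which is the trade-off the plot on the slide depicts.
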
  24. Trade-off. [Plot: Value, Freq and Cont curves, with the example patterns below.]
     • Q1 = textual_entity(x)
     • Q2 = statement(x)
     • Q3 = event(x)
     • Q4 = journal_article(x), published_in(x, u), journal(u), is_extracted_from(w, x), statement(w), contained_in(w, y), table(y), represents(w, v), negative_regulation(v), has_submitter(y, z), submitter(z), [...] (10 variables)
     • Q5 = table(x), has_submitter(x, z), submitter(z), contains_statement(x, y), statement(y), contained_in(y, x)
     • Q6 = positive_regulation(z), is_represented_by(z, y), statement(y), represents(y, z), contained_in(y, x), table(x), has_submitter(x, v), submitter(v), contains_statement(x, y)
  25. Algorithm
     The space of all patterns over realistic linked data sets is virtually infinite. But there are some good search heuristics:
     • use precomputed “promising” building blocks;
     • “climb up” over the most successful queries so far (but use a restart rule to avoid getting stuck locally).
     [Plot: Value, Freq and Cont curves.]
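
The two heuristics on slide 25 resemble hill climbing with random restarts. The sketch below is one possible reading, not the algorithm from the talk: patterns are modelled as sets of building blocks, a neighbour is a pattern with one more block, and the caller supplies a `score` function standing in for VALUE(Q). The names `hill_climb`, `toy_score` and the block labels are hypothetical.

```python
import random


def hill_climb(blocks, score, restarts=5, max_steps=20, seed=0):
    """Greedy search over sets of building blocks with random restarts."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(restarts):
        # Restart rule: begin each run from a randomly chosen single block.
        current = frozenset([rng.choice(blocks)])
        current_score = score(current)
        for _ in range(max_steps):
            neighbours = [current | {b} for b in blocks if b not in current]
            if not neighbours:
                break
            candidate = max(neighbours, key=score)
            candidate_score = score(candidate)
            if candidate_score <= current_score:
                break  # local optimum: stop this run and restart
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score


# Toy scoring function: reward blocks from a hidden target set, penalise the rest.
TARGET = {"a", "b", "c"}


def toy_score(pattern):
    return len(pattern & TARGET) - len(pattern - TARGET)


best, best_score = hill_climb(["a", "b", "c", "d", "e"], toy_score)
print(sorted(best), best_score)
```

In a real setting the blocks would be precomputed query fragments and `score` would evaluate each candidate against the data, which is far more expensive than this toy function; caching scores would then matter.
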
  26. What’s next...
     The question “what’s in that linked data set?” is perhaps not the major one, but the suggested notion of interestingness might well be:
     • the “frequency vs. semantic content” trade-off reflects the dual (graphical and logical) nature of the RDF(S) representation model;
     • many of the linked data mining tasks can be described as: given Q2, find an interesting Q1 such that Q1 ⇢ Q2;
     • other, more abstract criteria might also be necessary.
     Linked data mining requires novel principles and foundational approaches.
