What makes a linked data pattern interesting?
Szymon Klarman
Department of Computer Science
Brunel University London
June 7, 2016
Connected Data London
#ConnectedData2016
Linked Data
• data/knowledge represented in the W3C standards OWL and RDF(S)
• flexible, unrestrictive, extensible
• machine (and human) accessible
• connected into a global Web of Data
• (open) and reusable (and when combined, great things might happen!)
• perfectly functional also in closed environments
RDF(S) = graph structure + logical inference
Example graph:
( b has participant A p )
( b type Regulation )
( p type Protein )
Literals shown in the figure: label “GRB2 regulates GAB1”; has entity id UniProt:P34723
Ontology:
( Regulation subclass of Molecular interaction )
( Molecular interaction subclass of Biological event )
( has participant A subproperty of has participant )
( has participant B subproperty of has participant )
has participant has a declared domain and the range Chemical
Inferred by RDFS reasoning:
( b has participant p )
( b type Molecular interaction )
( b type Biological event )
( p type Chemical )
Querying:
( ?z has participant ?y )
( ?z type Biological event )
( ?y type Chemical )
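The "graph structure + logical inference" idea can be sketched in a few lines of Python. This is only an illustration, not the reasoner used in the talk: the triples are plain tuples, only three RDFS-style rules (subclass, subproperty, range) are applied, and the domain of has participant is left out because the figure text does not show it. The names `rdfs_closure` and `ask` are hypothetical.

```python
# Example data and ontology, written as (subject, property, object) tuples.
DATA = {
    ("b", "has participant A", "p"),
    ("b", "type", "Regulation"),
    ("p", "type", "Protein"),
}
ONTOLOGY = {
    ("Regulation", "subclass of", "Molecular interaction"),
    ("Molecular interaction", "subclass of", "Biological event"),
    ("has participant A", "subproperty of", "has participant"),
    ("has participant B", "subproperty of", "has participant"),
    ("has participant", "range", "Chemical"),
}

def rdfs_closure(data, ontology):
    """Forward-chain three RDFS-style rules until nothing new is derived."""
    triples = set(data)
    while True:
        new = set()
        for s, p, o in triples:
            for a, rel, b in ontology:
                if rel == "subclass of" and p == "type" and o == a:
                    new.add((s, "type", b))   # instance of a is also instance of b
                elif rel == "subproperty of" and p == a:
                    new.add((s, b, o))        # a-edge is also a b-edge
                elif rel == "range" and p == a:
                    new.add((o, "type", b))   # object of an a-edge has type b
        if new <= triples:
            return triples
        triples |= new

def ask(triples):
    """Answers to: ?z has participant ?y, ?z type Biological event, ?y type Chemical."""
    return {
        (z, y)
        for z, p, y in triples
        if p == "has participant"
        and (z, "type", "Biological event") in triples
        and (y, "type", "Chemical") in triples
    }

closed = rdfs_closure(DATA, ONTOLOGY)
print(ask(DATA))    # set(): no answer on the asserted triples alone
print(ask(closed))  # {('b', 'p')}: the answer appears only after inference
```

The query mentions none of the asserted classes or properties directly, so it only has an answer once the inferred triples are added.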
Linked data mining
Emerging field: the workshop “Knowledge Discovery and Data Mining Meets Linked Open Data” has run since 2012 (+ Linked Data Mining Challenge).
Problems:
• finding novel/surprising/interesting linked data patterns
• identifying relevant semantic connections
• predicting facts/links in knowledge graphs
Most modest yet fundamental task:
What’s in that linked data set?
• The Web of Data will soon contain a lot of significant answers (42!)...
• ...so we need to know how to ask the right question...
• ...so we need to understand what’s in these data sets.
Examples are from the Big Mechanism project (http://52.26.26.74/).
So what’s in that linked data set?
Too much, too noisy...
No structure...
Ontologies on the Web of Data
Concept & property hierarchies + type assertions make up most of the Web of Data.
B. Glimm, A. Hogan, M. Krötzsch, A. Polleres: “OWL: Yet to arrive on the Web of Data?”, 2012
Typical ontologies don’t reflect the actual graph structure of data...
The actual “conceptual data model”
[Figure: conceptual data model. Concepts: Biological event, Molecular interaction, Chemical / Event, Statement, Article, Journal, Submitter. Properties: type, has participant, represents / is represented by, is extracted from, published in, has submitter.]
Linked data pattern
[Figure: example data graph. Nodes: GRB2_regulates_GAB1, statement_1, GRB2_MOUSE, GAB1_MOUSE, NaCTeM, PMC123456. Edge labels: has participant A, has participant B, represents / is represented by, has submitter, extracted from, type. Types: Regulation, Biological event, Protein, Statement, Article, Submitter.]
[Figure: the same graph shape with variables ?x, ?y, ?z, ?u, ?v, ?w as nodes. Edge labels: has participant A, has participant B, represents / is represented by, has submitter, extracted from, type. Types: Regulation, Biological event, Protein, Statement, Article, Submitter.]
Linked data pattern ≈ conjunctive query / graph query
Query is a set of triples of the form:
( ?x type Concept )
( ?x Property ?y )
Linked data mining ≈ search through the query space
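To make the "pattern ≈ conjunctive query" reading concrete, here is a minimal sketch of matching a pattern against a set of triples by backtracking. It is an illustration only: the function name `match` is hypothetical, and the way the example nodes are wired together below is an assumption based on the node and edge labels in the figure, not a copy of the actual data.

```python
def match(pattern, triples, binding=None):
    """Yield every variable binding under which all pattern triples occur in `triples`."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable must keep the value it was bound to earlier.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                # A constant must match the triple exactly.
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)

# Assumed wiring of the example graph (hypothetical, for illustration).
TRIPLES = {
    ("GRB2_regulates_GAB1", "type", "Regulation"),
    ("GRB2_regulates_GAB1", "has participant A", "GRB2_MOUSE"),
    ("GRB2_regulates_GAB1", "has participant B", "GAB1_MOUSE"),
    ("GRB2_MOUSE", "type", "Protein"),
    ("GAB1_MOUSE", "type", "Protein"),
    ("statement_1", "type", "Statement"),
    ("statement_1", "represents", "GRB2_regulates_GAB1"),
    ("statement_1", "has submitter", "NaCTeM"),
    ("NaCTeM", "type", "Submitter"),
}

PATTERN = [
    ("?z", "type", "Regulation"),
    ("?z", "has participant A", "?x"),
    ("?x", "type", "Protein"),
    ("?u", "represents", "?z"),
    ("?u", "type", "Statement"),
]

answers = list(match(PATTERN, TRIPLES))
print(len(answers))  # 1
```

Counting the bindings returned by such a matcher is one simple way to obtain the number of answers of a pattern.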
When is a linked data pattern interesting?
Two evaluation criteria:
• Frequency: the pattern has relatively many matches in the data set;
• Semantic content: the pattern carries relatively much information.
Frequency is the central criterion for the related problem of frequent
subgraph mining in the graph & multi-relational data setting.
⇢ linked data is graph data.
Semantic content criterion originates in logical/semantic theories of
information, and is used in inductive logic programming.
⇢ linked data is grounded in logic.
There is an inherent trade-off between the two criteria.
Frequency
The most frequent linked data patterns out there will always be:
X is something...
Something is somehow related to something else...
( ?x type owl:Thing )
( ?x owl:topObjectProperty ?y )
( ?y type owl:Thing )
X is an event of type...?
Semantic content
[Figure: nested concepts, from most to least specific: regulation, molecular interaction, biological event, owl:Thing]
The more possibilities you exclude, the more you say.
Semantic content
The linked data pattern with the most semantic content is the entire RDF graph...
Pattern Q1 has more semantic content than pattern Q2 (over ontology O) if Q1 (with O) logically entails Q2.
Example:
Q1 = ( ?z has participant A ?y ), ( ?z type Regulation ), ( ?y type Protein )
Q2 = ( ?z has participant ?y ), ( ?z type Biological event ), ( ?y type Chemical )
Trade-off
FREQ(Q) = #answers / #possible answers
CONT(Q) = 1 - Prob(Q is true a priori)
VALUE(Q) = weighted sum of FREQ(Q) and CONT(Q)
[Plot: Value, Freq and Cont curves; x-axis 0 to 900, y-axis 0 to 1.2]
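The trade-off can be written out as a small scoring function. This is a sketch under stated assumptions: the slide only says "weighted sum", so the single `weight` parameter and its default of 0.5 are assumptions, the prior probability is taken as an input rather than estimated, and the numbers in the usage lines are made up.

```python
def freq(n_answers, n_possible):
    """FREQ(Q) = #answers / #possible answers."""
    return n_answers / n_possible if n_possible else 0.0

def cont(prior_probability):
    """CONT(Q) = 1 - Prob(Q is true a priori)."""
    return 1.0 - prior_probability

def value(n_answers, n_possible, prior_probability, weight=0.5):
    """VALUE(Q) as a weighted sum; `weight` is an assumed parameter."""
    return weight * freq(n_answers, n_possible) + (1 - weight) * cont(prior_probability)

# Hypothetical numbers for three patterns over 1000 possible answers.
trivial = value(1000, 1000, prior_probability=1.0)    # matches everything, says nothing
specific = value(1, 1000, prior_probability=0.001)    # says a lot, matches almost nothing
balanced = value(400, 1000, prior_probability=0.3)    # in between on both criteria
print(balanced > trivial, balanced > specific)
```

With these made-up numbers the middle pattern scores highest, which is the behaviour the weighted sum is meant to reward.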
[Plot: Value, Freq and Cont curves, annotated with example patterns:]
Q1 = textual_entity(x)
Q2 = statement(x)
Q3 = event(x)
Q4 = journal_article(x), published_in(x, u), journal(u),
is_extracted_from(w, x), statement(w), contained_in(w, y),
table(y), represents(w, v), negative_regulation(v),
has_submitter(y, z), submitter(z), [...] (10 variables)
Q5 = table(x), has_submitter(x, z), submitter(z), contains_statement(x, y), statement(y), contained_in(y, x)
Q6 = positive_regulation(z), is_represented_by(z, y), statement(y), represents(y, z), contained_in(y, x),
table(x), has_submitter(x, v), submitter(v), contains_statement(x, y).
Algorithm
The space of all patterns over realistic linked data sets is virtually infinite.
But there are some good search heuristics:
• use precomputed “promising” building blocks;
• “climb up” over the most successful queries so far (but use a restart rule to avoid getting stuck locally).
[Plot: Value, Freq and Cont curves; x-axis 0 to 900, y-axis 0 to 1.2]
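The two heuristics can be sketched as a greedy search with restarts. This is not the algorithm from the talk, only a generic hill-climbing illustration: a pattern is modelled as a set of building blocks, `score` stands in for VALUE(Q), and the names `search`, `patience` and the toy scoring function are all hypothetical.

```python
import random

def search(blocks, score, steps=200, patience=5, seed=0):
    """Greedily extend the current pattern; restart after `patience` failed steps."""
    rng = random.Random(seed)
    current = frozenset([rng.choice(blocks)])
    best, best_score = current, score(current)
    stale = 0
    for _ in range(steps):
        # Try to extend the current pattern with one building block.
        candidate = current | {rng.choice(blocks)}
        if score(candidate) > score(current):
            current, stale = candidate, 0
        else:
            stale += 1
        if score(current) > best_score:
            best, best_score = current, score(current)
        if stale >= patience:
            # Restart rule: drop the current pattern and begin from a fresh block.
            current, stale = frozenset([rng.choice(blocks)]), 0
    return best, best_score

# Toy stand-in for VALUE(Q): reward "useful" blocks, penalise the others.
BLOCKS = ["a", "b", "c", "d", "e"]
USEFUL = {"a", "c", "e"}

def toy_score(pattern):
    return len(pattern & USEFUL) - 2 * len(pattern - USEFUL)

best, best_score = search(BLOCKS, toy_score)
print(sorted(best), best_score)
```

Without the restart rule, a run that starts from a poor block can never get rid of it, which is the local trap the slide refers to.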
What’s next...
The question “what’s in that linked data set?” is perhaps not the major one, but the suggested notion of interestingness might well be:
• the “frequency vs. semantic content” trade-off reflects the dual – graphical and logical – nature of the RDF(S) representation model;
• many linked data mining tasks can be described as: given Q2, find an interesting Q1 such that Q1 ⇢ Q2;
• other, more abstract criteria might also be necessary.
Linked data mining requires novel principles and foundational approaches.
