Learning to assess Linked Data relationships using Genetic Programming

Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta
@IlaTiddi
20.10.2016
15th International Semantic Web Conference (ISWC 2016)

Research Problem
Automatically discover what makes a strong relationship
between two entities in (the Web of) Linked Data.
• relationship : a semantic path between two entities
[Figure: ASongOfIceAndFire(novel) and GoT connected through ASongOfIceAndFire(topic) by two dc:subject edges]

• automatically : through graph search techniques
[Figure: graph of alternative paths between ASongOfIceAndFire(novel) and GoT; intermediate nodes GeorgeRRMartin, UnitedStates, ASongOfIceAndFire(topic) and Fantasy; edge labels :author, :born, :airedIn, dc:subject]

Research Problem
Problem
• Entities/properties in a path might come from a number
of different, unknown data sources
Solution (the easy one)
• indexing & preprocessing of a portion of Linked Data
• requires a priori knowledge and computational resources

Research Problem
Solution
• Find paths between entities through Link Traversal
• Incremental and agnostic graph exploration
• Perform uninformed (or blind) search over Linked Data
[Figure: the two entities ASongOfIceAndFire(novel) and GoT, before any exploration]
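A minimal sketch of the idea, assuming a hypothetical `dereference` function that returns the links of an entity on demand. Here it is backed by a small in-memory dictionary that mirrors the running example; a real traversal would fetch RDF descriptions over HTTP. The edge directions in the toy data and the `max_depth` parameter are assumptions, not taken from the slides.

```python
from collections import deque

# Toy stand-in for the Web of Linked Data (assumption: edge directions are guesses).
TOY_LINKS = {
    "ASongOfIceAndFire(novel)": [(":author", "GeorgeRRMartin"),
                                 ("dc:subject", "ASongOfIceAndFire(topic)")],
    "GoT": [("dc:subject", "ASongOfIceAndFire(topic)"),
            (":airedIn", "UnitedStates")],
    "GeorgeRRMartin": [(":born", "UnitedStates")],
}

def dereference(entity):
    """Hypothetical lookup: (property, neighbour) pairs, followed in both directions."""
    links = list(TOY_LINKS.get(entity, []))
    for source, edges in TOY_LINKS.items():
        for prop, target in edges:
            if target == entity:
                links.append((prop, source))
    return links

def blind_search(start, goal, max_depth=4):
    """Breadth-first link traversal: nothing is indexed, nodes are discovered on the fly."""
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        entity, path = queue.popleft()
        if entity == goal:
            return path
        if (len(path) - 1) // 2 >= max_depth:  # path alternates node, property, node, ...
            continue
        for prop, neighbour in dereference(entity):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [prop, neighbour]))
    return None

path = blind_search("ASongOfIceAndFire(novel)", "GoT")
print(path)
```

Breadth-first order treats every link as equally promising, which is exactly the limitation a learnt cost-function is meant to address.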

Research Problem
Solution
[Figure sequence: starting from ASongOfIceAndFire(novel) and GoT, the traversal uncovers the graph step by step; nodes GeorgeRRMartin, ASongOfIceAndFire(topic), Fantasy and UnitedStates; edge labels :author, dc:subject, :born, :airedIn]

Research Hypothesis
Problem
Uninformed searches require a cost-function to explore the
graph following the most promising paths
Hypothesis
Linked Data information can drive a cost-function that
detects strong relationships between entities
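To illustrate why the cost-function matters, the sketch below runs a uniform-cost search over a toy graph. Uniform-cost search is a standard technique swapped in here for illustration; the slides do not state which search algorithm is used. The graph, the node names and the `degree_cost` function are all assumptions.

```python
import heapq

# Toy undirected graph (assumption); "country" plays the role of a hub.
GRAPH = {
    "novel": ["author", "topic"],
    "author": ["novel", "country"],
    "topic": ["novel", "show"],
    "country": ["author", "show", "x1", "x2", "x3"],
    "show": ["topic", "country"],
    "x1": ["country"],
    "x2": ["country"],
    "x3": ["country"],
}

def degree_cost(node):
    """Hypothetical cost: passing through a high-degree node is expensive."""
    return len(GRAPH[node])

def cheapest_path(start, goal, cost=degree_cost):
    """Expand the cheapest frontier entry first; return (total cost, path) or None."""
    frontier = [(0, start, [start])]
    settled = {}
    while frontier:
        total, node, path = heapq.heappop(frontier)
        if node == goal:
            return total, path
        if node in settled and settled[node] <= total:
            continue
        settled[node] = total
        for neighbour in GRAPH[node]:
            if neighbour not in path:
                heapq.heappush(frontier, (total + cost(neighbour), neighbour, path + [neighbour]))
    return None

total, path = cheapest_path("novel", "show")
print(total, path)
```

Swapping `degree_cost` for another function changes which path is explored first, which is the role the learnt cost-function would play.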

Research Questions
What makes a path strong?
• Which topological or semantic features of nodes/edges?
✗ e.g. length of a path?
→ entities of different datasets are connected by many paths
of similar length
How can we use Linked Data to assess strong relationships?
• Which information do we need?
• Can we use structural features of the graph?
Challenges
• find topological/semantic features to detect strong relationships
• combine these features in a cost-function
• perform an effective blind search

Proposed Approach
• A set of topological/semantic characteristics of
the Linked Data graph
• a benchmark of human-evaluated relationship
paths
Identify the cost-function for a blind search that
best performs in ranking sets of alternative
relationship paths
Automatically learn a cost-function to detect strong
relationships between Linked Data entities using a
supervised method (Genetic Programming)

Proposed Approach
Genetic Programming: why?
• Flexible learning process
• Suitable for wide search spaces (such as Linked Data)
• Results assessed with a fitness (scores vs. functions)
• Human-understandable results
• Easy to integrate in a graph search

Genetic Programming
Programs (solutions for a problem)
• trees of primitives
• functions : internal nodes (mathematical or logical
operations)
• terminals : leaf nodes (constants or variables)
Fitness function (evaluation)
• how well the program solves the problem
Genetic operations (evolution)
• reproduction
• crossover from two parents
• mutation from one parent
Termination condition
• maximum number of evolutions
• a desired fitness
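As a rough illustration of programs as trees of primitives, the sketch below encodes a program as nested tuples and evaluates it recursively. The tuple encoding, the function names and the "protected" division and logarithm (which return a default value instead of failing) are assumptions made for the example, not details from the talk.

```python
import math
import operator

# Functions (internal nodes): name -> implementation.
FUNCTIONS = {
    "add": operator.add,
    "mul": operator.mul,
    "div": lambda a, b: a / b if b != 0 else 1.0,      # protected division (assumption)
    "log": lambda a: math.log(a) if a > 0 else 0.0,    # protected log (assumption)
}

def evaluate(program, variables):
    """Evaluate a program: tuples are function nodes, strings are variables, numbers are constants."""
    if isinstance(program, tuple):
        name, *args = program
        return FUNCTIONS[name](*(evaluate(arg, variables) for arg in args))
    if isinstance(program, str):
        return variables[program]
    return program

# The tree  add(x, mul(2, y))
program = ("add", "x", ("mul", 2, "y"))
result = evaluate(program, {"x": 1.0, "y": 3.0})
print(result)
```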

Genetic Programming
Procedure
• Create random population of programs based on the primitives
• Evolve population until an ideal situation is met
[Figure: a population of pasta dishes (canned spaghetti, meatballs spaghetti, tomato sauced penne, tomato sauced spaghetti), each marked ✔ or ✗]
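The procedure can be sketched as a small evolution loop. This is a toy example under stated assumptions: it evolves arithmetic expressions towards a made-up target (x*x + x) rather than cost-functions over paths, and the population size, operator probabilities and depth limit are arbitrary choices, not the settings used in the talk.

```python
import operator
import random

FUNCS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
TERMINALS = ["x", 1.0, 2.0]
MAX_DEPTH = 6  # reject oversized children to keep trees (and values) bounded

def random_tree(depth=3):
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    return (random.choice(list(FUNCS)), random_tree(depth - 1), random_tree(depth - 1))

def depth_of(tree):
    return 1 + max(depth_of(tree[1]), depth_of(tree[2])) if isinstance(tree, tuple) else 0

def evaluate(tree, x):
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCS[name](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree

def fitness(tree):
    """Toy fitness: negative squared error against x*x + x (higher is better, 0 is perfect)."""
    return -sum((evaluate(tree, x) - (x * x + x)) ** 2 for x in range(-5, 6))

def mutate(tree):
    """Mutation from one parent: replace a random subtree with a fresh one."""
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return random_tree(2)
    name, left, right = tree
    if random.random() < 0.5:
        return (name, mutate(left), right)
    return (name, left, mutate(right))

def random_subtree(tree):
    while isinstance(tree, tuple) and random.random() < 0.7:
        tree = random.choice(tree[1:])
    return tree

def crossover(a, b):
    """Crossover from two parents: graft a random subtree of b somewhere into a."""
    if not isinstance(a, tuple) or random.random() < 0.3:
        return random_subtree(b)
    name, left, right = a
    if random.random() < 0.5:
        return (name, crossover(left, b), right)
    return (name, left, crossover(right, b))

def evolve(pop_size=50, generations=20, seed=0):
    random.seed(seed)
    population = [random_tree() for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        if history[-1] == 0:  # termination: desired fitness reached
            break
        parents = population[: pop_size // 2]
        children = [population[0]]  # reproduction: the fittest survives unchanged
        while len(children) < pop_size:
            if random.random() < 0.7:
                child = crossover(random.choice(parents), random.choice(parents))
            else:
                child = mutate(random.choice(parents))
            if depth_of(child) <= MAX_DEPTH:
                children.append(child)
        population = children
    return max(population, key=fitness), history

best, history = evolve()
print(best, history[-1])
```

Because the fittest program is always copied into the next generation, the best fitness never decreases from one generation to the next.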

Genetic Programming
Given
• a starting population of randomly generated cost-functions
• sets of alternative paths between two Linked Data entities,
ranked by humans
Determine how good each cost-function is in ranking paths
compared to the human evaluators

Genetic Programming
Primitives
Constant terminals
• Z= {0, 1000}
Aggregated terminals
• Topological edge weights
indegree, outdegree, constant weight
• Semantic edge weights
usage of namespaces, taxonomies, vocabularies
• Aggregators along the path
sum, avg, min, max
Functions (combining different information)
• Math operations
addition, multiplication, division, log
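One way to picture the aggregated terminals is as every combination of an aggregator and a per-edge weight along a path. The sketch below is an assumption about how such terminals could be computed: the `<aggregator>.<feature>` naming imitates the shorthand in the results table (for example `min.cd`), reading `in` and `ou` as indegree and outdegree is a guess, and the numbers are invented.

```python
from statistics import mean

AGGREGATORS = {"sum": sum, "avg": mean, "min": min, "max": max}

def aggregated_terminals(path_features):
    """Map per-edge feature values to terminals named '<aggregator>.<feature>'."""
    terminals = {}
    for feature, values in path_features.items():
        for agg_name, aggregate in AGGREGATORS.items():
            terminals[f"{agg_name}.{feature}"] = aggregate(values)
    return terminals

# Hypothetical weights for a path of three edges.
path_features = {"in": [4, 10, 1], "ou": [2, 3, 7]}
terminals = aggregated_terminals(path_features)
print(terminals["min.in"], terminals["max.ou"])
```

A program such as `min.in / max.ou` would then be evaluated by looking these terminals up as variables.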

Genetic Programming
Fitness
Normalised Discounted Cumulative Gain (nDCG)
• (IR) quality of rankings provided by search engines based on
the graded relevance of the returned documents
• how good a program is at ranking paths, compared with the human rankings
• avg(nDCG) across the dataset
• length penalty
Genetic operations
• Reproduction
• Crossover
• Mutation
Learning
• Training set + test set
• Keep the fittest program of each run on the training set
• Test them on the test set (discard inconsistent ones)
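The fitness can be sketched as follows, assuming the common DCG formulation rel / log2(rank + 1); the talk does not say which variant is used, and the length penalty is left out because its form is not given. The `program_score` callable, the dataset layout and the convention that a higher score ranks a path earlier are assumptions for the example.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of graded relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def fitness(program_score, dataset):
    """Average nDCG over entity pairs; dataset maps a pair to a list of (path, human grade)."""
    scores = []
    for paths in dataset.values():
        ranked = sorted(paths, key=lambda item: program_score(item[0]), reverse=True)
        scores.append(ndcg([grade for _, grade in ranked]))
    return sum(scores) / len(scores)

# Hypothetical data: grades from 2 (highly relevant) to 0 (not relevant).
dataset = {("a", "b"): [("short", 2), ("a much longer path", 0)]}
print(fitness(lambda path: -len(path), dataset))
```

A program that orders the paths exactly as the judges did reaches an nDCG of 1 for that pair.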

Experiments
Dataset
Entities (random types from different sources)
• 12,630 events from Yago
• 8,185 people from the VIAF dataset
• 999 movies from the LMDB
• 1,174 countries/capitals from Geonames / the UNESCO dataset
Paths (a set of possible paths between them)
• select a random pair
• bidirectional breadth-first search
Assessment
• 100 pairs (~10 possible paths per pair)
• 8 judges
• from (2) highly relevant to (0) not relevant
[Figure: example paths among db:Dina-Korzun, viaf:Dina-Korzun, gn:United-Kingdom, gn:Europe and lmdb:TheSkinGame; edge labels owl:sameAs, dbo:citizenship, gno:parentFeature, foaf:based_near]

Experiments
Results
Different runs (fitness on training set/test set)
(T) Topological primitives only
(S) Topological + semantic primitives
(N) Topological + namespaces primitives
Run   Best program                                Fitness (training)   Fitness (test)
T1    log(log(min.cd × min.cd))/max.cd            0.79                 0.79
T2    log(min.cd)/(avg.cd + 87)                   0.77                 0.78
T3    min.cd × (min.cd/max.cd)                    0.78                 0.72
N1    (log((max.ns/max.cd))/avg.ns) + min.ns      0.82                 0.81
N2    (min.dg/sum.cd)/sum.ou) + min.ns            0.79                 0.77
N3    min.ns/(log(max.cd)/avg.ns)                 0.83                 0.75
S1    min.ns + (sum.ns/log(log(sum.si)))          0.88                 0.83
S2    min.ns + (min.cd/log(log(sum.si)))          0.88                 0.86
S3    min.ns + (log(max.in)/log(log(sum.si)))     0.87                 0.86

Experiments
Results
Lower performance for T-runs and N-runs
Recurrent terminals
• conditional degree (node degree depending on the RDF triple)
• namespace variety
• number of topic properties (dc:subject/skos:broader/foaf:primaryTopic)

Experiments
Comparative evaluation
Best programs
• automatically learnt
vs. literature functions
• RECAP, RelFinder, Everything Is Connected Engine, Moore et al.
• ad-hoc / handcrafted information-theoretic measures

Experiments
Which cost-function?
Interpretation
• pass through nodes with rich node descriptions
higher min_namespaces = higher path score
• not high level entities / few topic categories
few incoming topic categories = higher path score
• more specific entities (not hubs) for path with few topic categories
ratio conditional_degree / inTopicCategories
→ specific paths are privileged over general paths
min_namespaces + (min_conditionalDegree / log(log(sum_inTopicCategories)))
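A worked example of the formula, assuming natural logarithms (the slides do not give the base) and invented input values. The function name `path_score` is hypothetical, and how the authors handle inputs where the nested logarithm is undefined or zero is not stated, so the sketch simply lets Python raise an error in those cases.

```python
import math

def path_score(min_namespaces, min_conditional_degree, sum_in_topic_categories):
    """min_namespaces + min_conditionalDegree / log(log(sum_inTopicCategories)).

    Raises ValueError when sum_in_topic_categories <= 1 (nested log undefined)
    and ZeroDivisionError when it equals e (denominator is zero).
    """
    denominator = math.log(math.log(sum_in_topic_categories))
    return min_namespaces + min_conditional_degree / denominator

# Two hypothetical paths that differ only in how many topic categories they touch.
specific = path_score(3, 2.0, 20)
general = path_score(3, 2.0, 200)
print(round(specific, 3), round(general, 3))
```

With everything else equal, the path with fewer incoming topic categories scores higher, in line with the interpretation that specific paths are privileged over general ones.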

Conclusions
Contributions
A measure to detect strong relationships in Linked Data
• can be integrated in uninformed searches over Linked Data
vs. indexing/pre-processing techniques
• derived empirically through Genetic Programming
vs. domain-specific / handcrafted measures
• what is important in Linked Data:
topological features + little knowledge about the edge vocabulary
Future work
• Integrate the measure in the blind-search process
• Explore more characteristics
• Improve the measure

THANK YOU VERY MUCH
(AND DO NOT MESS WITH ITALIAN FOOD)
Questions?
@IlaTiddi ilaria.tiddi@open.ac.uk
