STANFORD UNIVERSITY LIBRARIES
The Linked Data Snowball or
Why We Need Reconciliation
April 4th, 2016
THE AAC / GETTY WORKSHOP ON
RECONCILIATION OF LINKED OPEN DATA
Rob Sanderson / azaroth@stanford.edu / @azaroth42
web.stanford.edu/~azaroth/#me
azaroth42@gmail.com / +azaroth42
orcid: 0000-0003-4441-6852
http://www.informatik.uni-trier.de/~ley/pers/hd/s/Sanderson:Robert
http://academic.research.microsoft.com/Author/2765999
http://www.scopus.com/authid/detail.url?authorId=8988953600
www.researchgate.net/profile/Rob_Sanderson
facebook.com/rob.sanderson / linkedin.com/pub/robert-sanderson/1/172/5a6/
rsanderson@lanl.gov / azaroth@liv.ac.uk
public.lanl.gov/rsanderson / gondolin.hist.liv.ac.uk/~azaroth
rds23@student.canterbury.ac.nz / azaroth@es-net.co.nz
Linked Data?
1.  Use URIs as names for things
2.  Use HTTP URIs so that people can look up those names
3.  When someone looks up a URI, provide useful
information, using the standards
4.  Include links to other URIs, so they can discover
more things
5.  Link your data to other people's data to provide
context
Why So Many?
1.  Do I know the URI, or can I find it?
2.  Understand and agree with the model used?
3.  Understand and agree with the description?
4.  Agree the URI identifies the same entity?
5.  Agree description is complete?
No to any question: another URI
Yes to all: Hooray, you reused a URI!
Now start again with the next entity :(
Many Special and Unique Snowflakes
Become a Huge Snowball of Technical Debt
Option 1: Balance the Equation
Cost(Create URI) + Cost(Maintain URI)
<=
Cost(Find Good URI) + Cost(Understand Model) + Cost(Understand Content)
+ min( Risk(Reliability) + Cost(Network Latency),
       Risk(Out of Date) + Cost(Cache Content) )
- Value(Connected Graph)
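A minimal sketch of how the inequality on this slide could be evaluated, reading it as "creating and maintaining your own URI costs no more than reusing someone else's". All field names and numbers are hypothetical, unitless estimates chosen for illustration; the slide gives no actual values.

```python
from dataclasses import dataclass


@dataclass
class ReuseCosts:
    """Hypothetical, unitless estimates for reusing someone else's URI."""
    find_good_uri: float
    understand_model: float
    understand_content: float
    risk_reliability: float
    cost_network_latency: float
    risk_out_of_date: float
    cost_cache_content: float
    value_connected_graph: float

    def total(self) -> float:
        # Either dereference live (reliability risk + latency) or keep a
        # local copy (staleness risk + caching cost), whichever is cheaper.
        access = min(
            self.risk_reliability + self.cost_network_latency,
            self.risk_out_of_date + self.cost_cache_content,
        )
        return (
            self.find_good_uri
            + self.understand_model
            + self.understand_content
            + access
            - self.value_connected_graph
        )


def minting_is_cheaper(cost_create: float, cost_maintain: float, reuse: ReuseCosts) -> bool:
    """True when Cost(Create) + Cost(Maintain) <= the cost of reuse."""
    return cost_create + cost_maintain <= reuse.total()


# Invented numbers, only to show the shape of the comparison.
reuse = ReuseCosts(
    find_good_uri=5, understand_model=8, understand_content=3,
    risk_reliability=2, cost_network_latency=1,
    risk_out_of_date=1, cost_cache_content=4,
    value_connected_graph=6,
)
print(reuse.total())
print(minting_is_cheaper(4, 3, reuse))
```

With these invented numbers minting a new URI still wins; raising value_connected_graph or lowering the discovery and understanding costs is what tips the balance towards reuse.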
Option 1 Likelihood
Botticelli: http://vocab.getty.edu/ulan/500015254 :)
:(
Option 2: Reconciliation
YCBA's URIs / Princeton's URIs
YCBA's Entities / Princeton's Entities
Shared Entities but not Shared URIs
1. Algorithmically discover this intersection
given the descriptions of the entities
2. Assert that the entity which two URIs identify
is actually the same entity
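The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the workshop's method: the example.org URIs, the "name" and "born" fields, the 0.9 threshold and the matching rule (same birth year plus similar normalized name) are all invented for the example. Only the owl:sameAs property URI is real.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical descriptions; the example.org URIs and field names are invented.
ycba = {
    "http://example.org/ycba/person/1": {"name": "Sandro Botticelli", "born": 1445},
    "http://example.org/ycba/person/2": {"name": "John Constable", "born": 1776},
}
princeton = {
    "http://example.org/princeton/agent/a": {"name": "Botticelli, Sandro", "born": 1445},
    "http://example.org/princeton/agent/b": {"name": "J. M. W. Turner", "born": 1775},
}


def normalize(name: str) -> str:
    """Lowercase and turn 'Family, Given' into 'given family'."""
    if "," in name:
        family, given = name.split(",", 1)
        name = f"{given} {family}"
    return " ".join(name.lower().split())


def similarity(a: dict, b: dict) -> float:
    """Name similarity in [0, 1]; 0 if the birth years disagree."""
    if a["born"] != b["born"]:
        return 0.0
    return SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()


def reconcile(left: dict, right: dict, threshold: float = 0.9):
    """Step 1: find pairs of URIs whose descriptions look like the same entity."""
    return [
        (l_uri, r_uri)
        for l_uri, l_desc in left.items()
        for r_uri, r_desc in right.items()
        if similarity(l_desc, r_desc) >= threshold
    ]


def as_ntriples(pairs):
    """Step 2: assert that each pair of URIs identifies the same entity."""
    return [f"<{s}> <{OWL_SAME_AS}> <{o}> ." for s, o in pairs]


matches = reconcile(ycba, princeton)
for line in as_ntriples(matches):
    print(line)
```

Comparing every pair is quadratic in the number of entities, which is fine for a sketch but would need blocking or indexing on real collections.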
Option 2a: Reconciliation
(distributed authority)
Option 2b: Reconciliation
(centralized authority)
Benefits of Reconciliation
End User:
•  Has access to more information, more easily, improving research,
discovery and navigation
•  Potential for new UIs, new research questions, reasoning
Institution:
•  Efficiency (= reduced cost) and improved quality of description
•  Increased prestige when descriptions are reused
•  Usage across the network is valuable business intelligence
Community:
•  Network effects spread faster and further, increasing awareness of
cultural heritage
•  Gives easier access to other communities' data
Real Benefit of Reconciliation
Reconciliation is a damage-limiting step for the network,
towards balancing the Option 1 equation
By linking entity descriptions together:
•  the cost of discovery and understanding is reduced
•  the costs of creating and maintaining the resources are shared
across the community, not duplicated
•  the value of the connected graph is increased
•  the likelihood of new entities (requiring reconciliation) is reduced
But How Can A Machine Know??
Algorithms won't be perfect, but can be good enough.
•  What use cases will the reconciled data be used to fulfill?
•  What is the cost of a false positive for those use cases?
Precision: What % of matches are correct?
Recall: What % of the possible matches were found?
Can make trade-offs of precision vs recall for different use cases.
Machine can record its certainty, and policy can provide a threshold.
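A small sketch of the precision vs recall trade-off, assuming the matcher records a certainty score per candidate pair and a policy threshold decides which pairs to accept. The pair identifiers, scores and "true" matches are invented for illustration.

```python
def precision_recall(scored_pairs, true_matches, threshold):
    """Accept pairs scoring >= threshold, then compare with known true matches."""
    accepted = {pair for pair, score in scored_pairs.items() if score >= threshold}
    correct = accepted & true_matches
    # Precision: what share of accepted matches are correct?
    precision = len(correct) / len(accepted) if accepted else 1.0
    # Recall: what share of the possible matches were found?
    recall = len(correct) / len(true_matches) if true_matches else 1.0
    return precision, recall


# Hypothetical matcher certainties for candidate pairs (A-side id, B-side id).
scored_pairs = {
    ("a1", "b1"): 0.95,
    ("a2", "b2"): 0.80,
    ("a3", "b9"): 0.70,  # a false positive if accepted
    ("a4", "b4"): 0.55,
}
true_matches = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

for threshold in (0.9, 0.75, 0.5):
    p, r = precision_recall(scored_pairs, true_matches, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

A high threshold keeps only the safest matches (high precision, low recall); lowering it finds more of the true matches at the price of letting the false positive through.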
How Can We Improve It?
Several different relationships to express similarity:
•  owl:sameAs – always exactly the same (transitive)
•  skos:exactMatch – the same for most purposes (transitive)
•  skos:closeMatch – the same for some purposes (not transitive)
The context of a resource in the network is important
•  Starting simple with high precision gives a better context, so the
results can be used to bootstrap iteratively and incrementally
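To illustrate why transitivity matters: chains of owl:sameAs or skos:exactMatch links collapse into one cluster of equivalent URIs, while a skos:closeMatch link does not pull its target into the cluster. A minimal union-find sketch, with invented links between prefixed placeholder identifiers; treating sameAs and exactMatch alike here is a simplification for the example.

```python
from collections import defaultdict

# Relations treated as transitive (and symmetric) for clustering.
TRANSITIVE = {"owl:sameAs", "skos:exactMatch"}


def clusters(links):
    """Group URIs connected by transitive match relations (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for subject, predicate, obj in links:
        # Register both ends, but only merge them for transitive relations.
        root_s, root_o = find(subject), find(obj)
        if predicate in TRANSITIVE and root_s != root_o:
            parent[root_s] = root_o

    groups = defaultdict(set)
    for uri in parent:
        groups[find(uri)].add(uri)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical links between placeholder identifiers.
links = [
    ("ycba:1", "skos:exactMatch", "ulan:500015254"),
    ("ulan:500015254", "skos:exactMatch", "princeton:a"),
    ("princeton:a", "skos:closeMatch", "other:x"),
]
print(clusters(links))
```

Here ycba:1 and princeton:a end up in the same cluster without ever being linked directly, while other:x stays on its own.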
Trust and Community
"Efficiency (= reduced cost) and improved quality of description"
•  Efficiency comes from not duplicating descriptive effort...
•  Which requires trusting other institutions in the community
•  We need to work together, not...
Entities to Reconcile
As a community, we need to pick where to start.
Suggest starting with least controversial / most unique:
•  Physical objects
•  People
•  Places
•  Events (specific, like Exhibitions)
A small sub-domain (by time?) to make overlap more likely
Q. Can I Reconcile a String?
Named Entity Recognition: strings to things
"snowflake" = [thing]
Reconciliation: things to things
[thing] = [thing]
The Hard Question
How can we be more useful than DBpedia
for our own entities?
•  Focus on unique selling points
•  Demonstrate value early,
both internally and to the broader community
•  By working together to increase the value of the network
STANFORD UNIVERSITY LIBRARIES
Thank You!
April 4th, 2016
Rob Sanderson / azaroth@stanford.edu / @azaroth42
web.stanford.edu/~azaroth/#me
azaroth42@gmail.com / +azaroth42
orcid: 0000-0003-4441-6852
http://www.informatik.uni-trier.de/~ley/pers/hd/s/Sanderson:Robert
http://academic.research.microsoft.com/Author/2765999
http://www.scopus.com/authid/detail.url?authorId=8988953600
www.researchgate.net/profile/Rob_Sanderson
facebook.com/rob.sanderson / linkedin.com/pub/robert-sanderson/1/172/5a6/
rsanderson@lanl.gov / azaroth@liv.ac.uk
public.lanl.gov/rsanderson / gondolin.hist.liv.ac.uk/~azaroth
rds23@student.canterbury.ac.nz / azaroth@es-net.co.nz
Thank You!
rsanderson@getty.edu
April 25th, 2016
Based on my slides from Andrew W. Mellon Foundation Reconciliation Workshop
With recognition and thanks to all of the participants
