Linked Data Snowball, or Why We Need Reconciliation
1. STANFORD UNIVERSITY LIBRARIES
The Linked Data Snowball or
Why We Need Reconciliation
April 4th, 2016
THE AAC / GETTY WORKSHOP ON
RECONCILIATION OF LINKED OPEN DATA
Rob Sanderson / azaroth@stanford.edu / @azaroth42
3. STANFORD UNIVERSITY LIBRARIES
The Linked Data Snowball or
Why We Need Reconciliation
April 4th, 2016
THE AAC / GETTY WORKSHOP ON
RECONCILIATION OF LINKED OPEN DATA
Rob Sanderson / azaroth@stanford.edu / @azaroth42
web.stanford.edu/~azaroth/#me
azaroth42@gmail.com / +azaroth42
orcid: 0000-0003-4441-6852
http://www.informatik.uni-trier.de/~ley/pers/hd/s/Sanderson:Robert
http://academic.research.microsoft.com/Author/2765999
http://www.scopus.com/authid/detail.url?authorId=8988953600
www.researchgate.net/profile/Rob_Sanderson
facebook.com/rob.sanderson / linkedin.com/pub/robert-sanderson/1/172/5a6/
rsanderson@lanl.gov / azaroth@liv.ac.uk
public.lanl.gov/rsanderson / gondolin.hist.liv.ac.uk/~azaroth
rds23@student.canterbury.ac.nz / azaroth@es-net.co.nz
5. Linked Data?
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards
4. Include links to other URIs, so they can discover more things
5. Link your data to other people's data to provide context
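As a rough illustration of the five principles, here is a minimal sketch that simulates the web with an in-memory dictionary instead of real HTTP requests. Every URI, property name, and label in it is invented for the example, and `dereference` is only a stand-in for an HTTP lookup.

```python
# Toy in-memory "web": each HTTP URI names a thing and dereferences to a
# description of it (principles 1-3). All URIs and keys are hypothetical.
WEB = {
    "http://example.org/museum/object/1": {
        "label": "Winter Landscape",
        # Principle 4: link to other URIs so more things can be discovered.
        "creator": "http://example.org/museum/person/7",
    },
    "http://example.org/museum/person/7": {
        "label": "Jane Example",
        # Principle 5: link to someone else's data to provide context.
        "same_as": "http://example.com/authority/p42",
    },
    "http://example.com/authority/p42": {
        "label": "Example, Jane",
    },
}


def dereference(uri):
    """Stand-in for an HTTP GET that returns a description of `uri`."""
    return WEB.get(uri, {})


def discover(start_uri):
    """Follow links outward from start_uri; return URIs in the order found."""
    seen, queue = [], [start_uri]
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.append(uri)
        for value in dereference(uri).values():
            # Treat any value that looks like an HTTP URI as a link to follow.
            if isinstance(value, str) and value.startswith("http"):
                queue.append(value)
    return seen
```

Starting from the object's URI, `discover` reaches the person and then the external authority record, which is only possible because each description links onward.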
13. Why So Many?
Ask each question in turn. Yes: go on to the next question. No: mint a new URI.
1. Do I know the URI, or can I find it?
2. Understand and agree with the model used?
3. Understand and agree with the description?
4. Agree the URI identifies the same entity?
5. Agree description is complete?
Yes to all five: Hooray, you reused a URI!
Now start again with the next entity :(
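The decision cascade on this slide can be sketched as a short function. This reads the flowchart as "any No leads to a new URI", which is an interpretation of the diagram; the function names and the tuple return format are assumptions made for the example.

```python
# The five questions from the slide, in order.
QUESTIONS = [
    "Do I know the URI, or can I find it?",
    "Understand and agree with the model used?",
    "Understand and agree with the description?",
    "Agree the URI identifies the same entity?",
    "Agree description is complete?",
]


def reuse_or_mint(answers):
    """Return ("reuse", None) if every answer is True,
    otherwise ("mint", <first question answered No>)."""
    if len(answers) != len(QUESTIONS):
        raise ValueError("expected one answer per question")
    for question, answer in zip(QUESTIONS, answers):
        if not answer:
            return ("mint", question)
    return ("reuse", None)


def count_new_uris(entities):
    """Count how many entities end up with a newly minted URI.

    `entities` is a list of answer lists, one per entity described."""
    return sum(1 for answers in entities if reuse_or_mint(answers)[0] == "mint")
```

Because a single No anywhere in the chain is enough to mint a new URI, and the chain is repeated for every entity, the number of URIs grows quickly unless reuse is made easy.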
28. Benefits of Reconciliation
End User:
• Has access to more information, more easily, improving research,
discovery and navigation
• Potential for new UIs, new research questions, reasoning
Institution:
• Efficiency (= reduced cost) and improved quality of description
• Increased prestige when descriptions are reused
• Usage across the network is valuable business intelligence
Community:
• Network effects spread faster and further, increasing awareness of
cultural heritage
• Gives easier access to other communities' data
29. Real Benefit of Reconciliation
Reconciliation is a damage-limiting step for the network,
towards balancing Equation 1
By linking entity descriptions together:
• the cost of discovery and understanding is reduced
• the costs of creating and maintaining the resources are shared
across the community, not duplicated
• the value of the connected graph is increased
• the likelihood of new entities (requiring reconciliation) is reduced
30. But How Can A Machine Know??
Algorithms won't be perfect, but can be good enough.
• What use cases will the reconciled data be used to fulfill?
• What is the cost of a false positive for those use cases?
Precision: What % of matches are correct?
Recall: What % of the possible matches were found?
Can make trade-offs of precision vs recall for different use cases.
Machine can record its certainty, and policy can provide a threshold.
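A minimal sketch of the precision/recall trade-off, assuming the machine records a confidence score between 0 and 1 for each candidate match and that a hand-checked set of true matches is available for evaluation. The data structures and names are illustrative, not taken from any particular tool.

```python
def precision_recall(candidates, true_matches, threshold):
    """Evaluate the matches accepted at a given policy threshold.

    candidates:   dict mapping (uri_a, uri_b) -> confidence score in [0, 1]
    true_matches: set of (uri_a, uri_b) pairs known to be correct
    Returns (precision, recall).
    """
    accepted = {pair for pair, score in candidates.items() if score >= threshold}
    correct = accepted & true_matches
    # Convention assumed here: with nothing accepted (or nothing to find),
    # report 1.0 rather than dividing by zero.
    precision = len(correct) / len(accepted) if accepted else 1.0
    recall = len(correct) / len(true_matches) if true_matches else 1.0
    return precision, recall


# Hypothetical candidate matches with machine-recorded certainty.
candidates = {
    ("a1", "b1"): 0.95,
    ("a2", "b2"): 0.80,
    ("a3", "b9"): 0.60,  # a false positive if accepted
    ("a4", "b4"): 0.40,
}
true_matches = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

for threshold in (0.9, 0.5, 0.3):
    p, r = precision_recall(candidates, true_matches, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

With this toy data a strict threshold of 0.9 accepts only correct matches but finds one of three, while a threshold of 0.3 finds all three at the price of one false positive. Which end to favor depends on the cost of a false positive for the use case.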
31. How Can We Improve It?
Several different relationships to express similarity:
• owl:sameAs – always exactly the same (transitive)
• skos:exactMatch – the same for most purposes (transitive)
• skos:closeMatch – the same for some purposes (not transitive)
The context of a resource in the network is important
• Starting simple with high precision gives a better context, so the
results can be used to bootstrap iteratively and incrementally
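To show why transitivity matters, the sketch below merges URIs into clusters using only the transitive relationships and leaves skos:closeMatch links unmerged. Treating owl:sameAs and skos:exactMatch as one kind of merge is a simplification for the example, and the short identifiers stand in for full URIs.

```python
from collections import defaultdict

# Relationships that are safe to chain: if A matches B and B matches C,
# then A matches C. skos:closeMatch is deliberately left out.
TRANSITIVE = {"owl:sameAs", "skos:exactMatch"}


def merge_clusters(links):
    """Group URIs connected by transitive links.

    links: iterable of (uri_a, predicate, uri_b) triples.
    Returns a list of sets, one set per cluster of equivalent URIs.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, predicate, b in links:
        root_a, root_b = find(a), find(b)  # registers both URIs
        if predicate in TRANSITIVE:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    return list(clusters.values())


links = [
    ("a", "owl:sameAs", "b"),
    ("b", "skos:exactMatch", "c"),
    ("c", "skos:closeMatch", "d"),  # not chained: d stays on its own
]
```

Here `merge_clusters(links)` groups a, b and c together and keeps d separate, so a single loose closeMatch cannot pull unrelated descriptions into the same cluster.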
32. Trust and Community
"Efficiency (= reduced cost) and improved quality of description"
• Efficiency comes from not duplicating descriptive effort...
• Which requires trusting other institutions in the community
• We need to work together, not...
34. Entities to Reconcile
As a community, we need to pick where to start.
Suggest starting with least controversial / most unique:
• Physical objects
• People
• Places
• Events (specific, like Exhibitions)
A small sub-domain (by time?) to make overlap more likely
35. Q. Can I Reconcile a String?
Named Entity Recognition: strings to things
"snowflake" = [image]
Reconciliation: things to things
[image] = [image]
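The distinction can be sketched in code: named entity recognition maps a string to a URI, while reconciliation compares two URIs by their descriptions. The label index, the descriptions, and the matching rule (same label and same type) are all invented for this example and far cruder than a real matcher.

```python
# Hypothetical lookup table used for the strings-to-things step.
LABEL_INDEX = {"snowflake": "http://example.org/thing/snowflake"}

# Hypothetical descriptions used for the things-to-things step.
DESCRIPTIONS = {
    "http://example.org/thing/snowflake": {"label": "snowflake", "type": "IceCrystal"},
    "http://example.com/vocab/123": {"label": "Snowflake", "type": "IceCrystal"},
    "http://example.com/vocab/456": {"label": "Snowflake", "type": "Band"},
}


def string_to_thing(text):
    """Named entity recognition, reduced to a label lookup: string -> URI."""
    return LABEL_INDEX.get(text.strip().lower())


def things_match(uri_a, uri_b):
    """Reconciliation, reduced to a toy rule: URI <-> URI.

    Two things match here only if their labels and types both agree."""
    a, b = DESCRIPTIONS[uri_a], DESCRIPTIONS[uri_b]
    return a["label"].lower() == b["label"].lower() and a["type"] == b["type"]
```

The two vocabulary entries share the label "Snowflake", yet only one of them matches the ice crystal, which is why reconciliation has to work on described things and not on bare strings.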
37. The Hard Question
How can we be more useful than DBpedia
for our own entities?
• Focus on unique selling points
• Demonstrate value early,
both internally and to the broader community
• By working together to increase the value of the network