Data Science meets Linked Data
Alasdair J G Gray
http://www.alasdairjggray.co.uk
@gray_alasdair
A.J.G.Gray@hw.ac.uk
SICSA Data Science Theme Launch
3 July 2014
BBC World Cup
BBC Linked Data Platform
Olympics 2012
Linking Data
Linked Data Principles
1. Global ID – URI
2. Resolvable ID
3. Useful content
   HTML for humans
   RDF for machines
4. Link to other resources
Like the Web, but for data!
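
The four principles can be illustrated with a toy, in-memory sketch. Nothing here touches the network: the `example.org` URIs, the `DATA` store, the `memberOf` property and the `resolve` function are all assumptions made up for illustration, not part of any BBC or other real platform.

```python
# Toy illustration of the four principles; no network access is used.
# All URIs, the DATA store and resolve() are hypothetical.

PLAYER = "http://example.org/player/42"  # 1. a global ID (URI)
TEAM = "http://example.org/team/7"

# 4. the player's description links to another resource (TEAM) by its URI
DATA = {
    PLAYER: {"label": "Example Player", "memberOf": TEAM},
    TEAM: {"label": "Example Team"},
}

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
MEMBER_OF = "http://example.org/vocab/memberOf"  # made-up property


def resolve(uri: str, accept: str = "text/html") -> str:
    """2. Resolve an ID and 3. return useful content in the requested form."""
    record = DATA[uri]
    if accept == "text/turtle":
        # RDF (written as Turtle triples) for machines
        lines = [f'<{uri}> <{RDFS_LABEL}> "{record["label"]}" .']
        if "memberOf" in record:
            lines.append(f'<{uri}> <{MEMBER_OF}> <{record["memberOf"]}> .')
        return "\n".join(lines)
    # HTML for humans
    return f"<h1>{record['label']}</h1>"
```

Calling `resolve(PLAYER)` gives the human-readable page, while `resolve(PLAYER, "text/turtle")` gives triples that include the link to `TEAM`, which a client could resolve in turn.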
“RDF and OWL do not
solve the interoperability
problem, they just lay it
bare on the table!”
Challenge 1: Matching
Administrative Data Research Centre - Scotland | Alasdair J G Gray| 3 July 2014
John Grant
Fisherman
Fiona Sinclair
Ian Grant
Smithy
Born: 1861
Stuart Adam
Wheelwright
Morag Scott
Flora Adam
Seamstress
Born: 1866
Married: 1884
John Grant
Farmer
Fiona Grant
Iain Grant
Born: 1860
Messy data
Probabilistic matches
Schema matching
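
A minimal sketch of what a probabilistic match between two person records might look like, using only Python's standard library. The scoring scheme (the 0.7/0.3 weights and the `max_year_gap` tolerance) is an assumption chosen for illustration, not the method used in the project shown on the slide.

```python
from difflib import SequenceMatcher


def match_score(a: dict, b: dict, max_year_gap: int = 2) -> float:
    """Return a score in [0, 1]; higher means more likely the same person.

    The weights and year tolerance are illustrative assumptions.
    """
    # Fuzzy name similarity copes with spelling variants such as Ian / Iain.
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    # Birth years in old records are often off by a year or two.
    gap = abs(a["born"] - b["born"])
    year_sim = max(0.0, 1.0 - gap / (max_year_gap + 1))
    return 0.7 * name_sim + 0.3 * year_sim


# Records taken from the slide.
ian = {"name": "Ian Grant", "born": 1861}
iain = {"name": "Iain Grant", "born": 1860}
flora = {"name": "Flora Adam", "born": 1866}

likely = match_score(ian, iain)     # spelling variant, birth year one apart
unlikely = match_score(ian, flora)  # different name, birth year five apart
```

A real linkage pipeline would then pick a threshold on the score, and would also use parents' names and occupations as further evidence.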
Gleevec® = Imatinib Mesylate
Drugbank ChemSpider PubChem
Imatinib
Mesylate
Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
Challenge 2: Reusing mappings
Link: skos:closeMatch
Reason: non-salt form
Link: skos:exactMatch
Reason: drug name
Link: owl:sameAs
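
One way to make such mappings reusable is to store each link with its predicate and its reason, and let each consumer choose which reasons count as "the same" for their purpose. The sketch below assumes that design; the `ex:` identifiers, the pairing of sources and targets, and the `equivalents` helper are hypothetical and not taken from any real linkset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    reason: str


# Hypothetical linkset; identifiers and pairings are made up for illustration.
LINKSET = [
    Link("ex:drugbank/gleevec", "ex:chemspider/imatinib",
         "skos:closeMatch", "non-salt form"),
    Link("ex:drugbank/gleevec", "ex:pubchem/imatinib-mesylate",
         "skos:exactMatch", "drug name"),
]


def equivalents(uri: str, links: list, accepted_reasons: set) -> set:
    """Targets linked from `uri` whose link reason the caller accepts."""
    return {
        link.target
        for link in links
        if link.source == uri and link.reason in accepted_reasons
    }


# A strict view accepts only name-based matches; a looser view also
# treats the non-salt form as the same drug.
strict = equivalents("ex:drugbank/gleevec", LINKSET, {"drug name"})
loose = equivalents("ex:drugbank/gleevec", LINKSET, {"drug name", "non-salt form"})
```

The same linkset then serves both views without being republished.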
Challenge: Multiple Identities
Andy Law's Third Law
“The number of unique identifiers
assigned to an individual is never
less than the number of Institutions
involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
P12047
X31045
GB:29384
http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1642
https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL1642
Challenge Open Data: Licenses
5★ of linked data
Licenses: who can reuse the data
  • Interoperability of licenses
  • Non-commercial: academic use, teaching, industry
Challenges: Privacy
Challenge: Query Performance
Response time
Data freshness
Reliability
Volume of requests
Hosting resources
Queries Queries
In Data we Trust
How can we trust the data we’ve got back?
How can we ensure that it hasn’t been tampered with on the way?
Trusty URIs
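
The idea behind hash-carrying identifiers can be sketched as follows: the URI ends in a hash of the content, so anyone holding the URI can check that what they received has not been altered. This is a simplified illustration only. It does not follow the actual Trusty URI specification (which defines its own module codes, normalisation and encoding), and the function names and the `.` separator are assumptions.

```python
import base64
import hashlib


def content_hash(content: bytes) -> str:
    """URL-safe base64 of the SHA-256 digest, without padding."""
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def make_hashed_uri(base: str, content: bytes) -> str:
    """Append the content hash to a base URI (hypothetical '.' separator)."""
    return f"{base}.{content_hash(content)}"


def verify(uri: str, content: bytes) -> bool:
    """Check that `content` matches the hash carried in the URI."""
    # URL-safe base64 never contains '.', so the last '.' marks the hash.
    _, _, suffix = uri.rpartition(".")
    return suffix == content_hash(content)


original = b'<http://example.org/s> <http://example.org/p> "o" .'
uri = make_hashed_uri("http://example.org/data/r1", original)
```

Here `verify(uri, original)` succeeds, while any change to the bytes, however small, makes verification fail.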
http://www.intelsat.com/wp-content/uploads/2014/03/Red-padlock.jpg
Contact Details
www.alasdairjggray.co.uk
A.J.G.Gray@hw.ac.uk
@gray_alasdair
“There is lots of data we all use every day, and it’s not part of the web. I
can see my bank statements on the web, and my photographs, and I can
see my appointments in a calendar. But can I see my photos in a calendar
to see what I was doing when I took them? Can I see bank statement lines
in a calendar?
No. Why not? Because we don’t have a web of data. Because data is
controlled by applications and each application keeps it to itself.”
Tim Berners-Lee

Editor's Notes

  • #3 Many of you will have visited this site recently. Lots of sport coverage: how do the BBC cope within their resources?
  • #4 700+ pages on teams, groups and players. Minimal journalist involvement. Automatic aggregation and links to relevant stories. Article tagged with Frank Lampard; inference used to link team, group, etc.
  • #5 Coverage of 10,000+ athletes, 200+ countries, 400-500 disciplines and 30 venues. Page for every athlete and country, drawing on open data.
  • #6 Internally DBPedia and Geonames
  • #7 Linked Data hugely successful since inception in 2009. About 300 datasets published. Wide range of topics.
  • #8 Familiar with birth, marriage and death records. Aligning individuals is hard. Also applies to schema matching.
  • #9 Data’s been aligned, now what? Example drug: Gleevec, a cancer drug for leukemia. Lookup in three popular public chemical databases gives different results. Data is messy!
  • #10 sameAs != sameAs: it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into Linksets, which have a VoID header providing provenance and justification for the link. Links need provenance to enable reuse – James’s talk.
  • #11 Each captures a subtly different view of the world. Are they the same? … It depends on your point of view. Different URIs for different representations (content negotiation).
  • #13 Not all data should be open. Consider your interaction with the health service: it’s unique to you. Need statistical aggregation to anonymise data. As much about educating the public: public relations.