Data Science meets Linked Data
Alasdair J G Gray
http://www.alasdairjggray.co.uk
@gray_alasdair
A.J.G.Gray@hw.ac.uk
SICSA ...
What are the research and technical challenges of linked data that are relevant to data science?

This presentation introduces the ideas of linked data using the BBC sport web site as an example. It then identifies several research challenges that remain to be addressed.

Published in: Technology, Education
  • Many of you will have visited this site recently

    A lot of sport coverage: how do the BBC cope within their resources?
  • 700+ pages on teams, groups and players
    Minimal journalist involvement
    Automatic aggregation and links to relevant stories

    Article tagged with Frank Lampard; inference is used to link the team, group, etc.
  • Coverage of 10,000+ athletes, 200+ countries, 400-500 disciplines and 30 venues
    Page for every athlete and country drawing on open data
  • Internally

    DBpedia and GeoNames
  • Linked Data hugely successful since inception in 2009
    About 300 datasets published
    Wide range of topics
  • Familiar with birth, marriage and death records.

    Aligning individuals is hard

    Also applies to schema matching
  • Data’s been aligned, now what?

    Example drug: Gleevec, a cancer drug for leukemia

    Lookup in three popular public chemical databases
    Different results

    Data is messy!
  • sameAs != sameAs: it depends on your point of view

    Links relate individual data instances: source, target, predicate, reason.

    Links are grouped into Linksets, which have a VoID header providing provenance and justification for the links.

    Links need provenance to enable reuse – James’s talk
  • Each captures a subtly different view of the world

    Are they the same? … depends on your point of view

    Different URIs for different representations (content negotiation)
  • Not all data should be open

    Consider your interaction with the health service: it’s unique to you
    Need statistical aggregation to anonymise data

    As much about educating the public – Public relations
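The statistical aggregation mentioned in the privacy notes could be sketched roughly as follows. This is only an illustration under stated assumptions: the records, the field names, and the suppression threshold of 3 are hypothetical and do not come from the talk. Individual-level rows are reduced to counts per group, and groups too small to hide an individual are dropped before publication.

```python
from collections import Counter

# Hypothetical individual-level records (illustrative only, not from the talk).
records = [
    {"region": "Lothian", "condition": "asthma"},
    {"region": "Lothian", "condition": "asthma"},
    {"region": "Lothian", "condition": "asthma"},
    {"region": "Lothian", "condition": "diabetes"},
    {"region": "Lothian", "condition": "diabetes"},
    {"region": "Fife", "condition": "asthma"},
]


def aggregate_counts(rows, keys, min_count=3):
    """Count rows per combination of `keys`; drop groups smaller than `min_count`."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    # Suppress small groups, where a count could point at one identifiable person.
    return {group: n for group, n in counts.items() if n >= min_count}


# Only the aggregated, suppressed counts would be released.
published = aggregate_counts(records, keys=("region", "condition"))
print(published)
```

Small-count suppression on its own does not guarantee anonymity; it is shown here only as the simplest form of aggregation before release.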
  • Data Science meets Linked Data

    1. Data Science meets Linked Data Alasdair J G Gray http://www.alasdairjggray.co.uk @gray_alasdair A.J.G.Gray@hw.ac.uk SICSA Data Science Theme Launch 3 July 2014
    2. BBC World Cup
    3. BBC Linked Data Platform
    4. Olympics 2012
    5. Linking Data
    6. Linked Data Principles: 1. Global ID – URI 2. Resolvable ID 3. Useful content (HTML for humans, RDF for machines) 4. Link to other resources. Like the Web, but for data! “RDF and OWL do not solve the interoperability problem, they just lay it bare on the table!”
    7. Challenge 1: Matching Administrative Data Research Centre - Scotland | Alasdair J G Gray | 3 July 2014 John Grant Fisherman Fiona Sinclair Ian Grant Smithy Born: 1861 Stuart Adam Wheelwright Morag Scott Flora Adam Seamstress Born: 1866 Married: 1884 John Grant Farmer Fiona Grant Iain Grant Born: 1860 Messy data Probabilistic matches Schema matching
    8. Gleevec® = Imatinib Mesylate Drugbank ChemSpider PubChem Imatinib Mesylate Imatinib Mesylate YLMAHDNUQAMNNX-UHFFFAOYSA-N
    9. Challenge 2: Reusing mappings Link: skos:closeMatch Reason: non-salt form Link: skos:exactMatch Reason: drug name Link: owl:sameAs
    10. Challenge: Multiple Identities Andy Law's Third Law “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study” http://bioinformatics.roslin.ac.uk/lawslaws/ P12047 X31045 GB:29384 http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1642 https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL1642
    11. Challenge Open Data: Licenses 5★ of linked data Licenses: who can reuse the data; Interoperability of licenses; Non-commercial: academic use, teaching, industry
    12. Challenges: Privacy
    13. Challenge: Query Performance Response time Data freshness Reliability Volume of requests Hosting resources Queries Queries
    14. In Data we Trust How can we trust the data we’ve got back? How can we ensure that it hasn’t been tampered with on the way? Trusty URIs http://www.intelsat.com/wp-content/uploads/2014/03/Red-padlock.jpg
    15. Contact Details www.alasdairjggray.co.uk A.J.G.Gray@hw.ac.uk @gray_alasdair “There is lots of data we all use every day, and it’s not part of the web. I can see my bank statements on the web, and my photographs, and I can see my appointments in a calendar. But can I see my photos in a calendar to see what I was doing when I took them? Can I see bank statement lines in a calendar? No. Why not? Because we don’t have a web of data. Because data is controlled by applications and each application keeps it to itself.” Tim Berners-Lee
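The idea from the "Reusing mappings" slide and its notes, that each link carries a source, target, predicate, and reason, and that whether two records count as "the same" depends on your point of view, might be sketched like this. It is a minimal illustration, not the speaker's implementation: the `ex:` identifiers, the pairing of each predicate with a particular pair of databases, and the `equivalents` helper are all hypothetical; only the predicates and reasons are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    reason: str  # justification recorded with the link


# Hypothetical identifiers; predicates and reasons follow the slide.
links = [
    Link("ex:chemspider/imatinib", "ex:drugbank/imatinib-mesylate",
         "skos:closeMatch", "non-salt form"),
    Link("ex:drugbank/imatinib-mesylate", "ex:pubchem/imatinib-mesylate",
         "skos:exactMatch", "drug name"),
]


def equivalents(start, links, accepted_reasons):
    """Return identifiers reachable from `start` via links whose reason is accepted."""
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for link in links:
            if link.reason not in accepted_reasons:
                continue
            # Follow each accepted link in both directions.
            for a, b in ((link.source, link.target), (link.target, link.source)):
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


# A strict view accepts only name-based matches; a relaxed view also
# accepts links between a salt and its non-salt form.
strict = equivalents("ex:drugbank/imatinib-mesylate", links, {"drug name"})
relaxed = equivalents("ex:drugbank/imatinib-mesylate", links,
                      {"drug name", "non-salt form"})
print(sorted(strict))
print(sorted(relaxed))
```

The traversal treats every accepted link as symmetric and transitive, which is a simplification: skos:closeMatch is symmetric but not transitive, so a real system would need to decide per predicate how far to follow links.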