Biohackathon2013: Tripling Bioinformatics Productivity

A talk about RDF/SPARQL and what it means for bioinformatics. The main point is that SPARQL is a universal API to data.

Published in: Education, Technology
  • Before I start, I would like to thank everyone at the DBCLS for inviting me to speak today at the 4th Biohackathon, especially Dr KOHARA and Professor YONEZAWA.
  • This talk covers two things: UniProt.rdf and SPARQL!
  • Good data needs good people. We need to update these photos more often.
  • No matter how SPARQLy it is, quality is the primary concern. Curation is not data entry: too much expensive curation time is wasted on data entry. The aim is to summarize biological knowledge.
  • A Sisyphean curse: spend all day getting the rock up the mountain, go to bed, wake up to find the rock at the bottom of the mountain, and start over again.
  • Biologists publish more and more, and new methods generate more information.
  • 27 years of protein paper slaying.
  • In 364 days! Doubling time 15 months instead of 18 months! Information growth is faster than entry growth! 250% in 18 months instead of 200%
  • Visualization, i.e. the really hard stuff. The first productivity boon is being able to use someone else's work: stand on the shoulders of giants, etc.
  • SPARQL does not make a biologist happy. It makes you happier, so you can make the biologist happy.
  • No matter which query or storage technology you use, representing knowledge takes skill and effort.
  • James Malone at EBI: http://jamesmaloneebi.blogspot.ch/2012/09/bringing-genome-wide-associations-to.html
  • This talk covers two things: UniProt.rdf and quality!
  • Everything possible with SPARQL is possible with clay tablets. Information stays information. The only difference is the number of slaves, um, I mean PhD students, you need. Clay is more expensive than flash ;)
  • SQL was standardized first (in its relational algebra V1 form). XPath is for tree-based data; if your data is not in a tree, bad luck.
  • And Oracle and IBM like it that way!
  • SPARQL is for the most general data format, directed graphs. The simpler forms translate to graphs: a tree is a simple graph, and a table is a simple graph.
  • No matter what query language you currently use, translating from SPARQL is possible. Data storage is decoupled from querying; only the speed of some query types is affected.
  • This talk covers two things: UniProt.rdf and quality!
  • There are implicit linear graphs in every line.
  • These relations should be captured using known semantics, e.g. a simple ontology.
  • To make a fully SPARQL 1.1 read-compliant endpoint, this is all you need to do. Sure, it's empty and you don't get any results, but this is the first step.
  • Major strength not waking up to this
  • This talk covers two things: UniProt.rdf and quality!
  • SPARQL SERVICE for federated querying is a game changer.
  • Find papers citing papers discussed in a disease-annotated UniProt protein. Much cheaper than downloading all the data and loading it into our own data warehouse.
  • Data volumes keep growing, so keeping local copies is more and more difficult.
  • Remember the first web pages, which had nowhere to link to?
  • And of course there are many more private and public SPARQL endpoints outside of these organisations.
  • Biohackathon2013: Tripling Bioinformatics Productivity

    1. Tripling Bioinformatics Productivity. Jerven Bolleman, Developer, UniProtKB/Swiss-Prot
    2. © 2013 SIB. Thank you
    3. UniProt.rdf, SPARQL
    4. UniProt.rdf, SPARQL
    5. (image only)
    6. Data first
        • Biocuration
            – Recover information ‘lost’ in papers
        • curation ≠ data entry
            – Extract knowledge from data
        • Structuring knowledge
            – to integrate with related data
            – to answer further questions
    7. Biocuration
    8. Biocuration
        • And the rock gets
            – larger every day
    9. MADNESS! THIS IS Swiss-Prot!
    10. 63% more triples in a year
    11. Make data retrieval worthwhile
        • If your data is not easily accessible, then no one will query it.
        • Simple would be nice, but:
            – you cannot make it simpler than your data
            – if the biology is difficult, so is your database
        • After retrieval you must:
            – visualize
            – summarize
    12. SPARQL? Give me a better pipette
    13. Visualization is work
    14. Visualization is work
    15. (image only)
    16. www.ebi.ac.uk/fgpt/gwas/
    17. Outline: UniProt.rdf, SPARQL, CSV, SERVICE
    18. SPARQL or CLAY
    19. Progression of query languages (years standardized)
        • SQL: 1986-2011
        • XPath/XQuery: 1999-2008
        • SPARQL: 2008-2013
    20. SQL is not standardized
        • 7th ISO standard version
        • Yet...
            – SHOW TABLES
            – SELECT table_name FROM user_tables
            – LIST TABLES
        • Schemas are not fully transferable
            – VARCHAR2 or VARCHAR or CHAR or TEXT...
    21. XPath/XQuery
        • Fully standardized
            – Also in the marketplace
        • Tree-based document query model
            – Assumes all data is in one document
    22. SPARQL
        • Fully standardized
            – Also in the marketplace
        • Graph-based document query model
            – Assumes all data is reachable via the internet
            – Assumes nothing about the storage model
    23. SPARQL against
        • RDBMS
            – R2RML -> D2RQ, Ultrawrap, XSPARQL...
        • Programs
            – SADI...
        • Triplestore
            – MarkLogic, OWLIM, uRiKA, Oracle spatial or NoSQL...
        • Key-value
            – Redis
        • Bioinformatics flat file formats
            – sparql-bed
        • CSV/TSV/Spreadsheets
            – Tarql, Sparqlify
    24. Outline: UniProt.rdf, SPARQL, CSV, SERVICE
    25. SPARQL against CSV: bed file
            chr7 127471196 127472363 Pos1 0 + 127471196 127472363
            chr7 127472363 127473530 Pos2 0 + 127472363 127473530
        (diagram: the fields of the two lines drawn as a graph)
    26. SPARQL against CSV
        • SPARQL works on relations between things
            – subject (thing)
            – predicate (relation)
            – object (thing)
        • CSV is a relation between fields via headers
            chr7 127471196 127472363 Pos1 0 + 127471196 127472363
            chr7 127472363 127473530 Pos2 0 + 127472363 127473530
        Column labels: Start, End
    27. SPARQL against CSV (same bullets and bed lines as slide 26)
        Column labels: faldo:start, faldo:end, a faldo:ExactPosition
    28. SPARQL against CSV (same bullets and bed lines as slide 26)
        Column labels: ?start, ?end
    29. Sesame
            @Override
            public CloseableIteration getStatements(Resource subj, URI pred, Value obj, Resource... namedgraph)
                    throws QueryEvaluationException {
                return new EmptyIteration();
            }
    30. Big O compared to other approaches
        • If the SPARQL engine:
            – detects query is per CSV “line”
                • O(number of lines)
            – else
                • O(number of lines * number of joins)
        • Same as
            – cat | perl -ne
    31. Strengths and weaknesses
        • Strengths
            – Isolates data format from querying
            – Easy to put data on the web
                • (public SPARQL endpoints)
            – Single point of optimization
                • e.g. parallel query execution
            – Other programs can still access data
        • Weaknesses
            – Time to code SPARQL to CSV translation
            – Latency
            – Harder to hack the code to see what is going on
                • (no pipe > to temporary file)
    32. Doing this in PERL
            wget ftp://ftp.ncbi...human_9606/VCF/00-All.vcf.gz
            tabix 00-All.vcf.gz -B target_locations.bed | perl -ane '
              BEGIN { %patient = split /(\S+\n)/s, `cat target_locations.bed` }
              $alt_bases = $patient{"$F[0]\t$F[1]\t".($F[1]+length($F[3])-1)."\t"};
              chomp $alt_bases;
              print join("\t", @F[0..4], $1), "\n" if $F[4] eq $alt_bases and /MAF=(\d\.\d+)/'
    33. Doing this in SPARQL
            SELECT ?patientSnp ?dbSnp ?maf {
              ?patientSnp a ?mutationType ;
                faldo:begin ?patientBegin ;
                faldo:end ?patientEnd ;
                rdf:value ?patientValue .
              ?mutationType rdfs:subClassOf :mutation .
              SERVICE <ftp://ftp.ncbi.../human_9606/VCF/00-All.vcf.gz> {
                ?dbSnp a ?mutationType ;
                  faldo:begin ?patientBegin ;
                  faldo:end ?patientEnd ;
                  rdf:value ?patientValue ;
                  :MinorAlleleFrequency ?maf .
              }
            }
    34. At your SERVICE
    35. Federated query: citations of Nature papers cited for a disease-annotated UniProt entry
            SELECT ?doi ?citatingDoi
            WHERE {
              uniprot:P06280 up:annotation ?annotation ;
                up:citation ?citation .
              ?citation dc:identifier ?doiRaw ;
                up:name "Nature" .
              ?annotation a up:Disease_Annotation .
              BIND (substr(?doiRaw, 5) as ?doi)
              SERVICE <http://data.nature.com/sparql> {
                ?article prism:doi ?doi ;
                  nature:hasCitation ?citationCitingCitation .
                ?citationCitingCitation prism:doi ?citatingDoi
              }
            }
    36. Benefits of SERVICE
        • In a world where data keeps growing
            – upload a 1KB query = cheap
            – download a 500GB dataset = expensive
        • SPARQL viable via the web
            – 400GB of UniProt data can stay at UniProt
            – Your NGS data can stay in your data centre
        • Easiest data compression is avoiding 100 copies ;)
    37. Network of SPARQL endpoints
        • Like a social network
            – value increases the more members there are
    38. Network of SPARQL endpoints
        • Like a social network
            – value increases the more members there are
    39. 42
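
The "SPARQL against CSV" slides (25 to 28) treat every bed line as a small graph of subject, predicate and object. A minimal Python sketch of that idea follows, under stated assumptions: `bed_line_to_triples` and `match` are hypothetical helper names, the prefixed names (`faldo:begin`, `faldo:end`, `ex:chromosome`) are plain strings used as labels rather than resolved IRIs, and nothing here reflects how sparql-bed or Sesame actually implement this.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# The two bed lines shown on slide 25 (tab-separated).
BED_LINES = [
    "chr7\t127471196\t127472363\tPos1\t0\t+\t127471196\t127472363",
    "chr7\t127472363\t127473530\tPos2\t0\t+\t127472363\t127473530",
]


def bed_line_to_triples(line: str) -> Iterator[Triple]:
    """Turn one bed line into triples; the feature name is the subject."""
    chrom, begin, end, name = line.split("\t")[:4]
    subject = f"_:{name}"  # hypothetical blank-node style label
    yield (subject, "ex:chromosome", chrom)  # ex: is an assumed prefix
    yield (subject, "faldo:begin", begin)
    yield (subject, "faldo:end", end)


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching a pattern; None acts like a SPARQL variable."""
    for triple in triples:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


# One pass over the file, which is the per-line case from slide 30.
TRIPLES = [t for line in BED_LINES for t in bed_line_to_triples(line)]

# Roughly: SELECT ?feature ?begin { ?feature faldo:begin ?begin }
begins = {s: o for s, _, o in match(TRIPLES, p="faldo:begin")}
```

Because each pattern here only touches fields of a single line, the scan is linear in the number of lines, matching the first case on the Big O slide.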

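
The Perl pipeline on slide 32 and the SERVICE query on slide 33 both come down to a join: keep the reference variants whose chromosome, begin, end and allele match a patient variant, and report their minor allele frequency. The sketch below shows that join in Python as a rough illustration only. The `Variant` type, the `join_on_position` function and all coordinates, alleles and frequencies are invented for the example and do not come from dbSNP or from the talk.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    chrom: str
    begin: int
    end: int
    value: str  # the observed allele, like rdf:value in the slide 33 query


def join_on_position(
    patient: Iterable[Variant],
    reference: Iterable[Tuple[Variant, float]],
) -> List[Tuple[Variant, float]]:
    """Return (variant, maf) pairs from reference that also occur in patient."""
    # Hash the patient variants once, as the Perl BEGIN block does with %patient.
    wanted = set(patient)
    # Then stream the reference once and keep only the matches.
    return [(variant, maf) for variant, maf in reference if variant in wanted]


# Made-up toy data, for illustration only.
patient_variants = [
    Variant("chr7", 100, 100, "A"),
    Variant("chr7", 200, 200, "G"),
]
reference_variants = [
    (Variant("chr7", 100, 100, "A"), 0.25),  # same position, same allele
    (Variant("chr7", 200, 200, "T"), 0.10),  # same position, different allele
    (Variant("chr1", 5, 5, "C"), 0.40),      # not in the patient at all
]

hits = join_on_position(patient_variants, reference_variants)
```

Both inputs are read once, so the cost grows with the size of the patient list plus the size of the reference list. In the SPARQL version the same matching is expressed by reusing `?patientBegin`, `?patientEnd` and `?patientValue` inside the SERVICE block.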