
Linked Data for improved organization of research data

Slides for a talk at a Farmbio BioScience Seminar on May 18, 2018, at http://farmbio.uu.se, introducing Linked Data as a way to manage research data that keeps better track of provenance, makes semantics more explicit, and makes the data easier to integrate with other data and to consume, by both humans and machines.

Published in: Science

  1. 1. Linked Data for improved organization of research data Farmbio BioScience Seminar May 18, 2018 Samuel Lampa @smllmp PhD Student in Pharm. Bioinformatics @ pharmb.io / farmbio.uu.se
  2. 2. Research interests ● Large datasets ● Automation ● Scientific workflows ● Machine Learning ● Semantic data ● Reasoning ● Query systems ● Something user friendly ● And hopefully usable ● “Answer all the (computational) research questions”
  3. 3. What’s the problem?
  4. 4. What’s the problem? ● Data in different formats ● Different data schemas ● Losing track of what data means (meaning available only in context)
  5. 5. A database to the rescue?
  6. 6. Database to the rescue? ● Same problems with losing data identity on export ● So, put all data in the same database? ● One database can’t fit all the world’s data! ● What to do?
  7. 7. What to do? What if all data could be: ● Easy to share ● Self-describing ● Stored in the same (underlying) format ● Easy to integrate with other data (In other words, FAIR: Findable, Accessible, Interoperable, Reusable)
  8. 8. Linked Data!
  9. 9. Linked data: basic ideas ● Use URIs (“https://”) to identify things ● Make the URIs dereferenceable links (so one can visit them to find relevant data) ● Refer to other data using their links
  10. 10. What about the linking? Triple model*: Subject (URI), Predicate (URI), Object (URI or literal value). Example, with the prefix ex: standing for http://example.org/myontology/
        ex:Sweden ex:hasPopulation 9000000
        ex:Sweden ex:hasCapital ex:Stockholm
      * For more info, see “RDF: Resource Description Framework”
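The triple model on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: triples are plain tuples, URIs are plain strings, and the helper name `match` is hypothetical, not part of any RDF library.

```python
# Namespace from the slide's example.
EX = "http://example.org/myontology/"

# Each triple is (subject, predicate, object); the object is a URI or a literal.
triples = {
    (EX + "Sweden", EX + "hasPopulation", 9000000),
    (EX + "Sweden", EX + "hasCapital", EX + "Stockholm"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Look up the capital of Sweden by fixing subject and predicate.
capital = match(triples, s=EX + "Sweden", p=EX + "hasCapital")[0][2]
print(capital)
```

Because the object of the second triple is itself a URI, it can in turn appear as the subject of further triples, which is what makes the data "linked".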
  11. 11. Web links vs. Data links
  12. 12. Example data. Willighagen EL, Alvarsson J, Andersson A, Eklund M, Lampa S, Lapins M, Spjuth O, Wikberg JES. Linking the Resource Description Framework to cheminformatics and proteochemometrics. J Biomed Semantics. 2011;2(Suppl 1):S6. doi:10.1186/2041-1480-2-S1-S6. Lampa S. SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking. 2010. bit.ly/mscreport
  13. 13. Example data (continued):
        <http://[...]/nmrshiftdb/?moleculeId=234>
            dc:title "warburganal" ;
            chem:casnumber "62994-47-2" ;
            nmr:moleculeId "234" ;
            nmr:hasSpectrum <http://[...]/nmrshiftdb/?spectrumId=4735> .
        <http://[...]/nmrshiftdb/?spectrumId=4735>
            nmr:field "50" ;
            nmr:hasPeak <http://[...]/nmrshiftdb/?s4735p0>,
                        <http://[...]/nmrshiftdb/?s4735p1>,
                        <http://[...]/nmrshiftdb/?s4735p2>,
                        <http://[...]/nmrshiftdb/?s4735p3> ;
            nmr:solvent "Chloroform-D1 (CDCl3)" ;
            nmr:spectrumId "4735" ;
            nmr:spectrumType "13C" ;
            nmr:temperature "298" .
        <http://[...]/nmrshiftdb/?s4735p1> nmr:hasShift 18.3 ; a nmr:peak .
        <http://[...]/nmrshiftdb/?s4735p2> nmr:hasShift 22.6 ; a nmr:peak .
        <http://[...]/nmrshiftdb/?s4735p3> nmr:hasShift 26.5 ; a nmr:peak .
  14. 14. Powerful querying with SPARQL
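To give a feel for what a SPARQL query does, here is a rough Python sketch that evaluates a basic graph pattern over the NMR example data by joining triple patterns on shared variables. It is a hand-rolled stand-in, not a SPARQL engine. The short identifiers (`mol:234`, `spec:4735`, `peak:...`) are hypothetical placeholders for the elided URIs in the example data, and `query` and `_unify` are illustrative names.

```python
def _unify(term, value, binding):
    """Match one pattern term against a value, extending the binding in place."""
    if isinstance(term, str) and term.startswith("?"):
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value


def query(triples, patterns):
    """Evaluate a list of (s, p, o) patterns; strings starting with '?' are variables."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                if all(_unify(t, v, candidate) for t, v in zip(pattern, triple)):
                    next_solutions.append(candidate)
        solutions = next_solutions
    return solutions


# Placeholder identifiers stand in for the elided URIs of the example data.
triples = [
    ("mol:234", "dc:title", "warburganal"),
    ("mol:234", "nmr:hasSpectrum", "spec:4735"),
    ("spec:4735", "nmr:hasPeak", "peak:s4735p0"),
    ("spec:4735", "nmr:hasPeak", "peak:s4735p1"),
    ("spec:4735", "nmr:hasPeak", "peak:s4735p2"),
    ("spec:4735", "nmr:hasPeak", "peak:s4735p3"),
    ("peak:s4735p1", "nmr:hasShift", 18.3),
    ("peak:s4735p2", "nmr:hasShift", 22.6),
    ("peak:s4735p3", "nmr:hasShift", 26.5),
]

# Roughly corresponds to this SPARQL (prefix declarations omitted):
#   SELECT ?shift WHERE {
#     ?mol dc:title "warburganal" ; nmr:hasSpectrum ?spec .
#     ?spec nmr:hasPeak ?peak .
#     ?peak nmr:hasShift ?shift .
#   }
patterns = [
    ("?mol", "dc:title", "warburganal"),
    ("?mol", "nmr:hasSpectrum", "?spec"),
    ("?spec", "nmr:hasPeak", "?peak"),
    ("?peak", "nmr:hasShift", "?shift"),
]

shifts = sorted(b["?shift"] for b in query(triples, patterns))
print(shifts)
```

The peak without a recorded shift drops out of the result because the last pattern finds no matching triple for it, which is the same behaviour a join in SPARQL would give.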
  15. 15. What to do? Linked Data! What if all data could be: ● Easy to share: yes, RDF is a web-based format ● Self-describing: yes, links in the data describe the data ● Stored in the same (underlying) format: yes, RDF triples ● Easy to integrate with other data: yes, just create links (In other words, FAIR: Findable, Accessible, Interoperable, Reusable)
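The "just create links" point can be illustrated with a small sketch: two datasets produced independently are merged by plain set union, and a query can then follow the shared URI from one to the other. All URIs, values, and the helper `objects` below are made up for illustration.

```python
VOCAB = "http://example.org/vocab/"          # hypothetical vocabulary
COMPOUND = "http://example.org/compound/42"  # hypothetical shared identifier
ASSAY = "http://example.org/assay/7"         # hypothetical identifier

# Dataset from one group: describes the compound.
lab_a = {
    (COMPOUND, VOCAB + "hasName", "warburganal"),
}

# Dataset from another group: refers to the same compound by its URI.
lab_b = {
    (ASSAY, VOCAB + "testedCompound", COMPOUND),
}

# Integration is a set union; no schema mapping step is needed.
merged = lab_a | lab_b


def objects(triples, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


# Follow the link: assay -> compound -> name.
names = set()
for compound in objects(merged, ASSAY, VOCAB + "testedCompound"):
    names |= objects(merged, compound, VOCAB + "hasName")
print(names)
```

The merge works only because both datasets use the same URI for the compound; agreeing on identifiers is the real integration effort.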
  16. 16. But how to actually use this in practice?
  17. 17. What we did (1/3): SWI-Prolog for querying, integrated into Bioclipse. Pros / Cons: + Powerful querying + Easy to integrate into other software => Powerful interactive environment + Excellent performance - No support for really large datasets (exceeding RAM size). Willighagen EL, Alvarsson J, Andersson A, Eklund M, Lampa S, Lapins M, Spjuth O, Wikberg JES. Linking the Resource Description Framework to cheminformatics and proteochemometrics. J Biomed Semantics. 2011;2(Suppl 1):S6. doi:10.1186/2041-1480-2-S1-S6. Lampa S. SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking. 2010. bit.ly/mscreport
  18. 18. What we did (2/3): Semantic MediaWiki as a collaborative and interactive platform for playing around with data, summarizing and visualizing it using SMW’s Ask query language. Pros / Cons: + Collaboration supported + Versioned data storage + UI generation included in SMW - Performance concerns - Lack of expressiveness and power in the SMW “Ask” query language. Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. J Biomed Semantics. 2017;8(35):1-13. doi:10.1186/s13326-017-0136-y.
  19. 19. What we did (2/3): Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. J Biomed Semantics. 2017;8(35):1-13. doi:10.1186/s13326-017-0136-y.
  20. 20. What we did (2/3): Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. J Biomed Semantics. 2017;8(35):1-13. doi:10.1186/s13326-017-0136-y.
  21. 21. What we did (3/3): urisolve ● A simple web server to resolve, or “dereference”, URIs ● Returns any data / triples for the URI in question ● Based on data in a triplestore (semantic database) or an RDF-HDT file (compressed, indexed file format) ● Source code: github.com/pharmbio/urisolve. Lapins M, Arvidsson S, Lampa S, Berg A, Schaal W, Alvarsson J, Spjuth O. A confidence predictor for logD using conformal regression and a support-vector machine. J Cheminform. 2018;10(1):17. doi:10.1186/s13321-018-0271-1
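The core idea of dereferencing, returning the triples known about a URI, can be sketched as follows. This is not urisolve's actual implementation: the in-memory `store` stands in for a triplestore or RDF-HDT file, the function names are hypothetical, and the serialization is a simplified N-Triples-like form that treats any string starting with `http://` or `https://` as a URI and everything else as a plain literal without a datatype.

```python
EX = "http://example.org/myontology/"  # namespace from the earlier slide example

# In-memory stand-in for a triplestore or RDF-HDT file.
store = {
    (EX + "Sweden", EX + "hasCapital", EX + "Stockholm"),
    (EX + "Sweden", EX + "hasPopulation", 9000000),
    (EX + "Stockholm", EX + "hasName", "Stockholm"),
}


def to_term(value):
    """Serialize a value as a URI reference or a plain quoted literal (simplified)."""
    if isinstance(value, str) and value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def dereference(store, uri):
    """Return all triples with `uri` as subject, one N-Triples-like line each."""
    lines = [
        f"<{s}> <{p}> {to_term(o)} ."
        for (s, p, o) in sorted(store, key=str)
        if s == uri
    ]
    return "\n".join(lines)


result = dereference(store, EX + "Sweden")
print(result)
```

A web server would call something like `dereference` with the requested URL and return the result as the response body, so that visiting a URI yields the data about it.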
  22. 22. Conclusions ● Linked Data makes data self-describing ● It is extremely flexible to work with ● Lowers the barriers to data entry
  23. 23. Vision: A central workbench for Linked Data. SWISH: SWI-Prolog Notebook: swish.swi-prolog.org … to access all data sources, and “answer all the (computational) research questions”
  24. 24. Thank you Samuel Lampa @smllmp PhD Student in Pharm. Bioinformatics @ pharmb.io / farmbio.uu.se
