2013 01-14 ops-dataset_descriptions

Alice: "What version of ChEMBL are we using?"

Bob: "Er…let me check. It's going to take a while, I'll get back to you."

This simple question took us the best part of a month to resolve and involved several individuals. Knowing the provenance of your data is essential, especially when using large complex systems that process multiple datasets.

The underlying issues of this simple question motivated us to improve the provenance data in the Open PHACTS project. We developed a guideline for dataset descriptions where the metadata is carried with the data. In this talk I will highlight the challenges we faced and give an overview of our metadata guidelines.

Presentation given to the W3C Semantic Web for Health Care and Life Sciences Interest Group on 14 January 2013.

  • This is what convinced us that metadata needs to be carried in the data files
  • Specifies VoID and PAV predicates; MIM checklist
  • Open PHACTS: 28 partners (9 pharmaceutical companies, 3 biotechs, 1 triplestore firm, 15 academic)

    1. Dataset Descriptions in Open PHACTS. Alasdair J G Gray, University of Manchester. W3C HCLS Call – 14 January 2013. www.openphacts.org/specs/datadesc/ Authors: Christian Y. A. Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J. G. Gray, Andra Waagmeester and Egon L. Willighagen
    2. Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing. [Diagram labels: Literature, Genbank, Patents, PubChem, Databases, Downloads, Data Integration, Data Analysis, Firewalled Databases; repeated at each company] Why?
    3. The Project. The Innovative Medicines Initiative • EC funded public-private partnership for pharmaceutical research • Focus on key problems – Efficacy, Safety, Education & Training, Knowledge Management. The Open PHACTS Project • Create a semantic integration hub (“Open Pharmacological Space”)… • Delivering services to support on-going drug discovery programs in pharma and public domain • Not just another project; leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements • 13 academic partners, 9 pharmaceutical companies, 6 SMEs • Work split into clusters: – Technical Build (focus here) – Scientific Drive – Community & Sustainability
    4. Architecture. [Diagram labels: User Interfaces & Applications; Linked Data API; Linked Data Cache; Identity Mapping Service; Identity Resolution Service; Domain Specific Data Services]
    5. Datasets and Links
    6. ChemSpider • ChemSpider aggregates data from over 400 sources • Central integration point for chemicals in OPS • OPS data covers – ChEBI – ChEMBL – DrugBank (slide footer: 14 January 2013, OPS Dataset Descriptions – A. J. G. Gray)
    7. What version of ChEMBL? (~Jan 2012) • ChemSpider: EBI SDF file – ChEMBL 13 • Data Cache: Chem2Bio2RDF ChEMBL RDF – File downloaded May 2011 – Chem2Bio2RDF metadata webpages: ChEMBL 8 – File: ChEMBL 2 • Mapping Server: Kasabi ChEMBL RDF file – ChEMBL 12
    8. For the record • OPS currently uses ChEMBL 13 – RDF generated from EBI database dump – Published at linkedchemistry.info • Credit: Egon Willighagen • Soon moving to ChEMBL 15 – RDF published by EBI
    9. Challenges • Datasets available – In many versions over time – In different formats – From many mirrors/registries • Files do not carry metadata • Registries – Can be out-of-date – Can contain conflicting information
    10. VoID: Vocabulary of Interlinked Datasets • Describes RDF datasets – W3C Note: http://www.w3.org/TR/void/ • Metadata carried with data – Directly embedded or linked (void:inDataset) • Problems – Very generic – No checklist of requisite fields
    11. Provenance Vocabularies • Dublin Core Terms – Widely used – Terms too generic to give proper credit • “Date: A point or period of time associated with an event in the lifecycle of the resource.” • PROV – New W3C standard: www.w3.org/2011/prov – Generic framework for exchanging data – Does not contain required predicates
    12. PAV: Provenance, Authoring and Versioning Vocabulary http://code.google.com/p/pav-ontology/wiki/Homepage • Easy to understand predicates – http://purl.org/pav/ • Right level of granularity – Distinguishes: author/creator/curator – Captures source of data: • import/derived/accessed • version/previousVersion • Being aligned with PROV-O
    13. Dataset Descriptions in the Open Pharmacological Space
    14. Related Work • Registries: DataHub, MIRIAM – Do not tie metadata with the data – No checklist of attributes • BioDBCore – Checklist • Similar information captured • Includes point of contact information – Not tied to the data
    15. Realisation of Dataset Descriptions • Needs to be incorporated into data publishing pipeline • Hard for publishers to provide conformant descriptions – Datasets are complex – Evolve over time – Seen as yet another burden • Validation tool provided – http://openphacts.cs.man.ac.uk:9090/OPS-IMS/validate
    16. Future Vision • Provide rich and accurate provenance trail of data – Alignment with BioDBCore • One standard to rule them all – Automatic pipeline from VoID file to registries • Write once, use many times
    17. Thank you A.Gray@cs.man.ac.uk www.cs.man.ac.uk/~graya/ www.openphacts.org
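
As a rough illustration of the VoID and PAV slides above, a dataset description that travels with the data, plus a checklist of requisite fields, might look something like the following. This is a minimal Python sketch under stated assumptions, not the Open PHACTS specification or its validation tool: the example.org URIs, the three-item checklist and the helper names (`lit`, `uri`, `missing_fields`, `to_turtle`) are invented for illustration. Only the namespaces and the predicate names (void:Dataset, void:dataDump, pav:version, pav:derivedFrom, dcterms:title) are taken from the published vocabularies.

```python
# Sketch: build a VoID/PAV dataset description and check it against a checklist.
# All example.org URIs and the REQUIRED checklist are hypothetical.

PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "pav": "http://purl.org/pav/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical checklist; the real guideline defines its own requisite fields.
REQUIRED = ("dcterms:title", "pav:version", "void:dataDump")


def lit(text):
    """Serialise a plain string as a Turtle literal (escapes backslash and quote)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def uri(address):
    """Serialise an absolute IRI as a Turtle IRI reference."""
    return f"<{address}>"


def missing_fields(description):
    """Return the checklist predicates that are absent or empty."""
    return [pred for pred in REQUIRED if not description.get(pred)]


def to_turtle(subject, description):
    """Render the description as a single void:Dataset resource in Turtle."""
    header = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    pairs = [("a", "void:Dataset")] + list(description.items())
    body = " ;\n    ".join(f"{pred} {obj}" for pred, obj in pairs)
    return "\n".join(header) + f"\n\n<{subject}>\n    {body} .\n"


# Illustrative description; version "13" echoes the slide, the URIs are made up.
description = {
    "dcterms:title": lit("ChEMBL RDF (illustrative)"),
    "pav:version": lit("13"),
    "pav:derivedFrom": uri("http://example.org/chembl/13/database-dump"),
    "void:dataDump": uri("http://example.org/chembl/13/chembl.ttl"),
}

problems = missing_fields(description)
turtle = to_turtle("http://example.org/chembl/13/void", description)
print(turtle if not problems else f"Missing fields: {problems}")
```

If a description like this were embedded in, or linked from, the data file itself (for example via void:inDataset), the question "what version of ChEMBL?" could be answered from the file rather than from a registry that may be out of date.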