B4OS-2012

Bioinformatics for Omics Sciences (B4OS),
CNR Naples, 25-17 Sep 2012

Published in: Education, Technology

Transcript

  • 1. Data management and curation: the other side of bioinformatics. Susanna-Assunta Sansone, PhD, Principal Investigator and Team Leader, University of Oxford e-Research Centre, Oxford, UK. http://uk.linkedin.com/in/sasansone http://www.slideshare.net/SusannaSansone/B4OS-2012 Bioinformatics for Omics Sciences (B4OS), CNR Naples, 25-17 Sep 2012
  • 2. Oxford e-Research Centre
  • 3. Oxford e-Research Centre
  • 4. Oxford e-Research Centre: Providing research computing, high-performance computing. Integrating with national and international infrastructure. Supporting leading edge facilities through education and training
  • 5. Oxford e-Research Centre: Collaborating with European and wider international groups in, e.g.: energy, radio astronomy, biological data federation, life sciences simulation, biodiversity, computational chemistry, neuroscience, digital humanities tools, digital music analysis. Research in: computation, data infrastructure and analysis, visualisation
  • 6. My team’s activities and groups we work with: data management, biocuration, development of software, databases and community-driven standards and ontologies. Domains: env, agro, tox/pharma, health
  • 7. http://www.flickr.com/photos/12308429@N03/4957994485/ CC BY
  • 8. Today: “The buzz around reproducible bioscience data: the policies, the communities and the standards” Thursday: “The reality from the buzz: how to deliver reproducible bioscience data”
  • 9. Preserve institutional / corporate memory: Harmonize collection across sites; Find matching studies; Data dissemination; Long-term data stewardship
  • 10. Utilize public data: Identify suitable data; Retrieve; Curate and harmonize; Re-analyze
  • 11. Address reproducibility / reuse of public data
  • 12. Address reproducibility / reuse of public data
  • 13. Address reproducibility / reuse of public data. Ioannidis et al., Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149-55 (2009) doi:10.1038/ng.295
  • 14. Address reproducibility / reuse of public data
  • 15. Address reproducibility / reuse of public data
  • 16. Address reproducibility / reuse of public data
  • 17. Growing, worldwide movement for reproducible research. Shared, annotated research data and methods offer new discovery opportunities and prevent unnecessary repetition of work. Improved data sharing underpins science of the future. “Publicly-funded research data are a public good, produced in the public interest” “Publicly-funded research data should be openly available to the maximum extent possible” The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project
  • 18. http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY
  • 19. Reproducible & Reusable Bioscience Research
  • 20. reasoning, visualization, analysis, browsing, integration, exchange, retrieval. Well-annotated & Structured Data. Reproducible & Reusable Bioscience Research
  • 21. reasoning, visualization, analysis, browsing, integration, exchange, retrieval. Community Standards; Software Tools. Well-annotated & Structured Data. Reproducible & Reusable Bioscience Research
  • 22. Today’s bioscience research (figure labels: Publications; Experimental and computational data). Is interdisciplinary and integrative in character: need to deal with new and existing datasets; deal with a variety of data types. ‘How the organism works’ is the focus: twenty years ago data was the center. Source of the figure: EBI website
  • 23. Example from the toxicogenomics domain. Study looking at the effect of a compound inducing liver damage by characterizing/measuring: the metabolic profile by MS and NMR; protein expression in liver by MS; gene expression by DNA microarray; and by conducting genetic and phenotypical analysis. Information contributing to the construction and validation of systems biology models
  • 24. Example of experiments by InnoMed PredTox, an FP6 public-private consortium
  • 25. Structured description of datasets: Capture all salient features of the experimental workflow. Make annotation explicit and discoverable. Structure the descriptions for consistency, tracking independent variables and dependent variables, using cross references and resolvable identifiers
  • 26. Not too much, not too little, just ‘right’. We must strike a balance between depth and breadth of information, and sufficient information required to reuse the data
  • 27. Information intensive experiments
  • 28. Information intensive experiments To make the experiments comprehensible and reusable, underpinning future investigations, we need common ways to report and share the experimental details and the associated data. Consistent reporting will have a positive and long-lasting impact on the value of collective scientific outputs.
  • 29. Common ways to report and share. The challenges we face: large in volume (lots of data types and metadata); lots of free text descriptions (hard to mine, subject to mistakes); Babel of terminologies (lack of definitions, hard to map); heterogeneous file formats (software lock-in). Need for reporting standards: minimal reporting descriptors (report the same ‘core essentials’); controlled vocabularies or ontologies (use the same word and mean the same thing); common exchange formats (make tools interoperable, allow data exchange and integration)
  • 30. Reporting standards – the benefits: Describe and communicate the information to others, in an unambiguous manner. To unlock the value in the data: compare, query and evaluate data (facilitate scientific validation of the findings); understand variability within/between different technologies and protocols (facilitate technical validation; enable optimization of the experimental designs; identify critical checkpoints and develop quality metrics). To define submission and/or publication requirements: journals; databases. To ensure data integrity, reproducibility and (re)use
  • 31. Escalating number of standardization efforts in bioscience, e.g.: Genome annotation, www.geneontology.org; Genomics Standards Consortium (GSC), gensc.org; Functional Genomics Data Society (FGED), www.fged.org; Enzymology data standards, www.strenda.org; HUPO-Proteomics Standards Initiative (PSI), http://www.psidev.info; Systems modelling standards, www.sbml.org; Cheminformatics, www.ebi.ac.uk/chebi; Pathways, www.biopax.org; Metabolomics Standards Initiative (MSI), http://www.metabolomicssociety.org
  • 32. Different community, different norms and standards, e.g.: use the same word and refer to the same ‘thing’; allow data to flow from one system to another; report the same core, essential information
  • 33. Different community, different norms and standards, e.g.: use the same word and refer to the same ‘thing’; allow data to flow from one system to another; report the same core, essential information
  • 34. Different community, different norms and standards, e.g.: use the same word and refer to the same ‘thing’; allow data to flow from one system to another; report the same core, essential information. Challenges: lack of coordination, fragmentation and uneven coverage
  • 35. Is this ‘general mobilization’ good or bad? Use the same word and refer to the same ‘thing’; allow data to flow from one system to another; report the same core, essential information. Difference in structures and processes: organization types (open, closed to members, society, WG…); standards development (how to design, develop, evaluate, maintain…); adoption, uptake, outreach (link to journals, funders, commercial sector…); funds (sponsors, memberships, grants, volunteering…)
  • 36. Is this ‘general mobilization’ good or bad? Use the same word and refer to the same ‘thing’; allow data to flow from one system to another; report the same core, essential information. Fragmentation of the standards is a major issue: being focused on particular communities’ interests, be they individual technologies or biological/biomedical disciplines, leads to duplication of effort and, more seriously, the development of (largely arbitrarily) different standards. This severely hinders the interoperability of databases and tools and ultimately the integration of datasets
  • 37. Growing number of reporting standards: MAGE-Tab, AAO, MIAME, GCDML, MIAPA, CHEBI, SRAxml, OBI, MIRIAM, VO, SOFT, MIQAS, FASTA, PATO, MIX, CML, ENVO, REMARK, DICOM, MIGEN, GELML, MOD, SBRML, MIAPE, MIQE, TEDDY, MITAB, MzML, XAO, CIMR, CONSORT, BTO, ISA-Tab, SEDML…, DO, PRO, IDO…, MIASE, MISFISHIE…
  • 38. Growing number of reporting standards (figure labels: + 303, + 150, + 130; Source: MIBBI, EQUATOR; Source: BioPortal; Estimated; Databases, annotation, curation tools): MAGE-Tab, AAO, MIAME, GCDML, MIAPA, CHEBI, SRAxml, OBI, MIRIAM, VO, SOFT, MIQAS, FASTA, PATO, MIX, CML, ENVO, REMARK, DICOM, MIGEN, GELML, MOD, SBRML, MIAPE, MIQE, TEDDY, MITAB, MzML, XAO, CIMR, CONSORT, BTO, ISA-Tab, SEDML…, DO, PRO, IDO…, MIASE, MISFISHIE…
  • 39. But how much do we know about these standards? MAGE-Tab, AAO, MIAME, GCDML, MIAPA, CHEBI, SRAxml, OBI, MIRIAM, VO, SOFT, MIQAS, FASTA, PATO, MIX, CML, ENVO, REMARK, DICOM, MIGEN, GELML, MOD, SBRML, MIAPE, MIQE, TEDDY, MITAB, MzML, XAO, CIMR, CONSORT, BTO, ISA-Tab, SEDML…, DO, PRO, IDO…, MIASE, MISFISHIE…
  • 40. But how much do we know about these standards? “Which tools and databases implement which standards?” “I use high throughput sequencing technologies, which ones are applicable to me?” “How can I get involved to propose extensions or modifications?” “What are the criteria to evaluate their status and value?” “Which ones are mature enough for me to use or recommend?” “I work on plants, are these just for biomedical applications?”
  • 41. But how much do we know about these standards? A bewildering array of standards is available, but these are hard to find and at different levels of maturity; in some areas duplications or gaps in coverage also exist. Standards are just a ‘means to an end’, therefore we want to make them discoverable and accessible, maximizing their use to assist the virtuous data cycle, from generation to standardization through publication to subsequent sharing and reuse
  • 42. A catalogue to map the landscape of standards and the systems implementing them: Over 400 bio-standards (public and in curation). Field*, Sansone* et al., Omics data sharing. Science 326, 234-36 (2009) doi:10.1126/science.1180598
  • 43. A coherent, curated and searchable catalogue of data sharing resources; Bioscience standards and associated data-sharing policies, publications, tools and databases; Assessment criteria for usability and popularity of standards; Relationships among standards; Encouragement for communication & interaction among groups; Promoting interoperability & informed decisions about standards
  • 44. Example of multi-assays study – how many ‘standards’ are applicable to this?
  • 45. Example of multi-assays study – how many ‘standards’ are applicable to this?
  • 46. Example of multi-assays study – how many ‘standards’ are applicable to this?
  • 47. Example of multi-assays study – how many ‘standards’ are applicable to this?
  • 48. Smith et al, 2007
  • 49. Smith et al, 2007; Taylor, Field, Sansone et al, 2008
  • 50. List of databases, linked to standards; a collaboration with Database Issue
  • 51. List of databases, linked to standards; a collaboration with Database Issue
  • 52. List of databases, linked to standards; a collaboration with Database Issue
  • 53. Major challenge: define ‘relations’ among standards. The relationship among popular standard formats for pathway information: BioPAX and PSI-MI are designed for data exchange to and from databases and pathway and network data integration. SBML and CellML are designed to support mathematical simulations of biological systems and SBGN represents pathway diagrams. CREDIT: Demir, et al., The BioPAX community standard for pathway data sharing, 2010.
  • 54. This is not just a technical but also a social engineering challenge!
  • 55. Ownership of open standards can be problematic in broad, grass-root collaborations; it requires improved models, to encourage maintenance of and contributions to these efforts, supporting their evolution
  • 56. The extensive ‘social engineering’ and community liaison needs to be managed and funded; rewards and incentives need to be identified for all contributors
  • 57. http://www.flickr.com/photos/idiolector/289490834/ CC BY
  • 58. The cost of implementing a standards-supported data sharing vision is as large as the number of stakeholders that must operate synchronously
  • 59. 1. Funders actively developing data policies: Several data preservation, management and sharing policies have emerged in response to increased funding for omics domains. Even if in general terms, standards are recognized as necessary ‘tools’ to unambiguously represent, describe and communicate research data
  • 60. 2. Similar trend in the regulatory arena: “… lack of standardized data affects CDER’s review processes by curtailing a reviewer’s ability to perform integral tasks such as rapid acquisition, storage, analysis … efficient management of a portfolio of standards projects will require coordinated efforts and clear roles for multiple participants within/outside FDA”
  • 61. 3. Publishers have become strong advocates: Continue to support the development of open standards and tools to support sharing of sufficiently well annotated datasets and to enable comprehensible, reusable, reproducible research
  • 62. …the rise of data-driven journals, e.g.: partnering with:
  • 63. The rise of data-driven journals, e.g.: partnering with:
  • 64. 4. Similar trend in the commercial sector: R&D has invested heavily in procedures and tools that integrate external information with their own data to enhance the decision-making process. Now joining forces to streamline non-competitive elements of the life science workflow by the specification of common standards, business terms, relationships and processes
  • 65. …their information landscape is evolving (Yesterday, Today, Tomorrow). Diagram labels: Proprietary content provider; Public content provider; Big Life Science Company; CRO; Academic group; Regulatory authorities; Service provider; Software vendor. Innovation Model: Yesterday, innovation inside; Today, searching for innovation; Tomorrow, heterogeneity of collaborations, part of the wider ecosystem. IT: Yesterday, internal apps & data; Today, struggling with change, security and trust; Tomorrow, cloud, services. Data: Yesterday, mostly inside; Today, in and out; Tomorrow, distributed. Portfolio: Yesterday, internally driven and owned; Today, partially shared; Tomorrow, shared portfolio. Credit to: Pistoia Alliance
  • 66. Take home messages: Contribute to the reproducible research movement; Think about data management as a career path; Learn more about open community standards; Get involved, e.g.: Open Bioinformatics Foundation
  • 67. Data is not like a $ bill… http://www.flickr.com/photos/jackofspades/4500411648/ CC BY
  • 68. Your research and all (publicly funded) research should make an … impact http://www.flickr.com/photos/equinoxefr/2620239993/ CC BY
  • 69. …the biggest possible impact! http://www.flickr.com/photos/webhamster/2582189977/ CC BY
  • 70. Today: “The buzz around reproducible bioscience data: the policies, the communities and the standards” Thursday: “The reality from the buzz: how to deliver reproducible bioscience data”
  • 71. Is it possible to achieve a common, structured representation of diverse bioscience experiments that follows the appropriate community standards and delivers richly-annotated datasets?
  • 72. Tim Berners-Lee’s 5-star deployment scheme for Linked Open Data
  • 73. Increasing level of structure: Notes in Lab Books (information for humans); Spreadsheets and Tables (the compromise); Facts as RDF statements (information for machines). TOWARDS INTEROPERABLE BIOSCIENCE DATA, doi:10.1038/ng.1054, Feb 2012. Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman LA, Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo C, Forster M, Gaudet P, Gilbert J, Goble C, Griffin J, Jacob D, Kleinjans J, Harland L, Haug K, Hermjakob H, Sui S, Laederach A, Liang S, Marshall S, Merrill E, McGrath A, Reilly D, Roux M, Shamu C, Shang C, Steinbeck C, Trefethen A, Williams-Jones B, Wolstencroft K, Xenarios J, Hide W. www.biosharing.org www.isacommons.org
  • 74. References
      1. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ; OBI Consortium, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone SA, Scheuermann RH, Shah N, Whetzel PL, Lewis S: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25(11):1251-1255 (2007)
      2. Taylor CF*, Field D*, Sansone SA*, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, Brazma A, Brinkman RR, Michael Clark A, Deutsch EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM, Hardy NW, Hermjakob H, Julian RK Jr, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Le Novère N, et al.: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 26(8):889-896 (2008)
      3. Field D*, Sansone SA*, Collis A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar P, Kolker E, Maxon M, Millard S, Mugabushaka AM, Perrin N, Remacle JE, Remington K, Rocca-Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J: Megascience. Omics data sharing. Science 326(5950):234-236 (2009)
      4. Harland L, Larminie C, Sansone SA, Popa S, Marshall MS, Braxenthaler M, Cantor M, Filsell W, Forster MJ, Huang E, Matern A, Musen M, Saric J, Slater T, Wilson J, Lynch N, Wise J, Dix I: Empowering industrial research with shared biomedical vocabularies. Drug Discov Today 16(21-22):940-947 (2011)
      5. Sansone SA and Rocca-Serra P: On the evolving portfolio of community-standards and data sharing policies: turning challenges into new opportunities. GigaScience 1:10 (2012)
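
Slide 29 names three kinds of reporting standard (minimal reporting descriptors, controlled vocabularies, common exchange formats), and slide 73 describes the step from spreadsheets and tables to facts as statements. A minimal Python sketch of how these pieces could fit together is given here. Everything in it is an assumption made for illustration: the field names, the toy vocabulary and the sample records are hypothetical and are not taken from the talk, from MIBBI, from ISA-Tab or from any real ontology, and the "triples" are plain tuples rather than real RDF.

```python
import csv
import io

# Hypothetical minimal checklist: fields every sample record must report.
REQUIRED_FIELDS = ("sample_id", "organism", "tissue", "assay_type")

# Toy controlled vocabulary (illustrative terms only, not a real ontology).
CONTROLLED_TERMS = {
    "tissue": {"liver", "kidney", "blood"},
    "assay_type": {"DNA microarray", "mass spectrometry", "NMR spectroscopy"},
}


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field, "").strip():
            problems.append(f"missing required field: {field}")
    for field, allowed in CONTROLLED_TERMS.items():
        value = record.get(field, "").strip()
        if value and value not in allowed:
            problems.append(f"{field}: '{value}' is not a controlled term")
    return problems


def to_tsv(records):
    """Serialize records as tab-delimited text (the 'spreadsheet' level)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REQUIRED_FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_triples(record):
    """Turn one record into (subject, predicate, object) statements."""
    subject = record["sample_id"]
    return [(subject, field, record[field])
            for field in REQUIRED_FIELDS if field != "sample_id"]


# Two made-up records: the second breaks both the checklist and the vocabulary.
records = [
    {"sample_id": "S1", "organism": "Rattus norvegicus",
     "tissue": "liver", "assay_type": "DNA microarray"},
    {"sample_id": "S2", "organism": "Rattus norvegicus",
     "tissue": "hepatic", "assay_type": ""},
]

for record in records:
    print(record["sample_id"], validate(record) or "ok")

valid = [r for r in records if not validate(r)]
print(to_tsv(valid), end="")
for triple in to_triples(valid[0]):
    print(triple)
```

The checklist and the vocabulary are kept as separate data structures so that either could be swapped for a community-maintained one without touching the validation logic; a real pipeline would more likely use an established format and an RDF library than these hand-rolled helpers.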