Big Data Standards - Workshop, ExpBio, Boston, 2015

http://experimentalbiology.org/2015/Home.aspx

  1. 1. Big Data Standards: how to set the bar? Susanna-Assunta Sansone, PhD. @biosharing, @isatools. Experimental Biology, Big Data Workshop, 28 March 2015. Data Consultant, Honorary Academic Editor; Associate Director, Principal Investigator. http://www.slideshare.net/SusannaSansone
  2. 2. Credit to: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/
  3. 3. A community mobilization for “openness”
  4. 4. Is open data understandable, reusable? “Reproducing the method took several months of effort, and required using new versions and new software that posed challenges to reconstructing and validating the results”
  5. 5. Is open data understandable, reusable? Not always… but why?
      •  Outputs are multi-dimensional, diverse, not always well cited / stored
      •  Software, codes, workflows etc. are hard(er) to get hold of
      •  Data are often distributed and fragmented to fit (siloed) databases
         o  Often lacking enough information for others to understand them
      •  Uneven level of detail and annotation across different databases
         o  Specialized, generalist, public and institutional
      •  Data curation activities are perceived as time-consuming
         o  Collection and harmonization of detailed methods and experimental steps is done/rushed at publication stage
  6. 6. Not just open, but FAIR data
  7. 7. Responsibilities lie across several stakeholder groups
      •  Understand the benefits of sharing FAIR datasets and enact them
      •  Engage and assist researchers to enable them to share FAIR datasets
      •  Release or endorse practices and policies, but also incentive and credit mechanisms for researchers, curators and developers
  8. 8. Rise of a data-centric enterprise, e.g.:
  9. 9. Not just data, but FAIR digital research objects
  10. 10. Without context, data is meaningless
      •  We need to report sufficient information to reuse the dataset
      •  We must strike a balance between depth and breadth of information
  11. 11. Information intensive experiments •  Not too much •  Not too little •  But just right
  12. 12. And conversely… LS1_C2_LD_TP2_P1 file1.gz
  13. 13. …how not to report the experimental information!
      Sample name (?!): LS1_C2_LD_TP2_P1
      Data file: file1.gz
      •  LS1: liver sample 1
      •  C2: compound 2
      •  LD: low dose
      •  TP2: time point 2
      •  P1: protocol 1
      •  file1.gz: compressed data file with phenotypic and other information on this sample
  14. 14. To make any dataset ‘FAIR’, one must have standards, tools and best practices to:
      §  report sufficient details
      §  capture all salient features of the experimental workflow
      •  make annotation explicit and discoverable
      •  structure the descriptions for consistency
      •  ensure/regulate access
      •  deposit and publish
      •  etc.
      The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project
  15. 15. …breadth and depth of the experimental context …is pivotal …and has to be both human and machine readable
  16. 16. Introducing the Data Descriptor (nature.com/scientificdata): a new category of publication that provides detailed descriptors of scientifically valuable datasets. Data Descriptors are a highly effective link between traditional research articles and data repositories.
  17. 17. Data Descriptors: to add value to research articles and data records. [Diagram labels: Research papers, Data Descriptors, Data records]
  18. 18. Data Description narrative and structured components
      •  Article or narrative component (PDF and HTML)
      •  Experimental metadata or structured component (in-house curated, machine-readable format)
  19. 19. A curated, structured component - why?
      •  Supplements the scientific discourse
         o  natural language has a degree of ambiguity
      •  Brings clarity in reporting research methods and procedures
         o  no trimming, no cooking
         o  clear samples to data files links and relation to methods
      •  Provides the basis for search and discovery features
      [Diagram labels: SciData DD structured content (repeated); Same tissue; Same organism; Same assay; Community Data Repositories]
  20. 20. From natural language to ‘computable’ concepts
      “Seven week old C57BL/6N mice were treated with low-fat diet. Liver was dissected out, hepatocytes prepared…”
      Data Curation Editor: responsible for creating the structured component, ensuring that the most appropriate metadata is being captured.
  21. 21. From natural language to ‘computable’ concepts
      “Seven week old C57BL/6N mice were treated with low-fat diet. Liver was dissected out, hepatocytes prepared…”
      Annotations: Age value; Unit; Strain name; Subject of the experiment; Type of diet and experimental condition; Anatomy part
  22. 22. From natural language to ‘computable’ concepts
      “Seven week old C57BL/6N mice were treated with low-fat diet. Liver was dissected out, hepatocytes prepared…”
      Annotations: Age value; Unit; Strain name; Subject of the experiment; Type of diet and experimental condition; Anatomy part
      Protocols: Type of protocol - sample treatment; Type of protocol - liver preparation; Type of protocol - cell preparation
  23. 23. Community-developed content standards: to structure and enrich the description of datasets, facilitating understanding, sharing and reuse
      •  Including minimum information reporting requirements, or checklists, to report the same core, essential information
      •  Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’
      •  Including conceptual model, conceptual schema from which an exchange format is derived to allow data to flow from one system to another
  24. 24. Community mobilization, some examples
      [Diagram labels: de jure; de facto; grass-roots groups; standards organizations]
      •  Structural and operational differences
         §  organization types (open, closed to members, society, WG etc.)
         §  standards development (how to formulate, conduct and maintain)
         §  adoption, uptake, outreach (link to journals, funders and commercial sector)
         §  funds (sponsors, memberships, grants, volunteering)
  25. 25. In the life sciences… almost 600 (~156, ~70, ~334)
      •  miame, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE…, REMARK, CONSORT
      •  MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML…, GELML, ISA-Tab, CML, MITAB
      •  AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO…, TEDDY, PRO, XAO, DO, VO
      Databases, annotation, curation tools implementing standards
  26. 26. A web-based, curated and searchable registry ensuring that standards are registered, informative and discoverable; monitoring their development and evolution and their use in databases, and the adoption of both in data policies. Launched Jan 2011
  27. 27. Search, filter, claim, view and more
      Core functionalities:
      •  search and filtering, e.g. by funder
      •  submission forms to add new records
      •  “claim” functionality for existing records
      •  person’s profile (as maintainer of records) associated with the ORCID profile (for credit, as incentive)
      •  visualization and views of content
  28. 28. Assists users to make informed decisions
  29. 29. Advisory Board and Working Group - core members and adopters Operational Team
  30. 30. Standards as an area of research - still a lot to do! E.g.:
      1. Create relation or “usage maps and guides”, e.g.: the relationship among popular standard formats for pathway information (Demir, et al., The BioPAX community standard for pathway data sharing, Nat Biotech. 2010)
      2. Metrics of maturity, usability and popularity
      3. Embed in the ecosystem of complementary registries
  31. 31. To compare and integrate data we need interoperable standards. How do we address fragmentation, duplications and gaps?
      •  Technologically-delineated views of the world
      •  Biologically-delineated views of the world
      •  Generic features (common core)
         o  description of source biomaterial
         o  experimental design components
      [Diagram labels: transcriptomics, proteomics, metabolomics, plant biology, epidemiology, microbiology; Arrays, Scanning, Arrays & Scanning, Columns, Gels, MS, FTIR, NMR]
  32. 32. Global alliances are needed, e.g.:
  33. 33. biocaddie.org
  34. 34. metadatacenter.org
  35. 35. Researchers hate standards!
      •  Most researchers understand the value of standardized descriptions when using third-party datasets
      •  But when asked to structure their datasets, they view requests for even “minimal” information as burdensome
      Ø  There is an urgent need to lower the bar for authoring good metadata
  36. 36. Researchers hate standards!
      •  Most researchers understand the value of standardized descriptions when using third-party datasets
      •  But when asked to structure their datasets, they view requests for even “minimal” information as burdensome
      Ø  There is an urgent need to lower the bar for authoring good metadata
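
As a rough illustration of slides 12-13 and 23, the sketch below expands the opaque sample name LS1_C2_LD_TP2_P1 into explicit, machine-readable fields and then checks the result against a minimal checklist. The token meanings are taken from the legend on slide 13; the field names (`sample`, `compound`, `dose`, `time_point`, `protocol`, `data_file`), the checklist itself, and both function names are assumptions made for this example and do not come from any published standard or tool.

```python
# Token legend copied from slide 13. The field names are illustrative only.
TOKEN_LEGEND = {
    "LS1": ("sample", "liver sample 1"),
    "C2": ("compound", "compound 2"),
    "LD": ("dose", "low dose"),
    "TP2": ("time_point", "time point 2"),
    "P1": ("protocol", "protocol 1"),
}

# Hypothetical minimal checklist: fields every record is expected to carry.
REQUIRED_FIELDS = (
    "sample", "compound", "dose", "time_point", "protocol", "data_file",
)


def decode_sample_name(sample_name, data_file):
    """Expand an underscore-delimited sample name into explicit fields."""
    record = {"data_file": data_file}
    for token in sample_name.split("_"):
        if token not in TOKEN_LEGEND:
            # Without the legend, the token cannot be interpreted at all.
            raise ValueError(f"unknown token in sample name: {token!r}")
        field, meaning = TOKEN_LEGEND[token]
        record[field] = meaning
    return record


def missing_fields(record):
    """Return the checklist fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


record = decode_sample_name("LS1_C2_LD_TP2_P1", "file1.gz")
for field in REQUIRED_FIELDS:
    print(f"{field}: {record[field]}")
print("missing:", missing_fields(record))
```

The decoding only works because the legend travels with the code. A third party who receives only the file name has no such legend, which is the slide's point about making annotation explicit.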
