
Keynote speech - Carole Goble - Jisc Digital Festival 2015

Carole Goble is a professor in the school of computer science at the University of Manchester.

In this keynote, Carole offered her insights into research data management and data centres.


  1. RARE and FAIR Science: Reproducibility and Research Objects. Professor Carole Goble FREng FBCS, The University of Manchester, UK; The Software Sustainability Institute. Jisc Digital Festival, 9-10 March 2015, ICC Birmingham, UK
  2. Knowledge Turning, Flow. Barriers to Cure: » Access to scientific resources » Coordination and Collaboration » Flow of Information [Josh Sommer]
  3. [Pettifer, Attwood]
  4. Virtual Witnessing*. Scientific publications: » announce a result » convince readers the result is correct. “papers in experimental [and computational science] should describe the results and provide a clear enough protocol [algorithm] to allow successful repetition and extension” Jill Mesirov, Broad Institute, 2010** **Accessible Reproducible Research, Science 22 January 2010, Vol. 327 no. 5964 pp. 415-416, DOI: 10.1126/science.1179653 *Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985), Shapin and Schaffer.
  5. Bramhall et al., Quality of Methods Reporting in Animal Models of Colitis, Inflammatory Bowel Diseases, 2015: “Only one of the 58 papers reported all essential criteria on our checklist. Animal age, gender, housing conditions and mortality/morbidity were all poorly reported…”
  6. “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.” David Donoho, “Wavelab and Reproducible Research,” 1995. Datasets, data collections, standard operating procedures, software, algorithms, configurations, tools and apps, codes, workflows, scripts, code libraries, services, system software, infrastructure, compilers, hardware. Morin et al., Shining Light into Black Boxes, Science 13 April 2012: 336(6078) 159-160; Ince et al., The case for open computer programs, Nature 482, 2012
  7. Of 50 papers randomly chosen from 378 manuscripts in 2011 that use the Burrows-Wheeler Aligner for mapping Illumina reads: 7 studies listed the necessary details; 26 gave no access to primary data sets, with broken links to home websites; 31 gave no software version, parameters, or exact version of the genomic reference sequence. Nekrutenko & Taylor, Next-generation sequencing data interpretation: enhancing reproducibility and accessibility, Nature Genetics 13 (2012)
  8. Broken software, broken science » Geoffrey Chang, Scripps Research Institute » Homemade data-analysis program inherited from another lab » Flipped two columns of data, inverting the electron-density map used to derive protein structure » Retracted 3 Science papers and 2 papers in other journals » One retracted paper cited by 364 others. The structures of MsbA (purple) and Sav1866 (green) overlap little (left) until MsbA is inverted (right). Miller, A Scientist's Nightmare: Software Problem Leads to Five Retractions, Science 22 December 2006: vol. 314 no. 5807 1856-1857
  9. Software making practices: “As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software” (survey of 2000 scientists). J.E. Hannay et al., “How Do Scientists Develop and Use Scientific Software?” Proc. ICSE Workshop Software Eng. for Computational Science and Eng., 2009, pp. 1-8.
  10. Tools, Standards, Machine-actionable Formats, Reporting, Policies, Practices
  11. Record and Automate Everything.
  12. Republic of science* vs regulation of science: institution cores, libraries, public services. *Merton's four norms of scientific behaviour (1942)
  13. Science is messy: honest error is inherent, and there is fraud. Example: the Reinhart/Rogoff austerity economics spreadsheet error, found by Thomas Herndon (Zoë Corbyn, Nature, Oct ’12)
  14. “I can’t immediately reproduce the research in my own laboratory. It took an estimated 280 hours for an average user to approximately reproduce the paper.” Prof Phil Bourne, Associate Director, NIH Big Data 2 Knowledge Program
  15. When research goes “wrong”: scientific method » Tainted resources » Black boxes » Poor reporting » Unavailable resources / results: data, software » Bad maths » Sins of omission » Poor training, sloppiness. (adapted) Ioannidis, Why Most Published Research Findings Are False, August 2005; Joppa et al., Troubling Trends in Scientific Software Use, Science 340, May 2013
  16. When research goes “wrong”: social environment » Impact factor mania » Pressure to publish » Broken peer review » Research never reported » Disorganisation » Time pressures » Prep & curation costs. (adapted) Nick D Kim, Norman Morrison. Do a replication study? No thanks! Not FAIR. Hard. Resource intensive. Unrecognised. Trolled. Just gathering the bits.
  17. Fragmented Landscape: cross-institutional e-laboratory, scattered parts, subject-specific / general resources. 101 Innovations in Scholarly Communication - the Changing Research Workflow, Bosman and Kramer, 2015
  18. [slide: image only]
  19. Research Objects: compound investigations, research products; multi-various products, platforms/resources; units of exchange, commons, contextual metadata
  20. Research Objects: first-class citizens - data, software, methods - id, manage, credit, track, profile, focus. A framework to bundle and relate (scattered) resources. Metadata objects that carry research context
  21. • closed <-> open • local <-> alien • embed <-> refer • fixed <-> fluid • nested • multi-typed, stewarded, sited, authored • span research, researchers, platforms, time • cite? resolve? steward?
  22. Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, I3CK, 2013. means, ends, driver
  23. A Research Object packages codes, study, and metadata to exchange descriptions of clinical study cohorts, statistical scripts, data (CKAN for the Farr Commons). STELAR Asthma e-Lab: Study Team for Early Life Asthma Research; coded patient cohorts exchanged with the NHS FARSITE system; an MRC-funded multi-site collaboration to support safe use of patient and research data for medical research. STELAR e-Lab: Platform 1, Platform 2, Platform 3
  24. Focus, Pivot and Profile: profile around methods, workflows, scripts, software, data, figures…
  25. Focus on the figure: F1000Research Living Figures, versioned articles, in-article data manipulation. R Lawrence, Force2015 Vision Award Runner Up. Simply data + code can change the definition of a figure, and ultimately the journal article. Colomb J and Brembs B. Sub-strains of Drosophila Canton-S differ markedly in their locomotor behavior [v1; ref status: indexed] F1000Research 2014, 3:176. Other labs can replicate the study, or contribute their data to a meta-analysis or disease model - the figure automatically updates. Data updates are time-stamped; new conclusions are added via versions.
  26. A software-like release paradigm, not a static document paradigm. Reproduce looks backwards -> Release looks forwards » Science, methods, data change -> agile evolution » Comparisons, versions, forks & merges, dependencies » Ids & citations » Interlinked ROs. Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012
  27. [Snoep, 2015]
  28. [slide: image only]
  29. Personal Data, Local Stores, External Databases, Articles, Models, Standards
  30. Aggregated Commons Infrastructure: consistent comparative reporting • Design, protocols, samples, software, models… • Just Enough Results Model • Common and specific elements
  31. RO as Instrument, Materials, Method
  32. RO as Instrument, Materials, Method: Input Data, Software, Output Data, Config, Parameters. Drummond, Replicability is not Reproducibility: Nor is it Good Science, online; Peng, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.
  33. RO Workflow as Instrument: public data sets, my data set, my algorithm, public software, BioSTIF
  34. What IS reproducibility? Re: “do again”, “return to original state”. Recompute, replicate, rerun, repeat, re-examine, repurpose, recreate, reuse, restore, reconstruct, review, regenerate, revise, recycle, redo. Regenerate the figure; “show A is true by doing B”; verify but not falsify [Yong, Nature 485, 2012]. Robustness, tolerance, verification, compliance, validation, assurance
  35. 1. Science Changes. So does the Lab. “The questions don’t change but the answers do” Dan Reed. The lab is not fixed: updated resources, uncertainty. BioSTIF
  36. 2. Instruments Break, Labs Decay: materials become unavailable, technicians leave. Reproducibility Window » Bit rot, black boxes » Proprietary licenses » Clown services » Partial replication » Prepare to Repair › form or function? › preserve or sustain? Zhao et al., Why workflows break - Understanding and combating decay in Taverna workflows, 8th Intl Conf e-Science 2012
  37. RO as Instrument, Materials, Method: Input Data, Software, Output Data, Config, Parameters. Methods (techniques, algorithms, spec. of the steps); Materials (datasets, parameters, algorithm seeds); Experiment Instruments (codes, services, scripts, underlying libraries); Laboratory (sw and hw infrastructure, systems software, integrative platforms); Setup. Drummond, Replicability is not Reproducibility: Nor is it Good Science, online; Peng, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.
  38. Research Environment -> publish article -> Publication Environment: submit article and move on…
  39. Research Environment -> publish article -> Publication Environment: submit article and move on…
  40. [Adapted Freire, 2013] Transparency, dependencies, steps, provenance, portability, robustness, preservation, access: available, description, intelligible, standards, common APIs, licensing, common metadata, change management, versioning, packaging. Machine actionable
  41. Provenance – the link between doing and reporting
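
A minimal sketch of that link between doing and reporting: capture, at the moment an analysis step runs, exactly what ran, with which settings, on which data, and when. The field names and the example tool/version below are illustrative assumptions, not a standard provenance vocabulary (standards such as W3C PROV exist for this).

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_of(data: bytes) -> str:
    """Content hash, so a later reader can check they hold the same bytes."""
    return hashlib.sha256(data).hexdigest()

def record_step(tool, version, args, input_bytes, output_bytes):
    """Capture one analysis step as a provenance record (illustrative schema)."""
    return {
        "tool": tool,
        "version": version,
        "arguments": args,
        "input_sha256": sha256_of(input_bytes),
        "output_sha256": sha256_of(output_bytes),
        "ran_at": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical step: an aligner run on some read data
step = record_step(
    tool="bwa", version="0.7.12",
    args=["mem", "-t", "4", "ref.fa"],
    input_bytes=b"ACGT",
    output_bytes=b"aligned",
)
print(json.dumps(step, indent=2))
```

Writing such a record automatically, as a side effect of running the step, is the "record and automate everything" point: the report is generated by the doing rather than reconstructed afterwards.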
  42. Reproduce by Reading: Archived Record, Retain the Process/Code
  43. Reproduce by Running: Active Instrument, Retain the Bits. The eLab Virtual Machine* (or Docker Image) * a black box though. The IT Crowd, Series 3, Episode 4
  44. The Internet. The IT Crowd, Series 3, Episode 4
  45. Science as a Service: integrative frameworks, Open Source Workflows, Virtual Machines, Portable Packaging. Portability, Transparency
  46. Science as a Service: integrative frameworks, Open Source Workflows, Virtual Machines, Portable Packaging. ReproZip, workflows, makefiles
  47. Metadata Objects: the secret is the manifest…
  48. Workflow definition, Data (inputs, outputs), Parameter configs, Provenance log. Hettne et al., Structuring research methods and data with the research object model: genomics workflows as a case study, 2014. myRDM
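
The manifest idea from slides 47-48 can be sketched as a small bundling step: one machine-readable file that lists the workflow definition, input and output data, parameter configs, and provenance log that a Research Object aggregates. The field names and file names here are illustrative assumptions, not the official RO model vocabulary.

```python
import json
from datetime import datetime, timezone

def build_manifest(workflow, inputs, outputs, params, provenance):
    """Bundle the pieces a Research Object aggregates into one manifest
    (illustrative schema, not the formal RO ontology)."""
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        "aggregates": [
            {"file": workflow, "role": "workflow-definition"},
            *[{"file": f, "role": "input-data"} for f in inputs],
            *[{"file": f, "role": "output-data"} for f in outputs],
            {"file": params, "role": "parameter-config"},
            {"file": provenance, "role": "provenance-log"},
        ],
    }

# Hypothetical genomics bundle in the spirit of the Hettne et al. case study
manifest = build_manifest(
    workflow="analysis.t2flow",
    inputs=["reads.fastq"],
    outputs=["alignment.bam"],
    params="params.json",
    provenance="run-log.prov",
)
print(json.dumps(manifest, indent=2))
```

The manifest is "the secret" because it carries the research context as data: a platform can resolve, check, or re-run the bundle without a human reconstructing what belongs together.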
  49. Depth and Coverage Profiles. NISO-JATS
  50. NISO-JATS Depth and Coverage Metadata Profiles. Zhao et al., 2013
  51. Method Matters: make reproducible -> born reproducible. Be smart about reproducibility. Think Commons, not Repository. RARE & FAIR: Knowledge Turns with Research Objects. Best Practices for Scientific Computing; Stodden, Reproducible Research Standard, Intl J Comm Law & Policy, 13, 2009
  52. Researcher. Silver bullet tools. Psychic paper.
  53. Reality Check! Jorge Cham
  54. Stealthy not Sneaky: reduce the friction, instrumentation, span RARE and FAIR. Optimising the Neylon Equation
  55. Auto-magical end-to-end Instrumentation: ELNs and Authoring Platforms, Sweave
  56. Credit ≠ Authorship. Citing what? Research Currencies
  57. Training: 56% of UK researchers develop their own research software or scripts; 73% of UK researchers have had no formal software engineering training. Survey of researchers from 15 Russell Group universities conducted by SSI between August and October 2014; 406 respondents covering a representative range of funders, disciplines and seniority.
  58. [slide: image only]
  59. BUT… in two years’ time, when the paper is written: reviewers want additional work; the statistician wants more runs; the analysis may need to be repeated; the post-doc leaves, a student arrives; new data, revised data; updated versions of algorithms/codes; the sample was contaminated
  60. The Challenge: • Incremental shift for infrastructure providers. • Moderate shift for policy makers and stewards. • Paradigm shift for researchers and their institutions. Inspired by Bob Harrison
  61. Thanks to all the members of the Wf4Ever team and colleagues in Manchester’s Information Management Group. http://www.biovel.eu Alan Williams, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Juliana Freire, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence
  62. Contact… Professor Carole Goble CBE FREng FBCS, The University of Manchester, UK. legoble