Keynote talk to LEARN (LERU/H2020 project) on research data management. Emphasizes that the problems are cultural, not technical. Promotes modern approaches such as Git and continuous integration, and announces DAT. Asserts that the Right to Read is the Right to Mine. Calls for widespread development of content mining (TDM).

  1. 1. The Culture of Research Data Peter Murray-Rust, and University of Cambridge LEARN, London, UK 2016-01-29 The technology for Managing Research Data is already here… …but we need a change of culture Open Notebook Science Publishers must be forced to serve us, not control us
  3. 3. My European Heroes Young People (ContentMine) NEELIE KROES
  4. 4. The Right to Read is the Right to Mine
  5. 5. Themes • Highly domain-dependent (chem, cryst, phylo) • Requires community and centrality • University repositories are NOT the solution • Openness makes it dramatically easier/better • The publisher-academic complex is a major problem. • Infrastructure must be open and under our control
  6. 6. WE pay for scholarly publications that WE can’t read [1] The Military-Industrial-Academic complex (1961) (Dwight D Eisenhower, US President) Publishers Academia Glory+? $$, MS review Taxpayer Student Researcher $$ $$ in-kind The Publisher-Academic complex[1]
  7. 7. Elsevier wants to control Open Data [asked by Michelle Brook]
  8. 8. Some topics • Github / software mgt informs data mgt • Open notebook science • Open source malaria + LabTrove • Open phylogenetics • Computational chemistry • Crystallography • Early career researchers can change the world, if we support them. • ContentMining (TDM) as research • Are “publishers” tyrants or servants?
  9. 9. Every Research Data Manager should be using Git
  10. 10. Why I reposit software in GitHub I WANT TO!!! BETTER QUICKER SECURE AUDIT BACKTRACKABLE EASY get collaborators Most early career software creators have repos How many people have USED Git?
  11. 11. Free/Open Software Development CODE REPOSITORY World community CODE rewrite validate CODE fork CODE Re-use CODE Re-use Github, BitBucket StackOverflow, Apache inspires OSI Example: ContentMine at BORN-OPEN-SOURCE NO WALLS
  12. 12. GIT housekeeps AUTOMATICALLY, eternally Daily record of commits and Merges. Can backtrack to ANY Previous version
  13. 13. Community involvement Contributions from People “outside project”
  14. 14. Continuous Integration (Jenkins) Every time I commit a change 50 projects are recompiled and tested. Impossible to do this manually! [Build status legend: Compile Fail, Inactive, Fail Tests, Pass Tests]
  15. 15. Software management Is a success! Research DATA management Is a mess.
  16. 16. Traditional Research and Publication “Lab” work paper/thesis Write rewrite Re-experiment publish ??? Validation?? DATA output “belongs” to publisher Every process is LOSSY
  17. 17. How NOT to publish data HT Henry Rzepa From Henry Rzepa: this article which provides a 22 Mbyte PDF of data (mostly bitmaps of NMR spectra) and comes in at 404 pages long. [1] But this one [comp chem] is 505 pages long (the current record holder?) [1] DATA Behind paywall
  18. 18. Computational Chemistry: the 505-page PDF was a machine-readable log file that could and should have been in a repo
  19. 19. MORE of the PDF DATA Destruction Blind humans and Machines cannot read this
  20. 20. ALWAYS put your (computational, instrumental, observational) data directly into a repository
  21. 21. some visionaries…
  22. 22. JD Bernal’s 1965 vision However large an array of facts, however rapidly they accumulate, it is possible to keep them in order and to extract from time to time digests containing the most generally significant information, while indicating how to find those items of specialized interest. To do so, however, requires the will and the means. (Bernal, 1965) Quoted by PMR in
  23. 23. PMR’s Tribute Planned Memorial Meeting July 14th 2014 Cambridge OPEN NOTEBOOK SCIENCE
  24. 24. • Automatic release of sequence assemblies larger than 1 kb (preferably within 24 hours). • Immediate publication of finished annotated sequences. • Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society. HUMAN GENOME project used Open Notebooks Without
  26. 26. Open Notebook Science, ONS Jean-Claude Bradley 2006 All data immediately available to all. NO INSIDER INFORMATION.
  27. 27. TOOLS Open Notebook Science Open engineered repository World community INSTRUMENT validate merge MODEL CODE DATA DATA knowledge calibrate Problems are solved communally; Nothing is needlessly duplicated; “publication“ is continuous ; data are SEMANTIC Machines and humans Working together
  28. 28. Here are three examples
  29. 29. Mat Todd (Sydney) and MANY collaborators (Chrome for interactivity) Mat Todd, Univ Sydney, runs an Open Notebook community to create new antimalarials.
  30. 30. Notebook managed on Git.
  31. 31. Interactive OPEN chemical search tool from
  32. 32. Interactive OPEN molecular display Jmol (Bob Hanson et al)
  33. 33. Interactive OPEN chemical search tool from
  34. 34. data is associated with the proposed scientific endeavour prior to or at the point of creation rather than by annotating the data with commentary after the experiment has taken place University of Southampton
  35. 35. Data thrives on Community
  36. 36. Henry Rzepa does Open Notebook Computational Chemistry… This is a current open notebook discussion, (see comments, currently 67). … on his blog
  38. 38. Crystallography – a model for Data Management • Pro-active, friendly international community • Committed active International Union (IUCr) • Data publication valued (1960-present) • Community develops semantics/dictionaries • Committed volunteer software innovators • Heavily Open approach • Massive and valuable re-use of data • Culture of validation/reproducibility • Respect and credit for tool development
  41. 41. DATA
  43. 43. Where to reposit published crystallography? Proteins -> PDB, Open BUT Inorganics -> ICSD Closed Organics -> Cambridge (CCDC) Closed SO The community has built a Crystallography Open Database
  44. 44. Restrictions on Re-use of Crystallographic data NOTE: The CCDC is based on data contributed by scientists as part of publication and validation Crystallographic data from publications now belongs to CCDC
  45. 45. Open Source and Open Data
  46. 46. Interactive OPEN crystal search tool
  47. 47. Panton Fellows (Early Career Researchers) Panton Principles of Open Scientific Data 2010 Publish data openly (CC0) and record your wishes
  48. 48. Sophie Kershaw, Panton Fellow : Doctoral Training in Oxford
  49. 49. Sophie Kershaw, Panton Fellow
  50. 50. Rotation-Based Learning (RBL) Phase 1: Initiator • No communication permitted between groups • Attempt to reproduce existing literature • Deliver a coherent research story by the end of Phase 1 Phase 2: Successor • Communication between groups still prohibited • Validate and develop the inherited research story • Critique your predecessors • Role of research producer vs. research user • Can this approach help to foster awareness of reproducibility issues? Throughout Phases 1 & 2: • Daily lectures on open science culture & techniques • First-hand application to own research work • Version control using GitHub • Daily group supervision
  51. 51. … third-year graduate students So first-year grad students should be trained by…
  52. 52. So we can now legally contentmine the whole literature in the UK… NORMA Ross Mounce and PMR created a SuperTree of Life for microorganisms! …Yes! And in UK we are starting to do it…
  54. 54. CC BY-SA
  55. 55. Aves Apterygidae Marsupialia Monotremata Mammalia Reptilia Amphibia Arthropoda Myriapodia Okapia johnstoni Pyrus Stuffed Tree of Life
  56. 56. Authors don’t deposit data (Ross Mounce)
  57. 57. And we did it as Open Notebook Science all data and code on Github Discussion on public Discourse Tool NO INSIDER KNOWLEDGE
  58. 58. 4300 images in Github
  59. 59. “Root” We analysed every pixel
  60. 60. Many diagrams had author errors
  61. 61. Supertree created from 4300 papers
  62. 62. Supertree for 924 species Tree
  63. 63. So why not Git for Data?
  64. 64. DAT is Git for Data!!
  65. 65. DAT! Queen Mary UL reposits DNA
  66. 66. The John S. and James L. Knight Foundation is an American private, non-profit foundation dedicated to supporting "transformational ideas that promote quality journalism, advance media innovation, engage communities and foster the arts." DAT supports public data
  67. 67. @Senficon (Julia Reda) :Text & Data mining in times of #copyright maximalism: "Elsevier stopped me doing my research" er-stopped-me-doing-my-research/ … #opencon #TDM Elsevier stopped me doing my research Chris Hartgerink
  68. 68. I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress. To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1]. In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers. Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day. Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university. 
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research. [1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2 Chris Hartgerink’s blog post
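The server-load figures quoted in the blog post above follow directly from the stated totals (approximately 30GB over approximately 10 days). A minimal sketch of that arithmetic, using only the numbers in the quote; the variable names are illustrative and not taken from the post:

```python
# Reproduce the server-load arithmetic from the quoted blog post.
# Inputs are the approximate totals stated in the quote.
total_gb = 30.0   # approximate data downloaded, in GB
days = 10.0       # approximate duration of the download, in days

gb_per_day = total_gb / days        # average load per day
gb_per_hour = gb_per_day / 24       # average load per hour
gb_per_minute = gb_per_hour / 60    # average load per minute

print(f"{gb_per_day:.0f} GB/day, {gb_per_hour:.3f} GB/h, {gb_per_minute:.4f} GB/min")
```

These are averages over the whole period, which is how the quote reports them.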
  69. 69. Some Children of the Digital Enlightenment • David Carroll & Joe McArthur: OAButton • Rayna Stamboliyska & Pierre-Carl Langlais • Jon Tennant • Ross Mounce • Jenny Molloy • Erin McKiernan • Jack Andraka • Michelle Brook • Heather Piwowar • TheContentMine Team • Rufus Pollock • Jonathan Gray • Sophie Kay Jean-Claude Bradley [1], a chemist, developed Open Notebook Science: making the entire primary record of a research project publicly available online as it is recorded (WP). J-C promoted these ideas with UNDERGRADUATE scientists. [1] Unfortunately J-C died in 2014; we held a memorial meeting in Cambridge. Sophie Kay
  71. 71. OPEN CLOSED Zenodo Figshare Git Dat OpenOffice Word, PPT LabTrove, Chemdraw CrystallographyOpenDB Cambridge Cryst data Centre WriteLatex / Overleaf ReadCube, Symplectic,
  73. 73. Open Source software inspires Open Science Jean-Claude Bradley 2006
  74. 74. Ross Mounce (Bath), Panton Fellow • Sharing research data: • How-to figures from PLOS/One [link]: Ross shows how to bring figures to life: • PLOSOne at • PLOS at (demo)