The life-sciences as a pathfinder in data-intensive research practice

Presentation given at UQ Winterschool 2014. The advent of the Internet is bringing about fundamental changes in the ways that research is performed and communicated. These have been particularly driven by the growing importance of data, as well as the tools available to work with this data. This presentation will examine this shift, drawing on examples from the life‐sciences, and try to make some predictions about the next five years.

Published in: Science, Technology

  • The story being told here might initially seem to be in pieces, but there is a common thread.
    The point of the first section is to give broad context for the two case studies
  • Increasingly, Share is bleeding into Do, so let’s zoom in on this
  • Want to provide a series of snapshots of the future drawn from the life-sciences
  • Sourceforge is another example
  • DNA variant of NG_000007.3 (hemoglobin)
    Sardinian population
    Provenance: authors of the article from which the nanopub was mined
  • Content: Post-publication peer review of pubs
  • Publons aims to change all that. Members of the site can import papers, rate them, and discuss them. In ongoing discussions, members can endorse reviews. When the endorsements reach a certain threshold, the review gains a digital object identifier (DOI), turning it into an object that can be cited in more traditional academic literature.
  • Content: Multiple sources checking the validity/classification of data
  • Could also have had this for Registration, of course
  • Problem of reproducibility is just part of the problem
  • Integrity used to be based on reliable archives
  • Accelerated evolution (again, like Cambrian explosion)
  • Not supported after 2016
  • Omictools, Seqanswers

    I am reminded a bit of the early days of computing and the proliferation of word processors
  • One way to think about this problem is in terms of diffusion of innovation
  • So no pressure then…
  • The life-sciences as a pathfinder in data-intensive research practice

    1. The life-sciences as a pathfinder in data-intensive research practice. Dr Andrew Treloar, Director of Technology. 11 July 2014 CC-BY-SA, @atreloar
    2. Structure of presentation
        • Research Lifecycles
        • Functions of Scholarly Communication
        • Pointers to the future
        • Characterising the future
        • Pathfinder problems
        • Conclusions
    3. So many lifecycles… 11 July 2014 CC-BY-SA, @hvdsomp and @atreloar
    4. Minimal Research Lifecycle: Think, Do, Share
    5. Sharing: Scholarly Communication System and its Functions (Roosendaal and Geurts, 1997)
        • Registration
        • Certification
        • Awareness
        • Archiving
    6. System of Journals
        • Registration
            • submission of manuscript
        • Certification
            • peer-review (pre-publication)
            • commentary (post-publication)
        • Awareness
            • discovery services
        • Archiving
            • libraries (print)
            • publishers (electronic)
            • special purpose organisations (e.g. Portico)
    7. Pointers to the future: “the future is already here – it’s just not very evenly distributed” (William Gibson, NPR interview)
    8. Registration: BioRxiv
    9. Registration: Github
    10. Registration: WikiPathways
    11. Registration: NeuroLex
    12. Registration: Nanopublications
    13. Registration: some observations
        • Decoupling registration from certification
        • Timestamping, versioning
        • Registration of various types of objects
        • Machines as creators and contributors
    14. Certification: PubMed Commons
    15. Certification: PubPeer
    16. Certification: Publons
    17. Certification: some observations
        • Peer-review decoupled from publication process
        • Certification of various types of objects
        • Machines validating form
        • Social endorsement
    18. Awareness: myExperiment
    19. Awareness: eLabNotebook RSS
    20. Awareness: Twitter
    21. Awareness: some observations
        • Awareness for various types of objects
        • Real time awareness
        • Awareness support targeted at machines
        • Awareness through social media
    22. Archiving: PDB
    23. Archiving: GenBank
    24. Characterising the future
        • Research process: Hidden to Visible
        • Nature of object: Fixed to Varying
        • Process of making public: Discrete to Continuous
        • Speed of communication: Delayed to Instant
        • Atomicity of object: Atomic to Compound
        • Communicated object: Publication + data proxies to Publication + linked data + linked models
        • Nature of process: Formal to Informal
    25. Fundamental changes
        • The research process (objects, social dimension) is becoming more exposed
        • Articles, books are no longer the only relevant objects for research communication
        • Objects are no longer static
        • Machines are joining humans as (co-)creators and consumers of research objects
    26. Pathfinder problems
        • Integrity of the scholarly record
        • The three obsolescences
            • hardware
            • file format
            • software
    27. System of Journals: Archiving
    28. Web of Objects: Archiving?
    29. Not just citation relationships
    30. The problem of obsolescence
        • Lifescience research environment can be viewed as undergoing a process of accelerated evolution
        • Other disciplines will hit these problems in time
    31. Cambrian explosion
    32. Hardware obsolescence: Roche 454
    33. Software obsolescence: too much choice, not enough support
    34. Abandonware
        • “Last summer, a member of the biology department of the University of Udine in Italy approached Nicola Vitacolonna with an intriguing project. The ANREP program, which annotates structural motifs in gene or protein sequences, was out of date having been written more than a decade ago. Although still used by molecular biologists, its slow computing ability meant a straightforward multiple search could take all night on a desktop PC. The Udine biologist wanted Vitacolonna, a postdoctoral fellow in computational biology, to write a program that could do the job more quickly.”
        • Sam Jaffe, Scientists Abandon their Software, The Scientist, Feb 16, 2004
    35. File format obsolescence: Illumina
        • The probability of error in basecalling is encoded using ASCII codes to reduce file size
        • The meaning of the ASCII code changed over the life cycle, so data generated at different time points may have its quality encoded differently
        • “If you get an error like "Invalid quality score value", your fastq file probably has Sanger (offset 33) instead of Illumina (ASCII offset 64) quality scores. You'll need to add the option "-Q33" to your FASTX Toolkit arguments”. Obviously…
    36. Everett Rogers, Diffusion of Innovations, 1962
    37. Conclusions
        • Need to move to a smaller number of standard file formats
        • Need to move to a more sustainable model of software development and maintenance
        • Need to encourage platform manufacturers to innovate around the hardware, not the software
        • NOTE: other disciplines are looking to lifesciences to work out how to solve some of these problems
    38. On best practices in the development of bioinformatics software, Front. Genet., 02 Jul 14
        • Source code available to reviewers
        • Software indexed, citable, available
        • Source code documented
        • Source code managed
        • Test libraries, sample data and dataset repositories available
    39. Questions?
        • andrew.treloar@ands.org.au
        • @atreloar
        • https://www.slideshare.net/atreloar/the-lifesciences-as-a-pathfinder-in-dataintensive-research-practice
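
The file format problem on slide 35 can be made concrete with a small sketch. It assumes the usual Phred convention, in which each quality character stores a score Q as the character code minus an offset (33 for Sanger-style files, 64 for older Illumina files), and Q corresponds to an error probability of 10^(-Q/10). The function names below are hypothetical and do not come from the FASTX Toolkit. The offset guess is only a heuristic: it ignores the older Solexa scale, and a Sanger-encoded read made up entirely of high scores would be misread as offset 64.

```python
def guess_phred_offset(quality: str) -> int:
    """Heuristically guess the ASCII offset (33 or 64) of a FASTQ quality string.

    Assumption: any character below '@' (code 64) cannot be a valid
    offset-64 score, so the string must be offset 33. Otherwise we guess 64,
    which can be wrong for offset-33 reads containing only high scores.
    """
    lowest = min(ord(ch) for ch in quality)
    return 33 if lowest < 64 else 64


def decode_quality(quality: str, offset: int) -> list[int]:
    """Convert a quality string to Phred scores using the given offset."""
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 for score in scores):
        # The situation behind errors such as "Invalid quality score value":
        # the file was written with a smaller offset than the one assumed.
        raise ValueError(f"quality string is not valid for offset {offset}")
    return scores


def error_probability(score: int) -> float:
    """Phred definition: probability that the base call is wrong."""
    return 10 ** (-score / 10)


if __name__ == "__main__":
    sanger_read = "II5!"    # offset 33
    illumina_read = "hhTB"  # offset 64
    for read in (sanger_read, illumina_read):
        offset = guess_phred_offset(read)
        print(read, offset, decode_quality(read, offset))
```

The same four bytes decode to different scores depending on the offset, which is why a file whose encoding is not recorded anywhere can be misinterpreted silently or rejected outright.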
