Presentation given at UQ Winterschool 2014. The advent of the Internet is bringing about fundamental changes in the ways that research is performed and communicated. These have been particularly driven by the growing importance of data, as well as the tools available to work with this data. This presentation will examine this shift, drawing on examples from the life‐sciences, and try to make some predictions about the next five years.
The life-sciences as a pathfinder in data-intensive research practice
1. The life-sciences as a pathfinder in data-intensive research practice
Dr Andrew Treloar, Director of Technology
11 July 2014 CC-BY-SA, @atreloar 1
2. Structure of presentation
Research Lifecycles
Functions of Scholarly Communication
Pointers to the future
Characterising the future
Pathfinder problems
Conclusions
5. Sharing: Scholarly Communication System and its Functions
Registration
Certification
Awareness
Archiving
(Roosendaal and Geurts, 1997)
11 July 2014 CC-BY-SA, @hvdsomp and @atreloar 5
6. System of Journals
Registration
submission of manuscript
Certification
peer-review (pre-publication)
commentary (post-publication)
Awareness
discovery services
Archiving
libraries (print)
publishers (electronic)
special purpose organisations (e.g. Portico)
7. Pointers to the future
“the future is already here – it’s just not very evenly distributed”
William Gibson, NPR interview
13. Registration: some observations
Decoupling registration from certification
Timestamping, versioning
Registration of various types of objects
Machines as creators and contributors
17. Certification: some observations
Peer-review decoupled from publication process
Certification of various types of objects
Machines validating form
Social endorsement
21. Awareness: some observations
Awareness for various types of objects
Real time awareness
Awareness support targeted at machines
Awareness through social media
24. Characterising the future
Nature of object: Fixed → Varying
Process of making public: Discrete → Continuous
Research process: Hidden → Visible
Speed of communication: Delayed → Instant
Atomicity of object: Atomic → Compound
Communicated object: Publication + data proxies → Publication + linked data + linked models
Nature of process: Formal → Informal
25. Fundamental changes
The research process (objects, social dimension) is becoming more exposed
Articles and books are no longer the only relevant objects for research communication
Objects are no longer static
Machines are joining humans as (co-)creators and consumers of research objects
26. Pathfinder problems
Integrity of the scholarly record
The three obsolescences
hardware
file format
software
27. System of Journals: Archiving
28. Web of Objects: Archiving?
29. Not just citation relationships
30. The problem of obsolescence
The life-science research environment can be viewed as undergoing a process of accelerated evolution
Other disciplines will hit these problems in time
34. Abandonware
“Last summer, a member of the biology department of the University of Udine in Italy approached Nicola Vitacolonna with an intriguing project. The ANREP program, which annotates structural motifs in gene or protein sequences, was out of date having been written more than a decade ago. Although still used by molecular biologists, its slow computing ability meant a straightforward multiple search could take all night on a desktop PC. The Udine biologist wanted Vitacolonna, a postdoctoral fellow in computational biology, to write a program that could do the job more quickly.”
Sam Jaffe, Scientists Abandon their Software, The Scientist, Feb 16, 2004
35. File format obsolescence: Illumina
Probability of error in basecalling is encoded using ASCII codes to reduce file size
Meaning of the ASCII code changed over the platform's life cycle, so data generated at different time points might encode quality differently
“If you get an error like "Invalid quality score value", your fastq file probably has Sanger (offset 33) instead of Illumina (ASCII offset 64) quality scores. You'll need to add the option "-Q33" to your FASTX Toolkit arguments”. Obviously…
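The offset problem described on this slide can be made concrete with a minimal Python sketch. It assumes the standard Phred relation Q = -10 log10(p) and the two offsets named in the quote (33 and 64). The function names `decode_quality`, `error_probability` and `guess_offset` are hypothetical, not part of the FASTX Toolkit or any other library. The guessing rule is a rough heuristic only: it ignores the older Solexa scoring scale and cannot tell the encodings apart when every character falls in the range both share.

```python
def decode_quality(quality, offset):
    """Convert a FASTQ quality string to Phred scores for a given ASCII offset."""
    scores = [ord(ch) - offset for ch in quality]
    if scores and min(scores) < 0:
        # A negative score means the string cannot have been written with this offset.
        raise ValueError(f"character below ASCII {offset}: wrong offset assumed?")
    return scores


def error_probability(score):
    """Standard Phred relation: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


def guess_offset(quality):
    """Heuristic (assumption): characters below ASCII 59 can only be offset-33 data.

    Otherwise 64 is returned, although high-quality offset-33 reads can look the
    same, so a real check should inspect many reads, not one.
    """
    lowest = min(ord(ch) for ch in quality)
    return 33 if lowest < 59 else 64


if __name__ == "__main__":
    # The same base quality (Q40) is written as 'I' at offset 33 and 'h' at offset 64.
    print(decode_quality("IIII", 33))
    print(decode_quality("hhhh", 64))
    print(error_probability(40))
```

Reading an offset-33 file as if it were offset 64 is what produces errors like the one quoted: characters such as `!` decode to negative scores and are rejected.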
37. Conclusions
Need to move to a smaller number of standard file formats
Need to move to a more sustainable model of software development and maintenance
Need to encourage platform manufacturers to innovate around the hardware, not the software
NOTE: other disciplines are looking to lifesciences to work out how to solve some of these problems
38. On best practices in the development of bioinformatics software, Front. Genet., 02 Jul 14
Source code available to reviewers
Software indexed, citable, available
Source code documented
Source code managed
Test libraries, sample data and dataset repositories available
The story being told here might initially seem to be in pieces, but there is a common thread.
Point of first section is broad context for two case studies
Increasingly, Share is bleeding into Do, so let’s zoom in on this
Want to provide a series of snapshots of the future drawn from lifesciences
Sourceforge is another example
DNA variant of NG_000007.3 (hemoglobin)
Sardinian population
Provenance: authors of the article from which the nanopub was mined
Content: Post-publication peer review of pubs
Publons aims to change all that. Members of the site can import papers, rate them, and discuss them. In ongoing discussions, members can endorse reviews. When the endorsements reach a certain threshold, the review gains a digital object identifier (DOI), turning it into an object that can be cited in more traditional academic literature.
Content: Multiple sources checking the validity/classification of data
Could also have had this for Registration, of course
Problem of reproducibility is just part of the problem
Integrity used to be based on reliable archives
Accelerated evolution (again, like Cambrian explosion)
Not supported after 2016
Omictools, Seqanswers
I am reminded a bit of the early days of computing and the proliferation of word processors
One way to think about this problem is in terms of diffusion of innovation