3. Invited to reminisce!
…and perhaps inform the BRAIN2050 initiative.
Note for the young: “bioinformatics” and “systems biology”
are now simply “biology”.
Monday, July 11th, 2039
4. The 20-teens and onwards
1. Too Much Data: The Datapocalypse
2. Great results, seen once: the reproducibility crisis.
3. Mind the gap: computation in biology.
6. Too… much… data…
Between –omics, automated sensor data, and data
sharing, biology grew into a data-intensive science.
Volume, velocity, variety: the general problem.
But also!
Biology was optimized for hypothesis-driven investigation,
not data exploration!
Long arguments over “which is better”, with the people
who controlled the funding => winning.
7. HTC, not HPC
For lots of data, High Throughput Computing was needed –
but compute was cheap, not throughput!
Figure from bbc.co.uk
9. The reproducibility crisis -
why??
It was a well-known fact within biotech that the majority
of published experiments were largely lab-specific.
Neither career incentives nor funding for replication
were there! (In fact, quite the contrary…)
This slowly started to change later in the decade, as
the public caught on…
10. Shift in “publication”
recognition
Hard to believe now, but back then, people were
rewarded for the first (claimed) “observation” of an
effect.
The two-lab rule was only instituted as best practice in
the early 2020s, once reviewers started rejecting papers
unaccompanied by a replication report.
Funding shift followed, of course.
11. 3. Computing & data in biology
Of the sciences, biology had always
been the weakest in terms of computing
education.
This became a complete disaster once
the data tsunami hit – labs generated
data sets they couldn’t analyze,
graduate students planned experiments
that relied on computing they couldn’t
do.
Photo from Wikipedia
12. The “easy to use” tools fiasco
Immense investment in the late ’teens in tools that were
“easy to use” – push-button data analysis, etc.
This worked well outside of research; however, it turns
out you can’t place most data analysis in a black box.
“Easy to use” tools embodied so many assumptions
that most results were simply invalid.
13. => Bioinformatics
“sweatshops”
Cadre of students and low-paid employees devoted to
“service bioinformatics”
No career path, no significant authorship…
…but necessary for big labs to make progress!
14. Things came to a head…
www.sanantonio-urbanliving.com
15. The tipping point
The well-trained students left for the data science
industry;
More and more papers were being written by people
who didn’t understand the computing…
…and an increasing number of them were being
rejected…
…until the supply of reviewers ran out…
17. Bioinformaticians, revolt!
Bioinformatics reviewers essentially unionized and laid down
three rules:
1. All of the data and source code must be provided for any
paper.
2. Full methods sections and references are included in the
primary paper review.
3. No unpublished methods can be used in data analysis.
In the end, the only people that complained were companies
like MS Elsevier, because preprints.
19. Part of a larger renaissance
for biology!
Starting in ~2020,
1. Biomedical enterprise rediscovers basic biology;
2. Rise and triumph of open science;
3. A transition to networked science;
4. Massive investment in people.
21. The biomedical community backs away
from translational medicine.
Several veterinary and agricultural animals proved to
be better model organisms for human disease than
mouse;
Ecology and evolution provided valuable theoretical
and empirical observations for understanding human
genetics.
Microbial interactions between environment and human
proved to be important as well; built environment,
disease reservoirs, etc.
Cheap sequencing enabled a vast array of studies.
22. 2. Open science triumphs!
The computational community knew this by 2016, but it
took a few years for the rest of biology…
A curious story!
1. Biotech pressured congresspeople into decreasing
funding for experiments, since analysis was usually
wrong and raw data was never available;
2. Funding crunch, more generally, tightened the screws
further;
3. Hypothesis-driven labs couldn’t compete…
23. …hypothesis-driven lab science joined
with discovery.
Eventually, funders mandated
data availability;
Labs that made use of available
data had a dramatic edge in
hypothesis-driven
experimentation;
Data-driven modeling and
model-driven data interpretation
blossomed!
Image from emory.edu
24. 3. A transition to networked
science
25. Universities collapsed!
So all the senior professors and administrators retired…
Massive brain drain…
… enabled a massive increase in creativity in the research
enterprise!
Collaboration tools, data sharing, distributed team science…
26. “Walled garden” model
Pioneered by Sage Bionetworks in the ~2010s
Data collection done by small consortia;
Data made available to all, but publication in step.
Model is of course obsolete nowadays, but was quite effective back then.
27. 4. Massive investment in
people
The NIH finally invested heavily in training.
Among other things:
Data Carpentry
Model Carpentry
(We won! Yay!)
28. There are still problems, of
course!
What do most genes do? Functional annotations are
still poor. Some approaches --
Biogeochemistry
Synthetic biology
Career paths for experimental biologists are very
uncertain.
“Glam data”
Cancer is cured, but many complex diseases –
especially neurodegenerative ones – remain poorly
understood.
29. BRAIN2050
Ambitious 10-year proposal to “understand the brain”
by 2050.
Focus on neurodegenerative diseases, regeneration,
and a mechanistic understanding of intelligence.
What mistakes can they avoid, with the benefit of
hindsight?
30. Correlation is not causation
You’d think we’d have learned this by now!?
Original MIND project 25 years ago failed for this
reason. (“Record ALL the neurons”)
Image from Wikipedia
31. (Computational) modeling is
critical
Can we develop models that embody hypotheses that
we can then “test” against the data?
Holistic multidisciplinary research.
(Brain community has always been better off here…)
32. Focus less on reproducibility
A strict requirement for independent replication is
strangling us!
Completely independent replication is a strong
requirement; understandable, given disasters of the
past, but also slow.
Can we compromise?
33. “Replication debt”
Can we borrow the idea of “technical debt” from software
engineering?
Semi-independent replication after initial exploratory
phase, followed by articulation of protocols and
independent replication.
Public acknowledgement of debt is important.
Image from blog.crisp.se
35. Invest in infrastructure for
collaboration and sharing
Data sharing is a given.
But existing tools still merely support, rather than
drive, science with data sharing!
Push for collaborative process from the outset.
36. Can we help drive collaboration with
technology?
Diagram: Gather data; Deposit data; Compare against other data sets; Notify; Notify
See e.g. pebourne.wordpress.com/2014/01/04/universities-as-big-data/
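One way to picture the gather / deposit / compare / notify loop is the minimal Python sketch below. Everything in it is hypothetical and chosen only for illustration: the `Dataset` and `Repository` classes, the `deposit` method, the Jaccard-overlap comparison, the 0.5 threshold, and the reading of the two "Notify" boxes as "notify both owners" are all assumptions, not something the slide or the linked post specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    owner: str        # who gathered the data
    name: str
    features: set     # e.g. gene IDs observed in this data set


@dataclass
class Repository:
    """Hypothetical shared repository that compares every new deposit."""
    threshold: float = 0.5                      # assumed similarity cutoff
    datasets: list = field(default_factory=list)
    notifications: list = field(default_factory=list)

    @staticmethod
    def similarity(a: Dataset, b: Dataset) -> float:
        # Jaccard overlap of feature sets; a stand-in for a real comparison.
        union = a.features | b.features
        return len(a.features & b.features) / len(union) if union else 0.0

    def deposit(self, new: Dataset) -> None:
        # Compare the new deposit against every data set already stored ...
        for old in self.datasets:
            score = self.similarity(new, old)
            if score >= self.threshold and old.owner != new.owner:
                # ... and record a notification for both owners (assumption).
                self.notifications.append((new.owner, old.name, score))
                self.notifications.append((old.owner, new.name, score))
        self.datasets.append(new)


repo = Repository()
repo.deposit(Dataset("lab_a", "liver_rnaseq", {"g1", "g2", "g3"}))
repo.deposit(Dataset("lab_b", "kidney_rnaseq", {"g2", "g3", "g4"}))
for who, other, score in repo.notifications:
    print(f"notify {who}: overlaps with {other} (Jaccard {score:.2f})")
```

The point of the sketch is that comparison and notification happen as a side effect of depositing, so collaboration is prompted by the infrastructure rather than left to chance.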
37. Tool up! But evaluate,
compare, understand.
Having a robust and competitive software ecosystem is
important for innovation and creativity.
Available, open, reusable, remixable: all critical!
Benchmarks are not always useful; understanding
always is.
38. Build commercial software only when
basics are understood
Diagram: Research development; Easy-to-use commercial software; "Popular" protocols
39. Invest in training as first-class
research citizen!
Undergraduates
K-12 students
Graduate students
The high school students of yesterday are
the research scientists of tomorrow.
40. It’s the network, dummies.
Single-molecule full-genome sequences did not provide
understanding.
Reductionist studies of gene function did not provide
understanding.
Neither will high-resolution ensemble neuronal sampling.
Our main obstacle in understanding aging has been that it seems
to be systemic, just like neurodegeneration.
41. Concluding thoughts (I)
Many things the BRAIN2050 field can do to invest in its
own future and accelerate progress!
Bitter lessons learned from decades of mistakes in
other fields; maybe we can do better?
43. All right…
Future talk over
I thought I’d use this as a foil to highlight issues that I
think are important for the future.
But:
44. We have to get used to the idea that radical
change keeps happening ... even after 1997.
"Among the pessimists, molecular biologist Gunther Stent
suggests that science is reaching a point of incremental,
diminishing returns as it comes up against the limits of
knowledge..." --review by Publishers Weekly
First published by Broadway Books on May 5, 1997. Via Erich Schwarz
46. Robert Heinlein's four curves of predicted human
progress (described in 1950)
Ref.: Heinlein, R.A. (1950), "Where To?".
"The solid curve ...
represents many things -- use of power, speed of
transport, numbers of scientific and technical workers,
advance in communication, average miles traveled per
person per year, advances in mathematics ... Call it the
curve of human achievement."
Via Erich Schwarz
47. Robert Heinlein's four curves of predicted human
progress (described in 1950)
"Despite everything,
there is a stubborn 'common sense' tendency to project
it along dotted line number (1) like the patent office
official of a hundred years back who quit his job
'because everything had already been invented'."
Ref.: Heinlein, R.A. (1950), "Where To?". Via Erich Schwarz
48. Robert Heinlein's four curves of predicted human
progress (described in 1950)
"Even those who don't
expect a slowing up at once tend to expect us to reach a
point of diminishing returns -- dotted line number (2)."
Ref.: Heinlein, R.A. (1950), "Where To?". Via Erich Schwarz
49. Robert Heinlein's four curves of predicted human
progress (described in 1950)
"Very daring minds are
willing to predict that we will continue our present
rate of progress -- dotted line number (3) -- a tangent."
Ref.: Heinlein, R.A. (1950), "Where To?". Via Erich Schwarz
50. Robert Heinlein's four curves of predicted human
progress (described in 1950)
Ref.: Heinlein, R.A. (1950), "Where To?".
"But the proper way to
project the curve is dotted line number (4), because
there is no reason, mathematical, scientific, or
historical, to expect that curve to flatten out... The
correct projection ... is for the curve to go on up
indefinitely with increasing steepness..."
Via Erich Schwarz
51. Conclusion --
I certainly don’t know where we’re headed; no one else does
either.
We must invest in people and process; we must help figure
out what the right process is and then provide career
incentives for people to do things that way.
This community should be leading the way:
Bioinformatics Open Source Conference
(Reminder: we will win.)
54. Prospects for U.S. public funding of science
Ref.: U.S. Government Accountability Office, Citizen's Guide of 2010.
55. Public support for science
matters!
Data sharing, openness => maximizing return.
Must figure out how to align career and funding
incentives.
We are currently doing a horrible job of this…
…I’m looking forward to Phil Bourne’s talk :)
56. Thanks!
Discussions with Phil Bourne (NIH), Erich Schwarz (Caltech
& Cornell), Katherine Mejia-Guerra (OSU) and Jeffrey
Campbell (OSU).
All of this will be (is already?) posted online.
“The next 10 years of quant bio” by Mike Schatz
…with apologies to Gary Bernhardt
(Birth & Death of JavaScript – go watch it!)