2014 bosc-keynote


Published in: Science, Technology

  1. 1. It’s hard to make predictions – especially about the future. -- attributed to Niels Bohr Monday, July 11th, 2039
  2. 2. A History of “Bioinformatics” C. Titus Brown Monday, July 11th, 2039
  3. 3. Invited to reminisce!  …and perhaps inform the BRAIN2050 initiative. Note for the young: “bioinformatics” and “systems biology” are now simply “biology”.
  4. 4. The 20-teens and onwards 1. Too Much Data: The Datapocalypse 2. Great results, seen once: the reproducibility crisis. 3. Mind the gap: computation in biology.
  5. 5. 1. The Datapocalypse
  6. 6. Too… much… data… Between –omics, automated sensor data, and data sharing, biology grew into a data-intensive science. Volume, velocity, variety: the general problem. But also! Biology was optimized for hypothesis-driven investigation, not data exploration! Long arguments over “which is better”, with the people who controlled the funding => winning.
  7. 7. HTC, not HPC For lots of data, High Throughput Computing was needed – but compute was cheap, not throughput! Figure from
  8. 8. 2. The reproducibility crisis Trials Failed
  9. 9. The reproducibility crisis - why??  Well known fact among biotech that the majority of published experiments were largely lab-specific.  Neither career incentives nor funding were there! (In fact, quite the contrary…)  This slowly started to change later in the decade, as the public caught on…
  10. 10. Shift in “publication” recognition  Hard to believe now, but back then, people were rewarded for the first (claimed) “observation” of an effect.  Two-lab rule was only instated as best practice in the early 2020s, once reviewers started rejecting papers unaccompanied by a replication report.  Funding shift followed, of course.
  11. 11. 3. Computing & data in biology Of the sciences, biology had always been the weakest in terms of computing education. This became a complete disaster once the data tsunami hit – labs generated data sets they couldn’t analyze, graduate students planned experiments that relied on computing they couldn’t do. Photo from Wikipedia
  12. 12. The “easy to use” tools fiasco  Immense investment in late ‘teens in tools that were “easy to use” – push-button data analysis, etc.  This worked well outside of research; however, it turns out you can’t place most data analysis in a black box.  “Easy to use” tools embodied so many assumptions that most results were simply invalid.
  13. 13. => Bioinformatics “sweatshops”  Cadre of students and low-paid employees devoted to “service bioinformatics”  No career path, no significant authorship…  …but necessary for big labs to make progress!
  14. 14. Things came to a head…
  15. 15. The tipping point  The well-trained students left for the data science industry;  More and more papers were being written by people who didn’t understand the computing…  …and an increasing number of them were being rejected…  …until the supply of reviewers ran out…
  16. 16. And then… California. Map from Wikimedia
  17. 17. Bioinformaticians, revolt! Bioinformatics reviewers essentially unionized and laid down three rules: 1. All of the data and source code must be provided for any paper. 2. Full methods sections and references are included in the primary paper review. 3. No unpublished methods can be used in data analysis. In the end, the only people that complained were companies like MS Elsevier, because preprints.
  18. 18. Replication “parties” A community of practice emerged around replication! (Diagram labels: Open peer review; Replication study!; Bad?; Group I; Group II; Group III; Publication of replication attempt)
  19. 19. Part of a larger renaissance for biology! Starting in ~2020, 1. Biomedical enterprise rediscovers basic biology; 2. Rise and triumph of open science; 3. A transition to networked science; 4. Massive investment in the people;
  20. 20. 1. Rediscovering basic biology
  21. 21. The biomedical community backs away from translational medicine.  Several veterinary and agricultural animals proved to be better model organisms for human disease than mouse;  Ecology and evolution provided valuable theoretical and empirical observations for understanding human genetics.  Microbial interactions between environment and human proved to be important as well; built environment, disease reservoirs, etc.  Cheap sequencing enabled a vast array of studies.
  22. 22. 2. Open science triumphs!  The computational community knew this by 2016, but it took a few years for the rest of biology… A curious story! 1. Biotech pressured congresspeople into decreasing funding for experiments, since analysis was usually wrong and raw data was never available; 2. Funding crunch, more generally, tightened the screws further; 3. Hypothesis-driven labs couldn’t compete…
  23. 23. …hypothesis-driven lab science joined with discovery.  Eventually, funders mandated data availability;  Labs that made use of available data had a dramatic edge in hypothesis-driven experimentation;  Data-driven modeling and model-driven data interpretation blossomed! Image from
  24. 24. 3. A transition to networked science
  25. 25. Universities collapsed!  So all the senior professors and administrators retired…  Massive brain drain…   … enabled a massive increase in creativity in the research enterprise!  Collaboration tools, data sharing, distributed team science…
  26. 26. “Walled garden” model Pioneered by Sage Bionetworks in ~2010s  Data collection done by small consortia;  Data made available to all, but publication in step. Model is of course obsolete nowadays, but was quite effective back then.
  27. 27. 4. Massive investment in people The NIH finally invested heavily in training. Among other things: Data Carpentry Model Carpentry (We won! Yay!)
  28. 28. There are still problems, of course!  What do most genes do? Functional annotations are still poor. Some approaches --  Biogeochemistry  Synthetic biology  Career paths for experimental biologists are very uncertain.  “Glam data”  Cancer is cured, but many complex diseases – especially neurodegenerative ones – remain poorly understood.
  29. 29. BRAIN2050  Ambitious 10-year proposal to “understand the brain” by 2050.  Focus on neurodegenerative diseases, regeneration, and a mechanistic understanding of intelligence.  What mistakes can they avoid, with the benefit of hindsight?
  30. 30. Correlation is not causation  You’d think we’d have learned this by now!?  Original MIND project 25 years ago failed for this reason. (“Record ALL the neurons”) Image from Wikipedia
  31. 31. (Computational) modeling is critical  Can we develop models that embody hypotheses that we can then “test” against the data? Holistic multidisciplinary research. (Brain community has always been better off here…)
  32. 32. Focus less on reproducibility  A strict requirement for independent replication is strangling us!  Completely independent replication is a strong requirement; understandable, given disasters of the past, but also slow.  Can we compromise?
  33. 33. “Replication debt”  Can we borrow the idea of “technical debt” from software engineering?  Semi-independent replication after initial exploratory phase, followed by articulation of protocols and independent replication. Image from
  34. 34. “Replication debt”  Semi-independent replication after initial exploratory phase, followed by articulation of protocols and independent replication.  Public acknowledgement of debt is important. Image from
  35. 35. Invest in infrastructure for collaboration and sharing  Data sharing is a given  But existing tools still merely support rather than drive science with data sharing!  Push for collaborative process from the outset.
  36. 36. Can we help drive collaboration with technology? (Diagram labels: Gather data; Deposit data; Compare against other data sets; Notify; Notify) See e.g.
  37. 37. Tool up! But evaluate, compare, understand.  Having a robust and competitive software ecosystem is important for innovation and creativity.  Available, open, reusable, remixable: all critical!  Benchmarks are not always useful; understanding always is.
  38. 38. Build commercial software only when basics are understood Research development Easy-to-use commercial software "Popular" protocols
  39. 39. Invest in training as first-class research citizen! The high school students of yesterday are the research scientists of tomorrow. (Diagram labels: Undergraduates; K-12 students; Graduate students)
  40. 40. It’s the network, dummies.  Single molecule full genome sequences did not provide understanding.  Reductionist studies of gene function did not provide understanding.  Neither will high resolution ensemble neuronal sampling.  Our main obstacle in understanding aging has been that it seems to be systemic, just like neurodegeneration.
  41. 41. Concluding thoughts (I)  Many things the BRAIN2050 field can do to invest in its own future and accelerate progress!  Bitter lessons learned from decades of mistakes in other fields; maybe we can do better?
  43. 43. All right…  Future talk over   I thought I’d use this as a foil to highlight issues that I think are important for the future. But:
  44. 44. We have to get used to the idea that radical change keeps happening ... even after 1997. First published by Broadway Books on May 5, 1997. Via Erich Schwarz
  45. 45. We have to get used to the idea that radical change keeps happening ... even after 1997. "Among the pessimists, molecular biologist Gunther Stent suggests that science is reaching a point of incremental, diminishing returns as it comes up against the limits of knowledge..." --review by Publishers Weekly First published by Broadway Books on May 5, 1997. Via Erich Schwarz
  46. 46. Robert Heinlein's four curves of predicted human progress (described in 1950) Ref.: Heinlein, R.A. (1950), "Where To?". "The solid curve ... represents many things -- use of power, speed of transport, numbers of scientific and technical workers, advance in communication, average miles traveled per person per year, advances in mathematics ... Call it the curve of human achievement." Via Erich Schwarz
  47. 47. Robert Heinlein's four curves of predicted human progress (described in 1950) "Despite everything, there is a stubborn 'common sense' tendency to project it along dotted line number (1) like the patent office official of a hundred years back who quit his job 'because everything had already been invented'." Ref.: Heinlein, R.A. (1950), "Where To?". Via Erich Schwarz
  48. 48. Robert Heinlein's four curves of predicted human progress (described in 1950) "Even those who don't expect a slowing up at once tend to expect us to reach a point of diminishing returns -- dotted line number (2)." Ref.: Heinlein, R.A. (1950), "Where To?". Via Erich Schwarz
  49. 49. Robert Heinlein's four curves of predicted human progress (described in 1950) "Very daring minds are willing to predict that we will continue our present rate of progress -- dotted line number (3) -- a tangent." Ref.: Heinlein, R.A. (1950), "Where To?". Via Erich Schwarz
  50. 50. Robert Heinlein's four curves of predicted human progress (described in 1950) Ref.: Heinlein, R.A. (1950), "Where To?". "But the proper way to project the curve is dotted line number (4), because there is no reason, mathematical, scientific, or historical, to expect that curve to flatten out... The correct projection ... is for the curve to go on up indefinitely with increasing steepness..." Via Erich Schwarz
  51. 51. Conclusion --  I certainly don’t know where we’re headed; no one else does either.  We must invest in people and process; we must help figure out what the right process is and then provide career incentives for people to do things that way. This community should be leading the way: Bioinformatics Open Source Conference (Reminder: we will win.)
  52. 52. But: economics matter 50-million mark note. Weimar Germany, 1923.
  53. 53. Economics matter Ref.: U.S. Government Accountability Office, Citizen's Guide of 2010.
  54. 54. Prospects for U.S. public funding of science Ref.: U.S. Government Accountability Office, Citizen's Guide of 2010.
  55. 55. Public support for science matters!  Data sharing, openness => maximizing return.  Must figure out how to align career and funding incentives.  We are currently doing a horrible job of this…  …I’m looking forward to Phil Bourne’s talk :)
  56. 56. Thanks!  Discussions with Phil Bourne (NIH), Erich Schwarz (Caltech & Cornell), Katherine Mejia-Guerra (OSU) and Jeffrey Campbell (OSU).  All of this will be (is already?) posted online.  “The next 10 years of quant bio” by Mike Schatz  …with apologies to Gary Bernhardt (Birth & Death of JavaScript – go watch it!)