
2014 aus-agta


  1. WHAT’S AHEAD FOR BIOLOGY? THE DATA-INTENSIVE FUTURE
     C. Titus Brown, ctb@msu.edu, Assistant Professor, Michigan State University (in January, moving to UC Davis / VetMed)
     Talk slides on slideshare.net/c.titus.brown
  2. The Data Deluge (a traditional requirement for these talks)
  3. The short version
     • Data gathering & storage is growing by leaps and bounds!
     • Biology is completely unprepared for this at every level: technical and infrastructure, cultural, and training.
     • Our funding/incentivization/prioritization structures are also largely unprepared.
     • This is a huge missed opportunity! (What does Titus think we should be doing?)
  4. Challenges:
     1. Dealing with Big Data (my current research)
     2. Interpreting the unknowns (future research)
     3. Accelerating research with better data/methods/results sharing
     4. Expanding the role of exploratory data analysis in biology (career windmill)
  5. 1. Dealing with Big Data
     A. Lossy compression
     B. Streaming algorithms
  6. Looking forward 5 years… (Navin et al., 2011)
  7. Some basic math:
     • 1000 single cells from a tumor…
     • …sequenced to 40x haploid coverage with Illumina…
     • …yields 120 Gbp per cell…
     • …or 120 Tbp of data in total.
     • A HiSeq X Ten can do the sequencing in ~3 weeks.
     • The variant calling will require ~2,000 CPU-weeks…
     • …so, given ~2,000 computers, the whole analysis can be done in about a month.
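As a sanity check, here is the arithmetic behind those bullets written out in Python (the ~3 Gbp haploid human genome size is my assumption; the other figures come from the slide):

```python
# Back-of-the-envelope check of the data volumes quoted on slide 7.
haploid_genome_bp = 3e9   # ~3 Gbp haploid human genome (assumed)
coverage = 40             # 40x haploid coverage per cell
n_cells = 1000            # single cells from one tumor

bp_per_cell = haploid_genome_bp * coverage   # 1.2e11 bp = 120 Gbp
total_bp = bp_per_cell * n_cells             # 1.2e14 bp = 120 Tbp

cpu_weeks = 2000          # variant-calling budget from the slide
machines = 2000
compute_weeks = cpu_weeks / machines         # ~1 week of wall-clock time

print(f"{bp_per_cell / 1e9:.0f} Gbp per cell, {total_bp / 1e12:.0f} Tbp total")
print(f"~{compute_weeks:.0f} week of variant calling on {machines} machines")
```

With ~3 weeks of sequencing plus ~1 week of compute, the month-long total on the slide checks out.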
  8. Similar math applies to:
     • pathogen detection in blood;
     • environmental sequencing;
     • sequencing rare DNA from circulating blood.
     Two issues:
     • volume of data & compute infrastructure;
     • latency for clinical applications.
  9. Approach A: Lossy compression (reduce the volume of data & compute infrastructure requirements)
     [Pipeline diagram: raw data (~10-100 GB) → compression (~2 GB) → analysis → "information" (~1 GB) → database & integration]
     Lossy compression can substantially reduce data size while retaining the information needed for later (re)analysis.
  10.–14. Lossy compression: JPEG image-quality series (http://en.wikipedia.org/wiki/JPEG)
  15.–20. Digital normalization (step-by-step figure series)
  21. e.g., de novo assembly now scales with richness, not diversity:
     • 10-100 fold decrease in memory requirements
     • 10-100 fold speed-up in analysis
     (Brown et al., arXiv, 2012)
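Slides 15–21 describe digital normalization; here is a minimal Python sketch of the core idea as I read it from the slides and the cited preprint. A plain dict stands in for khmer's probabilistic k-mer counter, and K and the coverage cutoff are illustrative values, not the tool's defaults:

```python
# Minimal digital normalization sketch: keep a read only if its estimated
# coverage so far (median k-mer count) is below a target cutoff; otherwise
# it adds little new information and is discarded.
from statistics import median

K = 20        # k-mer size (illustrative)
CUTOFF = 20   # target coverage (illustrative)
counts = {}   # k-mer -> observed count (stands in for a probabilistic counter)

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def keep_read(seq):
    """Return True (and count the read's k-mers) if coverage is still low."""
    km = kmers(seq)
    if not km:
        return False
    if median(counts.get(x, 0) for x in km) >= CUTOFF:
        return False                      # region already well covered: discard
    for x in km:                          # otherwise count it and keep it
        counts[x] = counts.get(x, 0) + 1
    return True

reads = ["ACGTACGGTTACCGGATTACAGGCATTTACG"] * 3   # toy input
kept = [r for r in reads if keep_read(r)]
print(f"kept {len(kept)} of {len(reads)} reads")
```

With the toy input above all three copies are kept, because the cutoff is never reached; with a cutoff of 1 the second and third copies would be discarded, which is the memory- and time-saving effect slide 21 quantifies.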
  22. Hey, cool, our approach and software are used by Illumina for long-read sequencing!
  23. Our general strategy: compressive prefilters
     [Same pipeline diagram as slide 9: raw data (~10-100 GB) → compression (~2 GB) → analysis → "information" (~1 GB) → database & integration, with two retention paths: save in cold storage, and save for reanalysis and investigation.]
  24. Approach B: streaming data analysis (reduce latency for clinical applications)
     [Diagram: data → one-pass analysis → answer]
     See also eXpress (Roberts et al., 2013).
  25. Current variant calling approaches are multi-pass:
     [Diagram: data → mapping → sorting → calling → answer]
  26. Streaming graph-based approaches can detect information saturation.
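One concrete way to read "detect information saturation" in a streaming setting: watch how many recent reads are already fully explained by the k-mer counts accumulated so far, and stop (or switch modes) once nearly all of them are. The sketch below is my illustration of that idea, not the published algorithm; the window size, coverage threshold, and stop fraction are all assumptions:

```python
# Streaming saturation sketch: consume reads once, and report the point at
# which almost all recent reads map to graph regions already seen at high
# coverage, i.e. the stream has stopped adding information.
from collections import deque
from statistics import median

K = 20
SATURATED_COVERAGE = 20   # illustrative coverage threshold
WINDOW = 100              # judge saturation over the last 100 reads
STOP_FRACTION = 0.95      # saturated when 95% of recent reads are redundant

counts = {}               # k-mer -> count
recent = deque(maxlen=WINDOW)

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def process_stream(reads):
    """Return the read index at which the stream saturates, or None."""
    for n, read in enumerate(reads, 1):
        km = kmers(read)
        if not km:
            continue
        redundant = median(counts.get(x, 0) for x in km) >= SATURATED_COVERAGE
        recent.append(redundant)
        if not redundant:
            for x in km:
                counts[x] = counts.get(x, 0) + 1
        if len(recent) == WINDOW and sum(recent) / WINDOW >= STOP_FRACTION:
            return n          # saturated: downstream analysis can start now
    return None
```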
  27. The approach supports compute-intensive interludes: remapping, etc. (Rimmer et al., 2014)
  28. Streaming with bases
     [Diagram: reads arriving base by base (k bases, then k+1, k+2, …) extend the graph, from which variants are called.]
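As reconstructed above, slide 28's figure comes down to the observation that each newly arrived base completes exactly one new k-mer, so the graph can be extended incrementally as data streams in. A toy sketch of just that bookkeeping (a plain set of k-mers stands in for a real De Bruijn graph, and no actual variant calling is attempted):

```python
# Toy per-base streaming: each new base completes exactly one new k-mer,
# which is added to the graph incrementally. Variant calling would inspect
# branching structure in that graph, which this sketch does not attempt.
K = 21
graph = set()   # a plain set of k-mers stands in for a De Bruijn graph

def stream_bases(bases):
    """Consume bases one at a time, extending the k-mer graph as we go."""
    window = ""
    for base in bases:
        window = (window + base)[-K:]   # keep only the last K bases
        if len(window) == K:
            graph.add(window)           # one new k-mer per new base

stream_bases("ACGTACGGTTACCGGATTACAGGCATTTACGGA")
print(f"{len(graph)} k-mers in the graph so far")
```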
  29. Integrate sequencing and analysis
     [Diagram: sequencing feeding analysis, which keeps asking "are we done yet?"]
     Decrease latency!
  30. So, how do we deal with Big Data issues?
     • Fairly record the cost of data analysis (running software & cost of computational infrastructure).
       • This incentivizes development of better approaches! Lossy compression, streaming, …??
     • Think 5 years ahead, rather than 2 years behind!
     • Pay attention to workflows, software lifecycle, etc. (See ABiC 2014 talk :)
  31. 2. Dealing with the unknowns
     [Workflow caricature: compare sample to control → eliminate all the things we don't know how to interpret (~millions down to ~10s of thousands) → wonder why we can only account for ~50% of the effect.]
  32. “What is the function of ….?”
     We can observe almost everything at the DNA/RNA level! But:
     • experimentally based functional annotations are sparse;
     • most genes play multiple roles and are generally annotated for only one;
     • model organisms are phylogenetically quite limited and biased;
     • …and there is little or no $$$ or reputation gain for characterizing novel genes (nor is it straightforward or easy to do so!)
  33. The problem of lopsided gene characterization: e.g., the brain "ignorome"
     "...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect."
     Slide courtesy of Erich Schwarz. Ref.: Pandey et al. (2014), PLoS ONE 9(2), e88889.
  34. How do we systematically broaden our functional understanding of genes?
     1. More experimental work! Population studies, perturbation studies, good ol’ fashioned molecular biology, etc.
     2. Integrate modeling, to see where we have (or lack) sufficient knowledge for a particular phenotype.
     3. Sequence it all and let the bioinformaticians sort it out!
     What I think will work best: a tight integration of all three approaches (c.f. physics): hypothesis-driven investigation, modeling, and exploratory data science.
     See also: ivory.idyll.org/blog/2014-function-of-unknown-genes.html
  35. 3. Accelerating research with better sharing of results, data, and methods
     "Our current journal system is a 20th-century solution to a 17th-century problem." (paraphrased from Cameron Neylon)
     (Note: the 20th century was LAST century.)
  36. 3. Accelerating research with better sharing of results, data, and methods
     We could accelerate research with better sharing. A recent example concerning rare diseases: http://www.newyorker.com/magazine/2014/07/21/one-of-a-kind-2
     "The current academic publication system does patients an enormous disservice." – Daniel MacArthur
     There are many barriers to better communication of results, data, and methods, but most of them are cultural, not technical. (Much harder!)
  37. Preprints
     • Many fields (including bioinformatics and, increasingly, genomics) routinely share papers prior to publication. This facilitates reproduction, dissemination, and ultimately progress.
     • Biology is behind the times!
     See:
     1. Haldane’s Sieve (blog discussion of preprints)
     2. Evidence that preprints confer massive citation advantage in physics (http://arxiv.org/abs/0906.5418)
  38. Current model for data sharing
     [Pipeline: gather data → interpret data → write & publish paper → grudgingly share as little data as possible]
     In a data-limited world, this kind of made sense.
  39. Current model for data sharing
     [Same pipeline as slide 38]
     This model ignores the fact that data often has multiple (unrealized or serendipitous) uses. (Among many other problems ;)
  40. The train wreck ahead
     [Same pipeline as slide 38]
     When data is cheap and interpretation is expensive, most data doesn't get published, and is therefore lost. (Program managers are not fans of this.)
  41. Data sharing challenges
     • There are few immediate or career rewards for sharing data; incentives are almost entirely punitive (if you DON’T…).
     • Sharing data in a usable form is still rather difficult.
     • Submitting data to archival services is, in many cases, surprisingly difficult.
     • There are few methods for gaining recognition for data sharing prior to publication of conclusions.
  42. The Ocean Cruise Model: one really expensive cruise, many data collectors, shared data.
     (DeepDOM – photo courtesy E. Kujawinski, WHOI)
  43. Sage Bionetworks / the "walled garden"
     • Collaborative data sharing policy with restricted access to outsiders;
     • Central platform with analysis provenance tracking;
     • A model for the future of biomedical research?
     See, e.g., "Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas," Omberg et al., 2014.
  44. Distributed cyberinfrastructure to encourage sharing?
     [Architecture diagram: a web interface + API in front of a compute server (Galaxy? Arvados?) with a graph query layer over data/info and raw data sets; instances run as public servers, "walled garden" servers, and private servers; upload/submit to NCBI and KBase; import from MG-RAST, SRA, and EBI.]
     ivory.idyll.org/blog/2014-moore-ddd-talk.html
  45.–46. Better metadata collection is needed! Suppose the NSA could EITHER track who was calling whom, OR what they were saying – which would be more valuable?
     [Diagram: who? ↔ what? ↔ who?]
  47. Better metadata collection is needed!
     We need to track sample origin, phenotype/environmental conditions, etc.
     [Diagram: sample information ↔ the -omic data ↔ phenotype]
     This will facilitate discovery, serendipity, re-analysis, and cross-validation.
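For concreteness, here is a hypothetical minimal metadata record of the kind this slide argues for, sketched as JSON-serializable Python. Every field name and value below is invented for illustration; a real project would follow a community standard such as MIxS rather than roll its own schema:

```python
# Hypothetical sample-metadata record tying the -omic data files back to
# sample origin, phenotype, and handling conditions. All fields and values
# are illustrative only.
import json

sample_record = {
    "sample_id": "TUMOR-0001-C042",
    "organism": "Homo sapiens",
    "collection_date": "2014-06-03",
    "tissue": "breast tumor, single cell",
    "phenotype": {"diagnosis": "ductal carcinoma", "stage": "II"},
    "handling": {"storage": "-80C", "prep": "whole-genome amplification"},
    "sequencing": {"platform": "Illumina", "coverage": "40x haploid"},
    "data_files": [
        "TUMOR-0001-C042_R1.fastq.gz",
        "TUMOR-0001-C042_R2.fastq.gz",
    ],
}

print(json.dumps(sample_record, indent=2))
```

Keeping records like this alongside the raw reads is what makes the re-analysis and cross-validation mentioned on this slide possible.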
  48. Data and software citation
     There are now methods for:
     • assigning DOIs to data, which makes it citable (figshare, Dryad);
     • data publications (GigaScience, SIGS, Scientific Data);
     • software citation (Zenodo, Mozilla Science Lab / GitHub);
     • software publications (F1000Research).
     Will this address the need to incentivize sharing of data and methods? Probably not, but it’s a good start ;)
  49. 4. Exploratory data analysis
     Old model: gather data → interpret data
  50. New model: your data is most useful when combined with everyone else’s.
     [Diagram: gather data → interpret data, drawing on many databases of other people's data]
  51. Given enough publicly accessible data…
     [Same diagram, minus the data-gathering step: interpretation draws directly on many databases of other people's data]
  52. But: we face a lack of training.
     The lack of training in data science is the biggest challenge facing biology.
     Students! There’s a great future in data analysis!
     Also see:
  53. Data integration? Once you have all the data, what do you do?
     "Business as usual simply cannot work." Looking at millions to billions of genomes. (David Haussler, 2014)
     Illumina estimate: 228,000 human genomes will be sequenced in 2014, mostly by researchers.
     http://www.technologyreview.com/news/531091/emtech-illumina-says-228000-human-genomes-will-be-sequenced-this-year/
  54. Looking to the future
     For the senior scientists and funders amongst us:
     • How do we incentivize data sharing, and training?
     • How do we fund the meso- and micro-scale cyberinfrastructure development that will accelerate biology?
     The NIH and NSF are exploring this; the Moore and Sloan foundations are simply doing it (at 1% of the size).
     See: ivory.idyll.org/blog/2014-nih-adds-up-meeting.html
  55. Thanks for listening!
  56. For Australian students and early career researchers in bioinformatics and computational biology:
     Annual Student Symposium, Friday 28th November 2014, Parkville, Victoria
     Now accepting abstracts for talks and posters; talk abstracts close 31st October
     combine.org.au
