
2013 bms-retreat-talk

Published in: Technology

  1. 1. Data-intensive approaches to investigating non-model organisms C. Titus Brown ctb@msu.edu Assistant Professor Microbiology and Molecular Genetics; Computer Science and Engineering; BEACON; Quantitative Biology Initiative
  2. 2. Outline • My research! • Opportunities for computational science training • More unsolicited advice
  3. 3. Acknowledgements Lab members involved: • Adina Howe (w/Tiedje) • Jason Pell • Arend Hintze • Rosangela Canino-Koning • Qingpeng Zhang • Elijah Lowe • Likit Preeyanon • Jiarong Guo • Tim Brom • Kanchan Pavangadkar • Eric McDonald Collaborators: • Jim Tiedje, MSU • Erich Schwarz, Caltech / Cornell • Paul Sternberg, Caltech • Robin Gasser, U. Melbourne • Weiming Li • Hans Cheng Funding: USDA NIFA; NSF IOS; BEACON; NIH.
  4. 4. My interests I work primarily on organisms of agricultural, evolutionary, or ecological importance, which tend to have poor reference genomes and transcriptomes. Focus on: • Improving assembly sensitivity to better recover genomic/transcriptomic sequence, often from “weird” samples. • Scaling sequence assembly approaches so that huge assemblies are possible and big assemblies are straightforward. • “Better science through superior software”
  5. 5. There is quite a bit of life left to sequence & assemble. http://pacelab.colorado.edu/
  6. 6. “Weird” biological samples: • Single genome: hard to sequence DNA (e.g. GC/AT bias) • Transcriptome: differential expression! • High polymorphism data: multiple alleles • Whole genome amplified: often extreme amplification bias • Metagenome (mixed microbial community): differential abundance within community.
  7. 7. Single genome assembly is already challenging --
  8. 8. Once you start sequencing metagenomes…
  9. 9. DNA sequencing • Observation of actual DNA sequence • Counting of molecules Image: Werner Van Belle
  10. 10. Fast, cheap, and easy to generate. Image: Werner Van Belle
  11. 11. New problem: data analysis & integration! • Once you can generate virtually any data set you want… • …the next problem becomes finding your answer in the data set! • Think of it as a gigantic NSA treasure hunt: you know there are terrorists out there, but to find them you have to hunt through 1 bn phone calls a day…
  12. 12. “Heuristics” • What do computers do when the answer is either really, really hard to compute exactly, or actually impossible? • They approximate! Or guess! • The term “heuristic” refers to a guess, or shortcut procedure, that usually returns a pretty good answer.
  13. 13. Often explicit or implicit tradeoffs between compute “amount” and quality of result http://www.infernodevelopment.com/how-computer-chess-engines-think-minimax-tree
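The compute-versus-quality tradeoff can be illustrated with a depth-limited minimax search. This is a minimal sketch, not anything taken from the talk: the toy game (take 1 to 3 stones, whoever takes the last stone wins) and the function names `minimax` and `search` are assumptions chosen for brevity. A shallow search visits few positions but can only answer "don't know"; a full search visits many more and returns the exact answer.

```python
def minimax(stones, depth, counter):
    """Value for the player to move: +1 forced win, -1 forced loss,
    0 when the depth budget runs out (the heuristic "don't know")."""
    counter[0] += 1                      # count every position examined
    if stones == 0:
        return -1                        # opponent took the last stone
    if depth == 0:
        return 0                         # out of budget: guess
    best = -1
    for take in (1, 2, 3):
        if take <= stones:
            # The opponent's value, negated, is our value for this move.
            best = max(best, -minimax(stones - take, depth - 1, counter))
    return best


def search(stones, depth):
    """Return (value, positions_examined) for a given depth budget."""
    counter = [0]
    value = minimax(stones, depth, counter)
    return value, counter[0]


shallow_value, shallow_nodes = search(10, depth=2)
full_value, full_nodes = search(10, depth=10)
print(shallow_value, shallow_nodes)
print(full_value, full_nodes)
```

Raising `depth` buys a better answer at the cost of examining more positions, which is the tradeoff the slide describes.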
  14. 14. My actual research focus What we do is think about ways to get computers to play chess better, by: • Identifying better ways to guess; • Speeding up the guessing process; • Improving people’s ability to use the chess playing computer Now, replace “play chess” with “analyze biological data”...
  15. 15. My actual research focus… We build tools that help experimental biologists work efficiently and correctly with large amounts of data, to help answer their scientific questions. This touches on many problems, including: • Computational and scientific correctness. • Computational efficiency. • Cultural divides between experimental biologists and computational scientists. • Lack of training (biology and medical curricula devoid of math and computing).
  16. 16. Not-so-secret sauce: “digital normalization” • One primary step of one type of data analysis becomes 20-200x faster, 20-150x “cheaper”.
  17. 17. Approach: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
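The dilution arithmetic on this slide can be worked through in a few lines. This is a sketch under the slide's own numbers (A ten times as abundant as B, a 10x coverage target); the function name `required_coverage` is hypothetical.

```python
def required_coverage(abundances, target):
    """Coverage each component ends up with when sequencing deep enough
    for the rarest component to reach `target` coverage."""
    rarest = min(abundances.values())
    return {name: target * a / rarest for name, a in abundances.items()}


coverage = required_coverage({"A": 10, "B": 1}, target=10)
total = sum(coverage.values())
# Coverage of A beyond the target is what normalization could discard.
excess_fraction = (coverage["A"] - 10) / total
print(coverage, total, excess_fraction)
```

Under these assumptions most of the sequencing effort lands on the abundant component, far beyond what is needed to cover it.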
  18. 18-23. Digital normalization
  24. 24. Digital normalization approach A digital analog to cDNA library normalization, diginorm: • Is single pass: looks at each read only once; • Does not “collect” the majority of errors; • Keeps all low-coverage reads; • Smooths out coverage of regions.
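The properties listed above can be sketched as a single-pass filter: estimate each read's coverage from the median count of its k-mers seen so far, keep the read only if that estimate is below a cutoff, and count k-mers only from kept reads. This is a minimal illustration, not the lab's implementation: the exact in-memory `Counter`, the function names, and the values of `k` and `cutoff` are assumptions, and a real tool would use a memory-efficient counting structure.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass: each read is examined once, in order."""
    counts = Counter()   # k-mer counts from kept reads only
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue     # read shorter than k
        # Median k-mer count serves as the read's estimated coverage.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
        # Otherwise the read is discarded and its k-mers, including any
        # erroneous ones, are never added to the table.
    return kept


abundant = "ACGTTGCAAG"
rare = "GGATCCTTAA"
kept = digital_normalization([abundant] * 50 + [rare], k=4, cutoff=5)
print(len(kept))
```

In the toy input, the abundant read stops being kept once its estimated coverage reaches the cutoff, while the rare read is always kept, which matches "keeps all low-coverage reads" and "smooths out coverage".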
  25. 25-29. http://en.wikipedia.org/wiki/JPEG Lossy compression
  30. 30. [Diagram labels: raw data (~10-100 GB); analysis; "information" (~1 GB each, several sources); database & integration; ~2 GB – 2 TB of single-chassis RAM.] Restated: Can we use lossy compression approaches to make downstream analysis faster and better? (Yes.)
  31. 31. Soil metagenome assembly • Observation: 99% of microbes cannot easily be cultured in the lab. (“The great plate count anomaly”) • Many reasons why you can’t or don’t want to culture: • Syntrophic relationships • Niche-specificity or unknown physiology • Dormant microbes • Abundance within communities Single-cell sequencing & shotgun metagenomics are two common ways to investigate microbial communities.
  32. 32. Investigating soil microbial ecology • What ecosystem level functions are present, and how do microbes do them? • How does agricultural soil differ from native soil? • How does soil respond to climate perturbation? • Questions that are not easy to answer without shotgun sequencing: • What kind of strain-level heterogeneity is present in the population? • What does the phage and viral population look like? • What species are where?
  33. 33. SAMPLING LOCATIONS
  34. 34. A “Grand Challenge” dataset (DOE/JGI) [Bar chart: base pairs of sequencing (Gbp), axis 0-600, split by GAII and HiSeq, for eight soil samples: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass.] Total: 1,846 Gbp soil metagenome. Reference points: Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp.
  35. 35. Putting it in perspective: Total equivalent of ~1200 bacterial genomes. Human genome ~3 billion bp. Assembly results for Iowa corn and prairie (2 x ~300 Gbp soil metagenomes). Columns: total assembly | total contigs (> 300 bp) | % reads assembled | predicted protein coding. Values: 2.5 bill | 4.5 mill | 19% | 5.3 mill; 3.5 bill | 5.9 mill | 22% | 6.8 mill. (Adina Howe)
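The "~1200 bacterial genomes" figure is consistent with a simple back-of-the-envelope calculation, sketched here. The typical bacterial genome size of 5 Mbp is an assumption not stated on the slide, and the slide does not say how its figure was derived; the assembly sizes and human genome size are the slide's own numbers.

```python
# Total assembly sizes from the slide, in base pairs.
assembly_bp = [2.5e9, 3.5e9]

# ASSUMPTION: a typical bacterial genome is about 5 Mbp.
typical_bacterial_genome_bp = 5e6

# Human genome size as given on the slide.
human_genome_bp = 3e9

total_bp = sum(assembly_bp)
bacterial_equivalents = total_bp / typical_bacterial_genome_bp
human_equivalents = total_bp / human_genome_bp
print(bacterial_equivalents, human_equivalents)
```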
  36. 36. Strain variation? [Plot: top two allele frequencies vs. position within contig] Of the 5000 most abundant contigs, only 1 has a polymorphism rate > 5%. Can measure by read mapping.
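One way to measure this from mapped reads is sketched below. The input format (one string of observed bases per contig position, as from a pileup), the `min_minor` threshold, the function names, and the definition of polymorphism rate as the fraction of positions with a sufficiently frequent second allele are all assumptions for illustration, not the analysis behind the slide.

```python
from collections import Counter


def top_two_allele_freqs(column):
    """Frequencies of the two most common bases at one position."""
    counts = Counter(column).most_common(2)
    depth = len(column)
    major = counts[0][1] / depth
    minor = counts[1][1] / depth if len(counts) > 1 else 0.0
    return major, minor


def polymorphism_rate(columns, min_minor=0.2):
    """Fraction of positions whose second allele reaches min_minor.
    Positions with no mapped reads count as non-polymorphic."""
    variable = sum(
        1 for col in columns
        if col and top_two_allele_freqs(col)[1] >= min_minor
    )
    return variable / len(columns)


# Hypothetical pileup for a four-position contig, ten reads deep.
pileup = ["AAAAAAAAAA", "AAAAAAAGGG", "CCCCCCCCCC", "TTTTTTTTTC"]
rate = polymorphism_rate(pileup)
print(top_two_allele_freqs(pileup[1]), rate)
```

The low-frequency variant at the last position falls below the threshold, which is how a cutoff like this separates likely sequencing errors from candidate strain variation.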
  37. 37. Tentative observations from our soil samples: • We need 100x as much data… • Much of our sample may consist of phage. • Phylogeny varies more than functional predictions. • We see little to no strain variation within our samples • Not bulk soil -- • Very small, localized, and low coverage samples • We may be able to do selective, really deep sequencing and then infer the rest from 16S. • Implications for soil aggregate assembly?
  38. 38. I also work on… • Genome assembly & analysis • Transcriptome assembly and analysis • Interpretation of annoying large data sets
  39. 39. What are the tissue-level changes in gene expression that support regeneration? Transcriptome analysis of a regenerating vertebrate after SCI. [Figure labels: brain, spinal cord] RNA-Seq to determine differential expression profile after injury. Sampling > weekly, -/+ Dex. (Ona Bloom)
  40. 40. Training opportunities • PLB/MMG 810 (Shiu; ??) • CSE 801/Intro BEACON course (Brown; FS ‘13) “Intro to Computational Science for Evolutionary Biologists” • CSE 801 bootcamp (late Sep) • Software Carpentry bootcamp(s) (late Sep) • Workshops in Applied Bioinformatics (Buell; ‘14?) • Next-Gen Sequence Analysis Workshop (Brown; summer ‘14) + a variety of genomics courses that I can’t keep track of! Becky Mansel will have these slides.
  41. 41. Unsolicited advice Consider both faculty and non-faculty careers. • It’s a bad time to be looking for faculty positions, and it’s a bad time to be looking for funding; maybe this will improve in 10 years, maybe not. • A PhD qualifies you for many, many more things than we will (or can) tell you about! • Specific advice: • Network with industry folk; think beyond your advisor’s career. • Write a blog: ivory.idyll.org/blog/advice-to-scientists-on-blogging.html