Job Talk Iowa State University Ag Bio Engineering

My job talk for my ISU Engineering interview for a "Big Data" position in March, 2014

Published in: Engineering

  1. 1. RIDING THE BIG DATA TIDAL WAVE IN MODERN MICROBIOLOGY IOWA STATE UNIVERSITY MARCH 12, 2014 Adina Howe, PhD
  2. 2. Outline of talk My multi-discipline career Biological sequencing: a game changer Research – computational focus: How to handle “big data” in biology Research – biological focus: The gut microbiome’s role in obesity? Future research: A flexible toolbox in a big playground
  3. 3. Background Purdue University, BSME, Mechanical Engineering Purdue University, MS, Environmental Engineering (Sustainability)
  4. 4. Background Purdue University, BSME, Mechanical Engineering Purdue University, MS, Environmental Engineering (Sustainability) University of Iowa, PhD, Environmental Engineering (Microbiology/Bioremediation)
  5. 5. Background Purdue University, BSME, Mechanical Engineering Purdue University, MS, Environmental Engineering (Sustainability) University of Iowa, PhD, Environmental Engineering (Microbiology/Bioremediation) Michigan State University NSF Postdoc Math and Biology Fellow (cross-training) Microbial Ecology (Jim Tiedje) Bioinformatics (Titus Brown)
  6. 6. Background Purdue University, BSME, Mechanical Engineering Purdue University, MS, Environmental Engineering (Sustainability) University of Iowa, PhD, Environmental Engineering (Microbiology/Bioremediation) Michigan State University NSF Postdoc Math and Biology Fellow (cross-training) Microbial Ecology (Jim Tiedje) Bioinformatics (Titus Brown) Computational Biologist Microbiology / Microbial Ecology
  7. 7. Our shared challenges Climate Change Energy Supply USGCRP 2009 www.alutiiq.com http://guardianlv.com/ Human Health An understanding of microbial ecology
  8. 8. Environmental continuum MICROBES IN ECOSYSTEMS NATURE AIR WATER SOIL MICROBIOMES HUMANS/ANIMAL ENGINEERED BIOREACTORS WASTEWATER
  9. 9. Understanding community dynamics • Who is there? • What are they doing? • How are they doing it? Kim Lewis, 2010
  10. 10. Gene / Genome Sequencing • Collect samples • Extract DNA • Sequence DNA • “Analyze” DNA to identify its content and origin Taxonomy (e.g., pathogenic E. coli) Function (e.g., degrades cellulose)
  11. 11. Cost of Sequencing Stein, Genome Biology, 2010 E. coli genome 4,500,000 bp ($4.5M, 1992) [Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990 to 2012]
  12. 12. Rapidly decreasing costs with NGS Sequencing Stein, Genome Biology, 2010 Next Generation Sequencing 4,500,000 bp (E. coli, $200, presently) [Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990 to 2012]
  13. 13. Effects of low cost sequencing… First free-living bacterium sequenced for billions of dollars and years of analysis Personal genome can be mapped in a few days for hundreds to a few thousand dollars
  14. 14. The experimental continuum Single Isolate Pure Culture Enrichment Mixed Cultures Natural systems
  15. 15. The era of big data in biology Stein, Genome Biology, 2010 Computational Hardware (doubling time 14 months) Sanger Sequencing (doubling time 19 months) NGS (Shotgun) Sequencing (doubling time 5 months) [Figure: disk storage, Mb/$, and DNA sequencing, Mbp per $ (both log scale), by year, 1990 to 2012]
  16. 16. Postdoc experience with data 2003-2008 Cumulative sequencing in PhD = 2000 bp 2008-2009 Postdoc Year 1 = 50 Gbp 2009-2010 Postdoc Year 2 = 450 Gbp
  17. 17. Flexibility towards embracing change. How to survive a data deluge? Experiment Design, Data Generation, Workflow / Tools, Data analysis, Applied Solutions
  18. 18. Reducing data volume: Assembly of Metagenomic Sequences MSU: C. Titus Brown and James Tiedje
  19. 19. de novo assembly Compresses dataset size significantly Improved data quality (longer sequences, gene order) Reference not necessary (novelty) Raw sequencing data (“reads”) Computational algorithms Informative genes / genome
  20. 20. Metagenome assembly…a scaling problem.
  21. 21. Shotgun sequencing and de novo assembly It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
  22. 22. Practical Challenges – Intensive computing Howe et al, 2014, PNAS Months of “computer crunching” on a super computer
  23. 23. Practical Challenges – Intensive computing Howe et al, 2014, PNAS Months of “computer crunching” on a super computer Assembly of 300 Gbp can be done with any assembly program in less than 14 GB RAM and less than 24 hours.
  24. 24. Natural community characteristics • Diverse • Many organisms (genomes)
  25. 25. Natural community characteristics • Diverse • Many organisms (genomes) • Variable abundance • Most abundant organisms, sampled more often • Assembly requires a minimum amount of sampling • More sequencing, more errors Sample 1x
  26. 26. Natural community characteristics • Diverse • Many organisms (genomes) • Variable abundance • Most abundant organisms, sampled more often • Assembly requires a minimum amount of sampling • More sequencing, more errors Sample 1x Sample 10x
  27. 27. Natural community characteristics • Diverse • Many organisms (genomes) • Variable abundance • Most abundant organisms, sampled more often • Assembly requires a minimum amount of sampling • More sequencing, more errors Sample 1x Sample 10x Overkill
  28. 28. Digital normalization Brown et al., 2012, arXiv Howe et al., PNAS, 2014
  29. 29. Digital normalization Brown et al., 2012, arXiv Howe et al., PNAS, 2014
  30. 30. Digital normalization Brown et al., 2012, arXiv Howe et al., PNAS, 2014
  31. 31. Digital normalization Brown et al., 2012, arXiv Howe et al., PNAS, 2014
  32. 32. Digital normalization Brown et al., 2012, arXiv Howe et al., PNAS, 2014
  33. 33. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS • Reduces datasets for assembly by up to 95% with the same assembly outputs. • Genomes, mRNA-seq, metagenomes (soils, gut, water)
  34. 34. Partitioning (khmer software) Pell et al, 2012, PNAS Howe et al., 2014, PNAS • Separates metagenomes by species • Parallel computing possible • Largest known published soil metagenome and assembly
  35. 35. Tackling Soil Biodiversity Source: Chuck Hane
  36. 36. Tackling Soil Biodiversity • Grand Challenge effort – 10% of soil biodiversity sampled • Incredible soil biodiversity (estimated to require 10 Tbp/sample) • “To boldly go where no man has gone before”: >60% Unknown [Figure: total KO count (0 to 400) by functional category, from amino acid metabolism to cell motility, for corn and prairie, corn only, and prairie only] Howe et al, 2014, PNAS
  37. 37. Big data combined with microbiology will change lives.
  38. 38. The health and stability of the gut microbiome (in response to diet change) University of Chicago: Daina Ringus, PhD & Eugene Chang, MD Experiment Design, Data Generation, Workflow / Tools, Data analysis, Applied Solutions
  39. 39. We are supraorganisms
  40. 40. Interactions between the microbiome and the environment Source: Zhao, 2013 Obesity Intestinal inflammation IBD Diet has a greater potential to shape the structure and function of the gut microbiome than host genetics. Direct influence on health state
  41. 41. How resilient is the microbiome? In mice, rapid recovery from long-term shift to obesity-inducing diet (2012) In humans, microbiome rapidly and reproducibly recovers within 2 days (2013)
  42. 42. Is the gut community going viral? Reyes et al, Nature Reviews Microbiology, 2012 Bacterial cells Bacterial cells infected with bacteriophage Viruses (Bacteriophage) • Vary by individual (Minot et al., 2011) • Altered by diet and co-vary with bacteria (Minot et al., 2011) • Long term stable (Minot et al., 2013) • Largely temperate (Reyes et al., 2013) Prophage Who is in the gut microbiome?
  43. 43. Is the gut community going viral? Reyes et al, Nature Reviews Microbiology, 2012
  44. 44. Is the gut community going viral? Reyes et al, Nature Reviews Microbiology, 2012
  45. 45. Is the gut community going viral? Reyes et al, Nature Reviews Microbiology, 2012
  46. 46. Research Questions • What are the impacts of different diets on gut microbiome response? • What are the impacts of viruses in the gut microbiome (rapid alteration and resilient response)? • Multidisciplinary approach combining • novel experimental targeting of both bacterial and viral communities • metagenomic-based sequencing to characterize community
  47. 47. Novel experimental design – targeted sampling of community fractions I. Total DNA (bacteria + prophage + viruses) TOT II. Virus-like particles (free-living viruses) VLP III. Induced prophage IND Separation by density Chemically separate Separation by size Microbiome through faecal matter (non-destructive sampling)
  48. 48. Two baseline diets (with a perturbation) Low-fat (LF) baseline diet Milk-fat (MF) baseline diet Age (wk) 4 5 6 7 8 9 10 11 12 13 14 Baseline, Diet Switch, Washout (Return to Baseline) Total community function: TOT metagenomic sequencing at weeks 8, 11, 14 Virome community function: VLP, IND metagenomic sequencing at weeks 8, 11, 14 Weight of mice and count of VLPs with microscopy Taxonomy analysis (only 16S rRNA gene) every week from week 8 – 14. LF / 10% Fat / Complex Carbs MF / 37% Fat / Simple Sugars MF LF MF LF Fecal Samples
  49. 49. Outcomes? Low-fat (LF) baseline diet Milk-fat (MF) baseline diet Age (wk) 4 5 6 7 8 9 10 11 12 13 14 Baseline, Diet Switch, Washout (Return to Baseline) LF / 10% Fat / Complex Carbs MF / 37% Fat / Simple Sugars MF LF MF LF Qualitative and Quantitative Measurements: Who is there? What are they doing? How much?
  50. 50. How does the community change over time? Measurements of gene abundance profile (200,000+ genes) reduced to a single distance measurement from the original community (ordination) [Figure: distance from baseline across Baseline, Intervention, and Washout for three possible outcomes: No Change, Altered-Recovery, Altered-Altered]
  51. 51. Rapid and resilient bacterial gut response after diet alteration [Figure: distance from baseline across Baseline, Intervention, and Washout; ***]
  52. 52. Diet-specific functional total community recovery (mostly bacterial) [Figure: distance from baseline (0.00 to 0.10) across Baseline, Diet Perturbed, and Washout; ***]
  53. 53. Free living viruses in MF baseline are significantly altered without recovery. [Figure: distance from baseline (0.0 to 0.3) across Baseline, Diet Perturbed, and Washout; ***]
  54. 54. Prophages in MF baseline are significantly altered without recovery. [Figure: distance from baseline (0.0 to 0.3) across Baseline, Diet Perturbed, and Washout]
  55. 55. “Combat Zone” as diets change Milk-fat baseline (MF) mice have contrasting bacterial and viral responses, in which there is not a rapid recovery of viral communities
  56. 56. Viral functions significantly changed during the milk fat baseline diet Decreases in Phage-related (p=0.01) Iron acquisition (p<0.01) Nucleotide metabolism (p=0.02) Carbohydrate metabolism (p=0.01) Motility and chemotaxis (p=0.03) Virulence and defense (p=0.03) [Figure: Phage, Iron, Nucleotide, Carbs, and Flagella functions across Baseline, Change, and Washout]
  57. 57. • Bacteroides (Bacteroidetes) • Clostridium (Firmicutes) • Eubacterium (Firmicutes) Significant decrease in genes associated with MF baseline viruses Ratio of Firmicutes and Bacteroidetes associated with obesity Turnbaugh, 2008 Bacteroides fragilis, Nutridesk.com C. difficile, Bioquell.ie National Geographic Turnbaugh, 2009
  58. 58. Viromes potentially critical in gut microbiome response. • Members of gut microbiome community do not have co-occurring responses. • Loss of viral population and diversity is diet specific (related to a milk-fat to low-fat diet transition)
  59. 59. Ability of viruses to redirect structure and function of microbiome makes them pivotal drivers of health and disease Reyes et al, Nature Reviews Microbiology, 2012
  60. 60. Virome directly causes host response Germ-free 11-week-old mice (n = 3) Diet: Standard chow 3-week conventionalization A “standard control” Microbiome: Uniform cecal content of standard chow mice Experimentally introduced viruses Mouse Treatment 1: Low-fat baseline VLP Mouse Treatment 2: Milk-fat baseline VLP Control: Buffer
  61. 61. Significant decrease of intestinal inflammation in LF VLP treatments Pro-inflammatory cytokines in mucosal scrapings: TNF-α, IFN-γ [Figure: proximal colon TNF-alpha (ng/gl) and IFN-gamma (ng/g) for Control, LF VLPs, and MF VLPs; *]
  62. 62. Conclusions • Gut microbiome has reproducible and distinct responses to diet. • Viruses have a unique response to diet perturbations and do not co-occur with bacteria. • Viruses observed to cause inflammation in infected germ-free mice. • Big data workflow enabled strategic sampling design providing unparalleled access to viruses of gut microbiome
  63. 63. Future work
  64. 64. Data-discovery is a national investment.
  65. 65. Data-driven biological investigations MICROBES IN ECOSYSTEMS NATURE WATER SOIL MICROBIOMES HUMANS/ANIMAL ENGINEERED WASTEWATER High Throughput Frameworks: Metagenomic Metatranscriptomic Metaproteomic More relevant model systems Improved biomarkers Scaling approaches Big data computation Data driven discovery
  66. 66. Core research values • Research that matters • Developing scientific frameworks that enable open-science initiatives (reproducible science) • Computational and experimental integration • Scale and power to multi-disciplinary approaches • Team value • Flexibility
  67. 67. Going viral: The role of the human gut phageome in inflammatory bowel disease Objectives: • Define and compare core phageomes associated with healthy and diseased gut microbiomes • Determine impact of disease-associated gut phageomes on development of disease in knockout mouse models (predisposed to disease) NIH, National Institute of Diabetes and Digestive and Kidney Diseases; National Institute of Allergy and Infectious Diseases ($3-5M) Source: Nature.com What is the role of host-phage dynamics in the development of intestinal diseases? Integration of multiple datasets Improved model systems and biomarkers
  68. 68. Microbial drivers of carbon metabolism and warming DOE Biological and Environmental Research ($3M/3 years, 40% PI with ISU Kirsten Hofmockel, 2013-2016) Source: Oak Ridge National Laboratory Contributions: • Omic-based characterization of carbon cycling microorganisms in the soil • Novel approaches to target carbon cycling subsets of community • Improved soil genomic databases to enable future carbon studies How do microbes contribute to carbon cycling models? Big data scaling Integration of multiple datasets
  69. 69. Large-scale characterization of global dark matter proteins in complex biological environments NIH – Development of Software and Analysis Methods for Biomedical Big Data in Targeted Areas of High Need (~$1M/3 years) Gordon and Betty Moore – Data Driven Discovery Investigator Awards ($1.5M / 5 years) Novel extension of current software tools: • Integration of growing volumes of global public datasets with scalable data-mining analysis • Lightweight data architecture to compare abundance and co-occurrence of sequencing patterns across multiple samples and associated metadata to elucidate information How do we access the novelty observed in metagenomic datasets? Big data scaling Integration of datasets
  70. 70. From field to food: The origin and fate of our microbiomes USDA Agriculture and Food Research Initiative ($1-2.5M) • Identify and characterize under-researched foodborne microbial hazards and effective control strategies • Elucidate fate and dissemination of foodborne microbial hazards associated with produce production and processing Source: aboretum.umn.edu Where do harmful microbes in our food come from and how do we protect ourselves from them? Integration of multiple datasets Improved model systems and
  71. 71. Acknowledgements • Funding • DOE Microbial Carbon Cycling Grant • NSF Postdoc Fellowship, Great Lakes Bioenergy Research Center • Microbiome: University of Chicago Digestive Diseases Research Core Pilot and Feasibility Grant • My Awesome INTER-DISCIPLINARY Team • C. Titus Brown (MSU) + lab (Bioinformatics) • James Tiedje (MSU) + lab (Microbial Ecology) • Daina Ringus (UC) (Microbiology / Mice) • Kirsten Hofmockel, Ryan Williams, Fan Yang (ISU) • Eugene Chang (UC) • Folker Meyer (ANL)
  72. 72. Questions?
  73. 73. Reducing data, not information. More efficient data storage and mining. Big data scaling approaches
  74. 74. Storage of biological big data • What other sequences are connected to Sequence X? • Data broken into words of length “k” (k-mers) • Overlap (for assembly) = shared “word” Pell et al., 2012, PNAS; Howe et al., 2014, PNAS AGTCAGTT into its 4-mers: AGTC GTCA TCAG CAGT AGTT AGAAAGTC into its 4-mers: AGAA GAAA AAAG AAGT AGTC
  75. 75. Storage of biological big data • What other sequences are connected to Sequence X? • Data broken into words of length “k” (k-mers) • Overlap (for assembly) = shared “word” • How do we store “big data” words? • Bloom filter data structure • Efficient storage
  76. 76. Do I have mail? • What other sequences are connected to Sequence X? • Data broken into bins of word length “k” (k-mers) • Overlap (for assembly) = shared “word” • How do we store “big data” words? • Bloom filter data structure • Mailbox analogy A-G | H-R | S-Z Pell et al., 2012, PNAS; Howe et al., 2014, PNAS
  77. 77. Do I have mail? • Is Sequence A connected to Sequence B? • Data broken into bins of word length “k” (k-mers) • Overlap (for assembly) = shared “word” • How do we store “big data” words? • Bloom filter data structure • Mailbox analogy – Efficient storage of information Mailboxes: A-G | H-R | S-Z. A-G* | H-R | S-Z: No mail for Howe, 100% sure. A-G | H-R* | S-Z: Possibly mail for Howe. Pell et al., 2012, PNAS; Howe et al., 2014, PNAS
  78. 78. Do I have mail? • Is Sequence A connected to Sequence B? • Data broken into bins of word length “k” (k-mers) • Overlap (for assembly) = shared “word” • How do we store “big data” words? • Bloom filter data structure • Mailbox analogy – Efficient storage of information Mailboxes: A-G | H-R | S-Z. A-G | H-R* | S-Z / G-N* | A-F; O-T | U-Z / D-H* | A-C; I-O | P-Z. Howe mail status: Mail possibility higher.
  79. 79. Do I have mail? • Is Sequence A connected to Sequence B? • Data broken into bins of word length “k” (k-mers) • Overlap (for assembly) = shared “word” • How do we store “big data” words? • Bloom filter data structure • Mailbox analogy – Efficient storage of information Mailboxes: A-G | H-R | S-Z. A-G | H-R* | S-Z / G-N* | A-F; O-T | U-Z / D-H | A-C; I-O | P-Z. Howe mail status: No mail, 100% sure.
  80. 80. Bloom filter data structure • “Probabilistic” data structure • Decrease of false positive rate with multiple Bloom filters – “More likely I have mail” • No false negatives – “No mail. 100% sure” • For the win: both detects and counts presence of sequences (k-mers) and their connectivity efficiently • Is sequence A connected to sequence B? Pell et al., 2012, PNAS; Howe et al., 2014, PNAS
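
A minimal Python sketch of the k-mer and Bloom filter ideas on slides 74 to 80, offered as an illustration and not as the khmer implementation. The helper names (`kmers`, `BloomFilter`, `possibly_connected`), the table size, and the salted SHA-256 hashing are assumptions made for this example. It uses the textbook form of a Bloom filter (several hash positions in one bit table) in place of the several separately binned mailboxes pictured on the slides.

```python
import hashlib


def kmers(sequence, k):
    """Return every overlapping word of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


class BloomFilter:
    """Toy Bloom filter: each item sets several hashed positions in one bit table."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits        # illustrative size, not a tuned value
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting a SHA-256 hash (an assumption
        # for this sketch; real tools use much faster hash functions).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # False means "no mail, 100% sure"; True only means "possibly mail".
        return all(self.bits[pos] for pos in self._positions(item))


def possibly_connected(seq_a, seq_b, k=4):
    """Store seq_a's k-mers, then list seq_b's k-mers that may also occur in seq_a."""
    bloom = BloomFilter()
    for word in kmers(seq_a, k):
        bloom.add(word)
    return [word for word in kmers(seq_b, k) if word in bloom]


if __name__ == "__main__":
    print(kmers("AGTCAGTT", 4))
    print(possibly_connected("AGTCAGTT", "AGAAAGTC"))
```

With the two sequences from slide 74, the shared 4-mer AGTC is always reported, because a Bloom filter never gives a false negative. Any other word in the result would be a false positive, which becomes less likely as the bit table gets larger.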

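
A simplified Python sketch of the idea behind digital normalization (slides 24 to 33): once the k-mers in a read have already been seen often enough, the read adds little for assembly and can be dropped. It follows the median k-mer count rule described in the cited Brown et al. preprint, but the function names, the exact `Counter` (a stand-in for a memory-efficient probabilistic counter), and the values of `k` and `cutoff` are illustrative assumptions, not the published settings.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    """Return every overlapping word of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def digital_normalize(reads, k=4, cutoff=2):
    """Keep a read only while the median count of its k-mers is below cutoff."""
    counts = Counter()  # exact counts; an assumption made to keep the sketch short
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k, nothing to count
        # The median k-mer count acts as an estimate of how well this
        # read's region has already been sampled.
        if median(counts[word] for word in words) < cutoff:
            kept.append(read)
            counts.update(words)  # only kept reads contribute to the counts
    return kept


if __name__ == "__main__":
    reads = ["AGTCAGTT"] * 10 + ["CCCCGGGG"]
    print(digital_normalize(reads))
```

In this toy input, the heavily sampled read is thinned while the rare read is retained, which is the sense in which the data volume shrinks without losing the low-abundance members of the community.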