2013 alumni-webinar


Published in: Technology
  • This image depicts numerous lymphoma aggregates in the liver.
  • Figure 6. IPA Pathway analysis for significantly expressed genes that are Meq-dependent and involved in resistance to MD (A) and MD susceptibility (B). P-value < 0.05 and FDR < 0.05 were used as thresholds to select significant canonical pathways.
  • Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  • Larvae/stream bottoms 3-6 years; parasitic adult -> Great Lakes, 12-20 months feeding. 5-8 years. 40 lbs of fish per life as parasite. 98% of fish in the Great Lakes went away!

    1. 1. I’ve got the Big Data Blues C. Titus Brown ctb@msu.edu Microbiology, Computer Science, and BEACON
    2. 2. Outline 1. Genetics 101 and 102 - what you need to know. 2. Marek’s Disease – chicken cancer. 3. Generating lots of data – the sequencing revolution. 4. The problems of data analysis and data integration. 5. Some preliminary results on Marek’s Disease. 6. An apparent digression: chess and computers. 7. My actual research :)
    3. 3. Genetics 101: DNA to RNA to protein to phenotype… Genome (DNA) Transcripts (Genes; RNA) Proteins (Amino acids) Animal http://commons.wikimedia.org/wiki/File:Spombe_Pop2p_protein_structure_rainbow.png; http://commons.wikimedia.org/wiki/File:Protein_CA2_PDB_12ca.png
    4. 4. …plus diploidy (2x each chromosome) Genome (DNA) Transcripts (Genes; RNA) Proteins (Amino acids) Animal GT A C
    5. 5. …plus regulation and interaction. Genome (DNA) Transcripts (Genes; RNA) Proteins (Amino acids) Animal GT A C Regulation Interaction
    7. 7. Herpesvirus and Cancer • Epstein-Barr Virus – Burkitt’s lymphoma – Hodgkin’s lymphoma – Nasopharyngeal carcinoma • Herpes Virus-8 – Kaposi’s sarcoma – Multicentric lymphoma • Mardivirus – Marek’s Disease • Viral neoplastic disease • Alpha-herpesvirus • Model for Burkitt’s lymphoma (slide courtesy Suga Subramanian)
    8. 8. Clinical Signs Asymmetric Paralysis http://partnersah.vet.cornell.edu/avian-atlas/
    9. 9. Visceral Lymphoma Liver NORMAL LYMPHOMA Courtesy: John Dunn, USDA
    10. 10. Importance of Marek’s Disease • Agricultural Impact – Economic losses (2 billion) – Viral evolution: Increased virulence – Current Vaccines: Not enough – Long term viral persistence • Model System – Human herpes viral infections – Viral induced lymphoma (slide courtesy Suga Subramanian)
    12. 12. What happens when we infect? Genome (DNA) Transcripts (Genes; RNA) Proteins (Amino acids) Animal GT A C Regulation Interaction Infect with virus ?
    13. 13. …how does the virus specifically interact with genes? Genome (DNA) Transcripts (Genes; RNA) Proteins (Amino acids) Animal GT A C Regulation Interaction Infect with virus ? Mechanism of regulation?
    14. 14. …and what are the mechanisms of resistance? Genome (DNA) Transcripts (Genes; RNA) Proteins (Amino acids) Animal GT A C Regulation Interaction Infect with virus ? Mechanism of resistance?
    15. 15. Digression: DNA sequencing • Observation of actual DNA sequence • Counting of molecules Image: Werner Van Belle
    16. 16. Fast, cheap, and easy to generate. Image: Werner Van Belle
    17. 17. Applying sequencing to Marek’s Disease Genome (DNA) Transcripts (Genes; RNA) Proteins (Amino acids) Animal GT A C Regulation Interaction SEQUENCING
    18. 18. Differentially expressed genes (DEG) due to infection Gene GO Analysis, IPA Pathway Analysis DEGs in Md5-infected and not in Md5ΔMeq-infected groups YES NO Meq-dependent DEGs DEGs not dependent on Meq DEGs in Line 6 and not in Line 7 DEGs in Line 7 and not in Line 6 YES NO NO YES Meq-dependent DEGs involved in MD resistance Meq-dependent DEGs involved in MD susceptibility Meq-dependent DEGs common to both lines Back to Marek’s disease: (slide courtesy Suga Subramanian)
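The decision tree on this slide amounts to a few set operations. The sketch below is a minimal illustration and not the analysis pipeline behind the slides: the gene names are hypothetical placeholders, and the function names `meq_dependent` and `partition_by_line` are invented for this example.

```python
def meq_dependent(md5_degs, delta_meq_degs):
    """DEGs seen with Md5 infection but not with Md5ΔMeq infection."""
    return set(md5_degs) - set(delta_meq_degs)


def partition_by_line(dep_line6, dep_line7):
    """Split Meq-dependent DEGs by which chicken line they appear in."""
    return {
        "resistance": dep_line6 - dep_line7,      # Line 6 only
        "susceptibility": dep_line7 - dep_line6,  # Line 7 only
        "common": dep_line6 & dep_line7,          # both lines
    }


# Hypothetical DEG lists for each line and infection group.
line6_md5 = {"geneA", "geneB", "geneC", "geneD"}
line6_delta_meq = {"geneD"}
line7_md5 = {"geneB", "geneE", "geneF"}
line7_delta_meq = {"geneF"}

dep6 = meq_dependent(line6_md5, line6_delta_meq)
dep7 = meq_dependent(line7_md5, line7_delta_meq)
groups = partition_by_line(dep6, dep7)

for name, genes in groups.items():
    print(name, sorted(genes))
```

Each resulting group would then be passed to GO or pathway analysis, as the slide indicates.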
    19. 19. LINE 6 MD-RESISTANCE: ROLE OF MEQ MDV MDV-no Meq Genes involved in MD-resistance that are regulated by Meq Genes involved in MD-resistance that are not regulated by Meq 1031 1670 (slide courtesy Suga Subramanian)
    20. 20. Pathway Analysis: MD resistance (slide courtesy Suga Subramanian)
    21. 21. LINE 7 MD-SUSCEPTIBILITY: ROLE OF MEQ MDV MDV-no Meq Genes involved in MD-susceptibility that are regulated by Meq Genes involved in MD-susceptibility that are not regulated by Meq 650 540 (slide courtesy Suga Subramanian)
    22. 22. Pathway Analysis: MD susceptibility (slide courtesy Suga Subramanian)
    23. 23. Next problem: data analysis & integration! • Once you can generate virtually any data set you want… • …the next problem becomes finding your answer in the data set! • Think of it as a gigantic NSA treasure hunt: you know there are terrorists out there, but to find them you have to hunt through 1 bn phone calls a day…
    24. 24. Digression: “Heuristics” • What do computers do when the answer is either really, really hard to compute exactly, or actually impossible? • They approximate! Or guess! • The term “heuristic” refers to a guess, or shortcut procedure, that usually returns a pretty good answer.
    25. 25. Often explicit or implicit tradeoffs between compute “amount” and quality of result http://www.infernodevelopment.com/how-computer-chess-engines-think-minimax-tree
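The tradeoff between compute "amount" and result quality can be seen in a toy game-tree search. The sketch below is an illustration only and uses a much simpler game than chess: players alternately take 1 to 3 stones from a pile, and whoever takes the last stone wins. The search is cut off at a fixed depth, where it falls back on a heuristic guess of "even". The game, the function name `negamax`, and the node counter are all assumptions made for this example.

```python
def negamax(stones, depth, counter):
    """Score for the player to move: +1 forced win, -1 forced loss,
    0 when the depth limit is hit and the heuristic guesses 'even'."""
    counter[0] += 1                    # count positions examined
    if stones == 0:
        return -1                      # opponent took the last stone
    if depth == 0:
        return 0                       # heuristic cutoff: stop searching
    return max(-negamax(stones - take, depth - 1, counter)
               for take in (1, 2, 3) if take <= stones)


shallow_nodes = [0]
shallow = negamax(8, 2, shallow_nodes)   # cheap, but inconclusive

deep_nodes = [0]
deep = negamax(8, 8, deep_nodes)         # exhaustive, exact answer

print("depth 2:", shallow, "positions examined:", shallow_nodes[0])
print("depth 8:", deep, "positions examined:", deep_nodes[0])
```

The shallow search examines only a handful of positions and returns the uninformative guess, while the full-depth search examines many more and finds that the player to move from 8 stones loses against best play.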
    26. 26. My actual research focus What we do is think about ways to get computers to play chess better, by: – Identifying better ways to guess; – Speeding up the guessing process; – Improving people’s ability to use the chess playing computer Now, replace “play chess” with “analyze biological data”...
    27. 27. My actual research focus… We build tools that help experimental biologists work efficiently and correctly with large amounts of data, to help answer their scientific questions. This touches on many problems, including: • Computational and scientific correctness. • Computational efficiency. • Cultural divides between experimental biologists and computational scientists. • Lack of training (biology and medical curricula devoid of math and computing).
    28. 28. Not-so-secret sauce: “digital normalization” • One primary step of one type of data analysis becomes 20-200x faster, 20-150x “cheaper”.
    29. 29. http://en.wikipedia.org/wiki/JPEG Lossy compression
    30. 30. http://en.wikipedia.org/wiki/JPEG Lossy compression
    31. 31. http://en.wikipedia.org/wiki/JPEG Lossy compression
    32. 32. http://en.wikipedia.org/wiki/JPEG Lossy compression
    33. 33. http://en.wikipedia.org/wiki/JPEG Lossy compression
    34. 34. Raw data (~10-100 GB) Analysis "Information" ~1 GB "Information" "Information" "Information" "Information" Database & integration Restated: Can we use lossy compression approaches to make downstream analysis faster and better? (Yes.) ~2 GB – 2 TB of single-chassis RAM
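The idea behind digital normalization can be sketched in a few lines: stream through the reads, estimate how often each read's k-mers have already been seen, and discard reads whose median k-mer count has reached a cutoff. This is a simplified sketch, not the lab's software: it counts k-mers exactly in a Python `Counter` (a production tool would need a far more memory-efficient structure), ignores reverse complements, and uses toy values for `k` and `cutoff`.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only while its median k-mer count is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue                      # read is shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)     # only kept reads add coverage
    return kept


# Toy data: one sequence sampled five times, another sampled once.
common = "ACGTTGCAAG"
rare = "TTGGCCAATT"
reads = [common] * 5 + [rare]

kept = digital_normalization(reads)
print(len(reads), "reads in,", len(kept), "reads kept")
```

The compression is lossy in the sense of the JPEG slides: redundant copies of the heavily sampled sequence are thrown away, while the rarely sampled sequence is retained.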
    35. 35. Some diginorm examples: 1. Assembly of the H. contortus parasitic nematode genome. 2. Assembly of two Midwest soil metagenomes, Iowa corn and Iowa prairie. 3. Reference-free assembly of the lamprey (P. marinus) transcriptome.
    36. 36. 1. The H. contortus problem • A sheep parasite. • ~350 Mbp genome • Sequenced DNA from 6 individuals after whole genome amplification, estimated 10% heterozygosity (!?) • Significant bacterial contamination. (w/Robin Gasser, Paul Sternberg, and Erich Schwarz)
    37. 37. H. contortus life cycle Refs.: Nikolaou and Gasser (2006), Int. J. Parasitol. 36, 859-868; Prichard and Geary (2008), Nature 452, 157-158.
    38. 38. Assembly after digital normalization • Diginorm readily enabled assembly of a 404 Mbp genome with N50 of 15.6 kb; • Post-processing led to 73-94% complete genome. • Diginorm helped by making analysis possible. – Highly variable population. – Lots of contamination from microbes.
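For readers unfamiliar with the N50 statistic quoted on this slide: it is the contig length at which contigs of that length or longer account for at least half of the total assembly. A minimal sketch of the calculation follows; the contig lengths are made-up illustrative values, not data from the H. contortus assembly.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Illustrative contig lengths in kb (total 54, so half is 27).
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))  # 10 + 9 + 8 = 27 reaches half, so N50 is 8
```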
    39. 39. Next steps with H. contortus • Publish the genome paper  • Identification of antibiotic targets for treatment in agricultural settings (animal husbandry). • Serving as “reference approach” for a wide variety of parasitic nematodes, many of which have similar genomic issues.
    40. 40. 2. Soil metagenome assembly
    41. 41. A “Grand Challenge” dataset (DOE/JGI) [Bar chart: basepairs of sequencing (Gbp) per soil sample, by platform (GAII, HiSeq). Samples: Iowa, Continuous corn; Iowa, Native Prairie; Kansas, Cultivated corn; Kansas, Native Prairie; Wisconsin, Continuous corn; Wisconsin, Native Prairie; Wisconsin, Restored Prairie; Wisconsin, Switchgrass. Reference values: Rumen (Hess et al., 2011), 268 Gbp; Rumen K-mer Filtered, 111 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp.] Total: 1,846 Gbp soil metagenome
    42. 42. Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes) Columns: Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding. Row 1: 2.5 bill | 4.5 mill | 19% | 5.3 mill. Row 2: 3.5 bill | 5.9 mill | 22% | 6.8 mill. Putting it in perspective: Total equivalent of ~1200 bacterial genomes; Human genome ~3 billion bp. Adina Howe
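The "putting it in perspective" figures can be checked with simple arithmetic. The sketch below assumes that the "~1200 bacterial genomes" comparison refers to the two assemblies combined; the slide does not say so explicitly, and the variable names are chosen only for this example.

```python
# Total assembly sizes from the slide, in base pairs.
assembly_1_bp = 2_500_000_000
assembly_2_bp = 3_500_000_000
total_bp = assembly_1_bp + assembly_2_bp

# Assumption: ~1200 bacterial genomes describes the combined assemblies.
implied_bacterial_genome_bp = total_bp / 1200

# Human genome size as stated on the slide (~3 billion bp).
human_genome_equivalents = total_bp / 3_000_000_000

print(f"combined assembly: {total_bp / 1e9:.1f} Gbp")
print(f"implied average bacterial genome: {implied_bacterial_genome_bp / 1e6:.1f} Mbp")
print(f"human genome equivalents: {human_genome_equivalents:.1f}")
```

Under that assumption, the comparison implies an average bacterial genome of about 5 Mbp, and the combined assembly is about twice the size of the human genome.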
    43. 43. 3. Sea lamprey gene expression • Non-native • Parasite of medium to large fishes • Caused populations of host fishes to crash Li Lab / Y-W C-D
    44. 44. Transcriptome results • Started with 5.1 billion reads from 50 different tissues. (4 years of computational research, and about 1 month of compute time, GO HERE) • Final assembly contains ~95% of genes (est.) • This is an extra 40% over previous work. • Enabling studies in: – Basal vertebrate phylogeny – Biliary atresia – Evolutionary origin of brown fat (previously thought to be mammalian only!) – J Exp Biol. 2013 – Pheromonal response in adults
    45. 45. What are the tissue level changes in gene expression that support regeneration? Transcriptome analysis of a regenerating vertebrate after SCI brain spinal cord RNA-Seq to determine differential expression profile after injury Sampling >weekly -/+ Dex Ona Bloom
    46. 46. Challenges ahead • We need more people working at the interface – “Priesthood” model doesn’t scale! – Cultural shifts in biology needed… • We need more data! – Data often only makes sense in context of other data – This is a hard sell: “if you give us 1000x as much data, we might start to develop some idea of what it means.” • We actually know very little about biology still!
    47. 47. Open science & sharing • Science, and biology in particular, is in the middle of a transition to a “data intensive” field. • The sharing ethos is not incentivized properly; you get more credit for discovering new stuff than for discoveries resulting from sharing. • We are focused on sharing: methods, programs, educational materials…
    48. 48. Being disruptive? Possible initiative from my lab: “We will analyze your data for you if we can make your data openly available in 1 yr.” Will it work, or sink like a stone? Ask me in a year 
    49. 49. MSU’s role in my research • MSU provides nice infrastructure, great administrative support, and a truly excellent community (students, profs, and other researchers). • MSU is also uniquely interdisciplinary in many ways; very few “hard” boundaries in biology research.
    50. 50. Credits • Marek’s Disease: Suga Subramanian and Hans Cheng (USDA) • Haemonchus: Erich Schwarz (Caltech/Cornell), Paul Sternberg (Caltech), Robin Gasser (U. Melbourne) • Lamprey: Weiming Li (MSU), Ona Bloom (Feinstein), Jen Morgan (MBL/Woods Hole) • Great Prairie: Jim Tiedje (MSU), Janet Jansson (LBL), Susanna Tringe (Joint Genome Inst.) Funding: MSU; USDA; NSF; NIH. Drop me a line – ctb@msu.edu