Trends In Genomics

The flood of next-gen sequencing data is changing the landscape of computational biology, increasing the need for more robust infrastructure, tools, and visualization techniques.

  • With the publication of the genomes of Craig Venter and Jim Watson, and with many additional human genomes being sequenced, the era of personal genomics is here. We are going to need really good tools to take advantage of this flood of data. My goal today is to share our experience building tools to understand the variation within a single individual’s genome, and to extrapolate forward to what we will need to understand larger collections of genomes.
  • Search formats:
      * A chromosome or sequence id followed by a start position and region length, e.g., "chr19:450000+100000" to display the region from 450000-550000 on chromosome 19
      * A dbSNP id, e.g., "rs2691286"
      * An Ensembl annotation identifier, e.g., "ENSG00000104783"
      * A gene name, e.g., "KLKB1", optionally followed by the amount of flanking sequence to display, e.g., "KLKB1^2000"
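The four query formats in the note above can be told apart with a few regular expressions. The following is a minimal sketch, not the HuRef Browser's actual code: the `Query` type, the `parse_query` function, and the exact patterns are assumptions made for illustration, and only the example strings come from the note.

```python
import re
from typing import NamedTuple, Optional


class Query(NamedTuple):
    """Hypothetical parsed form of a browser search string."""
    kind: str                    # "region", "dbsnp", "ensembl", or "gene"
    name: str                    # sequence id, identifier, or gene name
    start: Optional[int] = None  # region start (region queries only)
    end: Optional[int] = None    # region end = start + length
    flank: int = 0               # flanking bases requested (gene queries only)


def parse_query(text: str) -> Query:
    """Classify a search string into one of the four formats in the note."""
    text = text.strip()

    # Sequence id, start position, and region length: "chr19:450000+100000"
    m = re.fullmatch(r"(\w+):(\d+)\+(\d+)", text)
    if m:
        start, length = int(m.group(2)), int(m.group(3))
        return Query("region", m.group(1), start, start + length)

    # dbSNP id: "rs2691286"
    if re.fullmatch(r"rs\d+", text):
        return Query("dbsnp", text)

    # Ensembl identifier: "ENSG00000104783" (pattern is an assumption)
    if re.fullmatch(r"ENS[A-Z]*\d+", text):
        return Query("ensembl", text)

    # Gene name with optional flank: "KLKB1" or "KLKB1^2000"
    m = re.fullmatch(r"([A-Za-z0-9_.-]+)(?:\^(\d+))?", text)
    if m:
        return Query("gene", m.group(1), flank=int(m.group(2) or 0))

    raise ValueError(f"unrecognized query: {text!r}")


if __name__ == "__main__":
    for example in ("chr19:450000+100000", "rs2691286",
                    "ENSG00000104783", "KLKB1^2000"):
        print(example, "->", parse_query(example))
```

The order of the checks matters: the gene-name pattern is the loosest, so it is tried last.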
  • Zinc finger example: whole transcript ENST00000334564
  • Insert is 467 bp and truncates the protein. VNTR, probably heterozygous. Pink = non-synonymous; yellow = synonymous.
  • Trends In Genomics

    1. Trends in Genomics: An Engineer’s Perspective
       Saul A. Kravitz, PhD
       December 2009
    2. Biggest Change: Sequencing is free
       2000: Factory, AB3700 @ Celera
       - 1k 500bp reads/day/sequencer = 0.5Mbp/day
       - Human genome = ~190 sequencer yr, ~200M$
       2002: Factory, AB3730 @ JCVI
       - 10k 500bp reads/sequencer/day = 5Mbp/day
       - Human genome = ~19 sequencer yr, ~10M$
       2010: Benchtop, 454 GS Junior
       - 70M 500bp reads/day = 35Gbp/day
       - Human genome = ~1 sequencer day, ~10k$
       2010: Service, Complete Genomics
       - Human genome = ~1 day, ~1k$
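The throughput arithmetic on this slide can be checked with a short script. This is a sketch under stated assumptions: the read counts, read lengths, and sequencer times are taken from the slide as written, a year is taken as 365 days, and the "bases per genome" figure is derived here rather than stated anywhere in the talk. The function names are illustrative.

```python
# (label, reads per sequencer per day, read length in bp, sequencer-days per genome)
SLIDE_PLATFORMS = [
    ("2000 AB3700", 1_000, 500, 190 * 365),          # ~190 sequencer years
    ("2002 AB3730", 10_000, 500, 19 * 365),          # ~19 sequencer years
    ("2010 454 GS Junior", 70_000_000, 500, 1),      # ~1 sequencer day
]


def daily_throughput_bp(reads_per_day: int, read_length_bp: int) -> int:
    """Bases produced by one sequencer in one day."""
    return reads_per_day * read_length_bp


def implied_bases_per_genome(reads_per_day: int, read_length_bp: int,
                             sequencer_days: int) -> int:
    """Total bases implied by the quoted sequencer time (a derived figure)."""
    return daily_throughput_bp(reads_per_day, read_length_bp) * sequencer_days


if __name__ == "__main__":
    for label, reads, length, days in SLIDE_PLATFORMS:
        per_day = daily_throughput_bp(reads, length)
        total = implied_bases_per_genome(reads, length, days)
        print(f"{label}: {per_day / 1e6:,.1f} Mbp/day, "
              f"{total / 1e9:.1f} Gbp per genome")
```

All three rows imply roughly 35 Gbp of raw sequence per human genome, so the slide's sequencer-time figures are internally consistent with its per-day throughput figures.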
    3. New Bottlenecks
       Generating sequence data – free
       Data Management
       Data Query
       Data Analysis
       Breadth: Communities
       Depth: Populations (e.g., flu, human)
       Thinking is very pricey!
    4. Same Thinking $, More Data
       Project Cost
    5. The Crux of the Problem
       Genomic data interpreted in context
       How does my genome compare to all others?
       Which other proteins are similar to mine?
       Size of context is growing exponentially
       Growth is faster than Moore’s law
       Hard to fight an exponential
       BLASTP against NCBI NR
       All against all BLASTP of microbial proteins
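To see why an exponentially growing context is "hard to fight", it helps to compare how the work grows against how compute grows. The sketch below is illustrative only: the doubling times are hypothetical parameters, not figures from the talk, and the function names are invented for this example. It contrasts a single query against a database (work linear in database size) with an all-against-all comparison (work quadratic in database size).

```python
def all_vs_all_work(db_size: int) -> int:
    """Number of distinct pairs in an all-against-all comparison."""
    return db_size * (db_size - 1) // 2


def growth_factor(years: float, doubling_years: float) -> float:
    """How much a quantity grows in `years` if it doubles every `doubling_years`."""
    return 2.0 ** (years / doubling_years)


def work_to_compute_ratio(years: float, data_doubling_years: float,
                          compute_doubling_years: float,
                          all_vs_all: bool = False) -> float:
    """Work divided by available compute, relative to year 0.

    A ratio above 1 means the job takes longer than it did at year 0 even
    on newer hardware. For large databases the pair count grows roughly as
    the square of the database size, so all-against-all work is modelled
    as data growth squared.
    """
    data = growth_factor(years, data_doubling_years)
    compute = growth_factor(years, compute_doubling_years)
    work = data ** 2 if all_vs_all else data
    return work / compute


if __name__ == "__main__":
    # Hypothetical doubling times: data every 1 year, compute every 2 years.
    for years in (2, 4, 6):
        single = work_to_compute_ratio(years, 1.0, 2.0)
        pairs = work_to_compute_ratio(years, 1.0, 2.0, all_vs_all=True)
        print(f"after {years} yr: query-vs-db x{single:.0f}, all-vs-all x{pairs:.0f}")
```

Under any assumption where data doubles faster than compute, both ratios grow without bound, and the all-against-all case grows much faster.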
    6. Bioinformatics Isn’t High Energy Physics
       Data inputs are changing rapidly
       CE Chromatograms, 454 Flowgrams, Color Space
       Error models and read lengths are changing rapidly
       Tools evolving rapidly
       Difficult to track many academic tools
       High quality commercial platforms emerge
       Even when “cooks” use shared “ingredients”, “recipes” vary widely
       Faith based science
       My dataset alone has limited value
       Computations are (relatively) IO Intensive
    7. Some Solutions and Directions
       Repeated processes must be automated
       Even if labor is free, deviations from SOP are costly
       Commercial Tools
       Market has expanded, quality improved
       Tools for exploring Human Variation
       The HuRef Browser
       Metagenomics Tools and Challenges
       Global Ocean Sampling Expedition
       Visualization tools
       Metagenomic Annotation
       Genome Standards Consortium and M5
       Clouds and Grids
       ScaaS: Science as a Service
    8. Personal Genomics:
       The future is now (ca 2008)
    9. HuRef Browser: Accelerate thinking
       Compare 2 published genomes
       Craig Venter’s Diploid Genome
       Composite NCBI-36
       Are differences real?
       Noisy data?
       Assembly errors?
       Analysis errors?
       Methods development requires curation by biologists
       As genomes accumulate, more acute challenge
    10. HuRef Browser: http://huref.jcvi.org
    11. Zinc Finger Protein
        Chr19:57564487-57581356
        Transcript
        Gene
        Haplotype Blocks
        Variations
        NCBI-36
        Assembly-Assembly Mapping
        HuRef
        Assembly Structure
    12. Protein Truncated by 476 bp Insertion
        Heterozygous SNP
        Homozygous SNP
        Insertion
    13. Assembly Structure
        Insertion
    14. Genomics vs Metagenomics
        Genomics – ‘Old School’
        Study of a single organism's genome
        Genome sequence determined using shotgun sequencing and assembly
        >1300 microbes sequenced, first in 1995 (at TIGR)
        DNA usually obtained from pure cultures (<1%) or amplification of DNA from single cells
        Metagenomics
        Use genomics tricks on communities – no culturing
        Environmental shotgun sequencing of DNA or RNA
        Metadata provides context
    15. Metagenomic Questions
        Within an environment
        What biological functions are present (absent)?
        What organisms are present (absent)?
        Compare data from (dis)similar environments
        What are the fundamental rules of microbial ecology?
        Adapting to environmental conditions?
        How do communities respond to stimuli?
        How does community structure change?
        Search for novel proteins and protein families
        And diversity within known families
    16. Global Ocean Sampling Expedition
    17. Global Ocean Sampling Expedition
        - 178 Total Sampling Locations
        - Pilot: 2.0M reads, 4/04
        - Phase 1: 7.7M reads, >6M proteins, 3/07
        - Phase 2-IO: 2.2M reads, 3/08
        - Phase 2: ~30M reads, 2010?
        - Diverse Environments
        - Open ocean, estuary, embayment, upwelling, fringing reef, atoll…
    24. GOS: Sequence Diversity in the Ocean
        Rusch et al (PLoS Biology 2007)
        Most sequence reads are unique
        Very limited assembly
        Most sequences not taxonomically anchored
        Reference genomes a basis set? Not really.
        Several hundred isolates
        Challenges
        Relating shotgun data to reference genomes
        Structural and Functional Annotation
    25. Browsing Large Data Collections: Fragment Recruitment Viewer
        Microbial Communities vs Reference Genomes
        Millions of sequence reads vs Thousands of genomes
        Definition: A read is recruited to a sequence if:
        End-to-end blastN alignment exists
        Rapid Hypothesis Generation and Exploration
        How do cultured and wildtype genomes differ?
        Insertions, deletions, translocations
        Correlation with environmental factors
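The recruitment definition above (a read is recruited if an end-to-end alignment exists) can be sketched as a filter over alignment records. This is a simplified illustration, not the Fragment Recruitment Viewer's code: the `Hit` record, its field names, and the `slack` tolerance are assumptions. It also assumes read coordinates are 1-based with `read_start <= read_end`, while reference coordinates may be reversed for minus-strand hits. The output pairs genomic position with sequence similarity, the two axes the viewer plots.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    """One read-to-reference alignment (hypothetical record layout)."""
    read_id: str
    read_length: int
    read_start: int          # 1-based start of the aligned part of the read
    read_end: int            # 1-based end of the aligned part of the read
    ref_start: int           # alignment start on the reference genome
    ref_end: int             # alignment end (may be < ref_start on minus strand)
    percent_identity: float


def is_end_to_end(hit: Hit, slack: int = 0) -> bool:
    """True if the alignment covers the read from its first to its last base.

    `slack` is an assumed tolerance: up to that many unaligned bases are
    allowed at each end of the read.
    """
    return hit.read_start <= 1 + slack and hit.read_end >= hit.read_length - slack


def recruit(hits: List[Hit], slack: int = 0) -> List[Tuple[int, float]]:
    """Return (genomic position, percent identity) for each recruited read."""
    return [(min(h.ref_start, h.ref_end), h.percent_identity)
            for h in hits if is_end_to_end(h, slack)]


if __name__ == "__main__":
    hits = [
        Hit("r1", 100, 1, 100, 5000, 5099, 92.5),   # full-length, plus strand
        Hit("r2", 100, 1, 60, 7000, 7059, 99.0),    # partial: not recruited
        Hit("r3", 100, 1, 100, 9099, 9000, 80.0),   # full-length, minus strand
    ]
    for position, identity in recruit(hits):
        print(position, identity)
```

Plotting these points for millions of reads against one reference gives the position-versus-similarity scatter that makes insertions, deletions, and rearrangements visible as gaps or shifts in coverage.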
    26. Fragment Recruitment Viewer
        Sequence Similarity
        Genomic Position
        Doug Rusch, JCVI
    27. Doug Rusch and Michael Press
    28. Doug Rusch and Michael Press
    29. GOS Protein Analysis
        Yooseph et al (PLoS Biology 2007)
        Novel clustering process
        - Sequence similarity based
        - Predict putative proteins and group into related clusters
        - Include GOS and all known proteins
        Findings
        - GOS proteins
          - cover ~all existing prokaryotic families
          - expand diversity of known protein families
        - ~10% of large clusters are novel
        - Many are of viral origin
        - No saturation in the rate of novel protein family discovery
    36. Added Protein Family Diversity
        Yooseph et al (PLoS 2007)
        Rubisco homologs
        Known eukaryotes
        Known prokaryotes
        GOS prokaryotes
        New Groups
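The idea of grouping GOS and known proteins into similarity-based clusters, then asking which clusters are novel, can be sketched with a union-find structure. This is a deliberately simplified stand-in (single-linkage over precomputed similar pairs), not the clustering process published by Yooseph et al. The function names are illustrative, and treating a cluster as "novel" when it contains no previously known protein is an interpretation made for this example.

```python
from typing import Dict, Iterable, List, Set, Tuple


def cluster_by_similarity(ids: Iterable[str],
                          similar_pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Single-linkage clustering: proteins joined by any similar pair share a cluster."""
    ids = list(ids)
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: Dict[str, Set[str]] = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())


def novel_clusters(clusters: List[Set[str]], known_ids: Set[str]) -> List[Set[str]]:
    """Clusters that contain no previously known protein."""
    return [c for c in clusters if not (c & known_ids)]


if __name__ == "__main__":
    proteins = ["gos1", "gos2", "gos3", "known1"]
    pairs = [("gos1", "known1"), ("gos2", "gos3")]   # hypothetical similarity hits
    clusters = cluster_by_similarity(proteins, pairs)
    print(novel_clusters(clusters, known_ids={"known1"}))
```

In practice the similar pairs would come from a sequence comparison step, which is where nearly all of the computational cost lies.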
    37. Annotation of Environmental Shotgun Data
        Challenges:
        Lack of context
        Protein fragments
        Gene Finding
        Yooseph’s Protein Clusters + Metagene
        Functional Assignment
        Variation of JCVI prok annotation pipeline*
        Leverages protein cluster annotation -- soon
        Result:
        Quality Nearly Comparable to Prokaryotic Genomic Annotation
    38. Protein Clusters: Advantages and Disadvantages
        Weaknesses
        Homology-based
        Stateful (also a strength)
        Less sensitive (for now)
        Strengths
        Exponential → Linear?
        Learns over time
        Easy to maintain
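One way to read "Stateful" and "Learns over time" is that each new protein is compared against one representative per existing cluster rather than against every protein seen so far, so the cluster set accumulates knowledge between runs. The sketch below illustrates that reading only; it is not JCVI's pipeline. The `ClusterIndex` class, the k-mer Jaccard `similarity` function, and the 0.5 threshold are all hypothetical choices standing in for a real homology search.

```python
from typing import List, Set


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """All overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Toy similarity: Jaccard index of k-mer sets (a stand-in for real homology)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


class ClusterIndex:
    """Keeps one representative per cluster and assigns new sequences incrementally."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self.representatives: List[str] = []   # first sequence seen in each cluster
        self.members: List[List[str]] = []     # all sequences, per cluster
        self.comparisons = 0                   # similarity computations so far

    def add(self, seq: str) -> int:
        """Assign `seq` to the first matching cluster, or start a new one."""
        for idx, rep in enumerate(self.representatives):
            self.comparisons += 1
            if similarity(seq, rep) >= self.threshold:
                self.members[idx].append(seq)
                return idx
        self.representatives.append(seq)
        self.members.append([seq])
        return len(self.representatives) - 1


if __name__ == "__main__":
    index = ClusterIndex()
    for protein in ("MKTAYIAKQR", "MKTAYIAKQK", "GGGSSSPPPL"):
        print(protein, "-> cluster", index.add(protein))
    print("comparisons:", index.comparisons)
```

The cost per new sequence depends on the number of clusters rather than the number of sequences, which is the trade-off the slide lists: cheaper and easier to maintain, but homology-based and potentially less sensitive than a full search.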
    39. Increasing the pressure
        Nextgen + Metagenomics
        Deeper collections
        Short sequences → less informative
        How should we annotate?
        When in doubt, use BLAST against NRAA, and other large and fast-growing collections
        Annotation needs growing dramatically
        24x7 quality software
        Special Hardware: FPGA? Graphics/CUDA? SIMD/SSE?
        New algorithms?
        Back to supercomputers?
        Sharing data and computes
        Standardization of data, metadata, and computes
        Folker Meyer, ANL
    40. Science as a Service (ScaaS)
        Standard tools as services
        Service-Oriented Architecture
        Supported by HPC as necessary
        Grid workflow for integration
        Maintain tools & data in scalable compute environment
        Celera Assembler in the clouds
    41. Vision for High Throughput Science
        Today:
        Scientist
        Construction of the Ark. Nuremberg Chronicle (1493).
    42. Vision for High Throughput Science
        Engineers
        Scientist
        +
        http://freepages.genealogy.rootsweb.ancestry.com/~thegrove/gec2a.html
        Rodin’s Thinker
    43. Credits
        JCVI Informatics Team
        Support
        DOE
        Gordon and Betty Moore Foundation
        NIAID
