Trends In Genomics

The flood of next-generation sequencing data is changing the landscape of computational biology, driving the need for more robust infrastructure, tools, and visualization techniques.

Published in: Business, Technology

  • With the publication of the genomes of Craig Venter and Jim Watson, and with many additional human genomes being sequenced, the era of personal genomics is here. We are going to need really good tools to take advantage of this flood of data. My goal today is to share our experience building tools to understand the variation within a single individual’s genome, and to extrapolate forward to what we will need to understand larger collections of genomes.
  • Query formats:
      - A chromosome or sequence id followed by a start position and region length, e.g., "chr19:450000+100000" to display the region 450000-550000 on chromosome 19
      - A dbSNP id, e.g., "rs2691286"
      - An Ensembl annotation identifier, e.g., "ENSG00000104783"
      - A gene name, e.g., "KLKB1", optionally followed by the amount of flanking sequence to display, e.g., "KLKB1^2000"
  • Zinc finger example: whole transcript ENST00000334564
  • Insert is 467 bp and truncates the protein. VNTR, probably heterozygous. Pink = non-synonymous; yellow = synonymous.
  • Transcript
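
The query formats listed in the notes above can be told apart with a few regular expressions. The following is a minimal sketch, not the HuRef Browser's actual parser: the `Query` type, the `parse_query` function, and the exact patterns (for example, that Ensembl ids start with `ENS`) are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    kind: str                    # "region", "dbsnp", "ensembl", or "gene"
    name: str                    # sequence id, rs id, Ensembl id, or gene name
    start: Optional[int] = None  # region start (regions only)
    end: Optional[int] = None    # start + length (regions only)
    flank: int = 0               # flanking bases requested (genes only)


def parse_query(text: str) -> Query:
    """Classify a query string into one of the four formats (hypothetical helper)."""
    text = text.strip()
    # Region: <sequence id>:<start>+<length>, e.g. chr19:450000+100000
    m = re.fullmatch(r"([^:\s]+):(\d+)\+(\d+)", text)
    if m:
        start, length = int(m.group(2)), int(m.group(3))
        return Query("region", m.group(1), start, start + length)
    # dbSNP id, e.g. rs2691286
    if re.fullmatch(r"rs\d+", text):
        return Query("dbsnp", text)
    # Ensembl identifier, e.g. ENSG00000104783 or ENST00000334564
    if re.fullmatch(r"ENS[A-Z]*\d+", text):
        return Query("ensembl", text)
    # Gene name with optional ^<flank>, e.g. KLKB1^2000
    m = re.fullmatch(r"([A-Za-z0-9._-]+)(?:\^(\d+))?", text)
    if m:
        return Query("gene", m.group(1), flank=int(m.group(2) or 0))
    raise ValueError(f"unrecognized query: {text!r}")
```

For instance, `parse_query("chr19:450000+100000")` yields a region on `chr19` from 450000 to 550000, matching the example in the notes.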

    • 1. Trends in Genomics: An Engineer’s Perspective
      Saul A. Kravitz, PhD
      December 2009
    • 2. Biggest Change: Sequencing is free
      2000: Factory, AB3700 @ Celera
      - 1k 500bp reads/sequencer/day = 0.5Mbp/day
      - Human Genome = ~ 190 sequencer yr, ~200M$
      2002: Factory, AB3730 @ JCVI
      - 10k 500bp reads/sequencer/day = 5Mbp/day
      - Human Genome = ~ 19 sequencer yr, ~10M$
      2010: Benchtop, 454 GS Junior
      - 70M 500bp reads/day = 35Gbp/day
      - Human genome = ~ 1 sequencer day, ~10k$
      2010: Service, Complete Genomics
      - Human genome = ~ 1 day, ~1k$
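
The throughput figures on this slide can be checked with simple arithmetic. The sketch below takes the slide's read counts and sequencer times at face value. The helper names are hypothetical, and the roughly 3 Gbp genome size used to interpret the totals is an assumption that does not appear on the slide.

```python
READ_LEN_BP = 500  # read length quoted on the slide


def mbp_per_day(reads_per_day: float, read_len_bp: int = READ_LEN_BP) -> float:
    """Raw sequence output in megabases per sequencer per day."""
    return reads_per_day * read_len_bp / 1e6


def total_gbp(mbp_day: float, sequencer_days: float) -> float:
    """Total raw sequence in gigabases for a given amount of sequencer time."""
    return mbp_day * sequencer_days / 1e3


# (reads per sequencer per day, sequencer-days per human genome), as on the slide
platforms = {
    "2000 AB3700": (1_000, 190 * 365),
    "2002 AB3730": (10_000, 19 * 365),
    "2010 454 GS Junior": (70_000_000, 1),
}

for name, (reads, days) in platforms.items():
    rate = mbp_per_day(reads)
    print(f"{name}: {rate:g} Mbp/day, {total_gbp(rate, days):.1f} Gbp per genome")
```

All three rows come out at about 35 Gbp of raw sequence per genome, which would be on the order of tenfold coverage if the genome is taken to be about 3 Gbp.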
    • 3. New Bottlenecks
      Generating sequence data – free
      Data Management
      Data Query
      Data Analysis
      Breadth: Communities
      Depth: Populations (e.g., flu, human)
      Thinking is very pricy!
    • 4. Same Thinking $, More Data
      Project Cost
    • 5. The Crux of the Problem
      Genomic data interpreted in context
      How does my genome compare to all others?
      Which other proteins are similar to mine?
      Size of context is growing exponentially
      Growth is faster than Moore’s law
      Hard to fight an exponential
      BLASTP against NCBI NR
      All against all BLASTP of microbial proteins
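
To see why an exponential that outpaces Moore's law is hard to fight, here is a rough sketch of how the cost of the two BLASTP workloads above changes over time. It assumes that searching a query against a database costs work proportional to the database size, that an all-against-all comparison costs work proportional to the square of that size, and that compute per dollar doubles at a fixed interval. The function name and the doubling times in the usage note are hypothetical, not figures from the talk.

```python
def relative_cost(years: float,
                  data_doubling_years: float,
                  compute_doubling_years: float,
                  all_vs_all: bool = False) -> float:
    """Cost in constant dollars relative to year 0 (illustrative model only).

    Work grows as N for one query against the database, or as N**2 for
    all-against-all, where N doubles every `data_doubling_years`.
    Compute per dollar doubles every `compute_doubling_years`.
    """
    work_exponent = (2 if all_vs_all else 1) * years / data_doubling_years
    hardware_exponent = years / compute_doubling_years
    return 2.0 ** (work_exponent - hardware_exponent)
```

With an assumed data doubling time of 1 year and a compute doubling time of 1.5 years, `relative_cost(6, 1.0, 1.5)` is 4 for a single search against the database, while `relative_cost(6, 1.0, 1.5, all_vs_all=True)` is 256.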
    • 6. Bioinformatics Isn’t High Energy Physics
      Data inputs are changing rapidly
      CE Chromatograms, 454 Flowgrams, Color Space
      Error models and read lengths are changing rapidly
      Tools evolving rapidly
      Difficult to track many academic tools
      High quality commercial platforms emerge
      Even when “cooks” use shared “ingredients”, “recipes” vary widely
      Faith-based science
      My dataset alone has limited value
      Computations are (relatively) IO Intensive
    • 7. Some Solutions and Directions
      Repeated processes must be automated
      Even if labor is free, deviations from SOP are costly
      Commercial Tools
      Market has expanded, quality improved
      Tools for exploring Human Variation
      The HuRef Browser
      Metagenomics Tools and Challenges
      Global Ocean Sampling Expedition
      Visualization tools
      Metagenomic Annotation
      Genome Standards Consortium and M5
      Clouds and Grids
      ScaaS: Science as a Service
    • 8. Personal Genomics:
      The future is now (ca 2008)
    • 9. HuRef Browser: Accelerate thinking
      Compare 2 published genomes
      Craig Venter’s Diploid Genome
      Composite NCBI-36
      Are differences real?
      Noisy data?
      Assembly errors?
      Analysis errors?
      Methods development requires curation by biologists
      As genomes accumulate, more acute challenge
    • 10. HuRef Browser:
    • 11. Zinc Finger Protein, Chr19:57564487-57581356
      Haplotype Blocks
      Assembly-Assembly Mapping
      Assembly Structure
    • 12. Protein Truncated by 476 bp Insertion
      Heterozygous SNP
      Homozygous SNP
    • 13. Assembly Structure
    • 14. Genomics vs Metagenomics
      Genomics – ‘Old School’
      Study of a single organism's genome
      Genome sequence determined using shotgun sequencing and assembly
      >1300 microbes sequenced, first in 1995 (at TIGR)
      DNA usually obtained from pure cultures (<1%) or amplification of DNA from single cells
      Use genomics tricks on communities – no culturing
      Environmental shotgun sequencing of DNA or RNA
      Metadata provides context
    • 15. Metagenomic Questions
      Within an environment
      What biological functions are present (absent)?
      What organisms are present (absent)?
      Compare data from (dis)similar environments
      What are the fundamental rules of microbial ecology?
      Adapting to environmental conditions?
      How do communities respond to stimuli?
      How does community structure change?
      Search for novel proteins and protein families
      And diversity within known families
    • 16. Global Ocean Sampling Expedition
    • 17. Global Ocean Sampling Expedition
      • 178 Total Sampling Locations
      • Pilot: 2.0M reads (4/04)
      • Phase 1: 7.7M reads, >6M proteins (3/07)
      • Phase 2-IO: 2.2M reads (3/08)
      • Phase 2: ~30M reads (2010?)
      • Diverse environments: open ocean, estuary, embayment, upwelling, fringing reef, atoll…
    • 24. GOS: Sequence Diversity in the Ocean. Rusch et al. (PLoS Biology 2007)
      Most sequence reads are unique
      Very limited assembly
      Most sequences not taxonomically anchored
      Reference genomes a basis set? Not really.
      Several hundred isolates
      Relating shotgun data to reference genomes
      Structural and Functional Annotation
    • 25. Browsing Large Data Collections: Fragment Recruitment Viewer
      Microbial Communities vs Reference Genomes
      Millions of sequence reads vs Thousands of genomes
      Definition: A read is recruited to a sequence if:
      End-to-end blastN alignment exists
      Rapid Hypothesis Generation and Exploration
      How do cultured and wildtype genomes differ?
      Insertions, deletion, translocations
      Correlation with environmental factors
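
The recruitment definition above can be sketched as a filter over BLASTN hits. This is an illustrative sketch, not the Fragment Recruitment Viewer's implementation. The `Hit` fields mirror columns of BLAST tabular output (1-based coordinates), while `is_recruited`, `recruitment_points`, and the `slack` tolerance are hypothetical names and parameters introduced here.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    read_id: str   # query (read) id
    ref_id: str    # subject (reference genome) id
    pident: float  # percent identity of the alignment
    qstart: int    # alignment start on the read
    qend: int      # alignment end on the read
    sstart: int    # alignment start on the reference
    send: int      # alignment end on the reference


def is_recruited(hit: Hit, read_len: int, slack: int = 0) -> bool:
    """True if the alignment covers the read end to end, within `slack` bases."""
    lo, hi = min(hit.qstart, hit.qend), max(hit.qstart, hit.qend)
    return lo <= 1 + slack and hi >= read_len - slack


def recruitment_points(hits: Iterable[Hit],
                       read_lengths: Dict[str, int],
                       slack: int = 0) -> List[Tuple[str, int, float]]:
    """Return (reference id, genomic position, percent identity) per recruited read.

    Keeps the highest-identity end-to-end hit for each (read, reference) pair.
    """
    best: Dict[Tuple[str, str], Hit] = {}
    for h in hits:
        if not is_recruited(h, read_lengths[h.read_id], slack):
            continue
        key = (h.read_id, h.ref_id)
        if key not in best or h.pident > best[key].pident:
            best[key] = h
    return sorted((h.ref_id, min(h.sstart, h.send), h.pident) for h in best.values())
```

Each returned tuple is one point for a plot of sequence similarity against genomic position; gaps in such a plot are one way to spot candidate insertions or deletions relative to the reference.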
    • 26. Fragment Recruitment Viewer
      Sequence Similarity
      Genomic Position
      Doug Rusch, JCVI
    • 27. Doug Rusch and Michael Press
    • 28. Doug Rusch and Michael Press
    • 29. GOS Protein Analysis. Yooseph et al. (PLoS Biology 2007)
      Novel clustering process
      • Sequence similarity based
      • Predict putative proteins and group into related clusters
      • Include GOS and all known proteins
      GOS proteins
      • Cover ~all existing prokaryotic families
      • Expand diversity of known protein families
      • ~10% of large clusters are novel
      • Many are of viral origin
      • No saturation in the rate of novel protein family discovery
    • Added Protein Family Diversity
      Yooseph et al. (PLoS Biology 2007)
      Rubisco homologs
      Known eukaryotes
      Known prokaryotes
      GOS prokaryotes
      New Groups
    • 37. Annotation of Environmental Shotgun Data
      Lack of context
      Protein fragments
      Gene Finding
      Yooseph’s Protein Clusters + Metagene
      Functional Assignment
      Variation of JCVI prok annotation pipeline*
      Leverages protein cluster annotation -- soon
      Quality Nearly Comparable to Prokaryotic Genomic Annotation
    • 38. Protein Clusters: Advantages and Disadvantages
      Stateful (also a strength)
      Less sensitive (for now)
      Exponential → Linear?
      Learns over time
      Easy to maintain
    • 39. Increasing the pressure
      Nextgen + Metagenomics
      Deeper collections
      Short sequences → less informative
      How should we annotate?
      When in doubt, use BLAST against NRAA, and other large and fast-growing collections
      Annotation needs growing dramatically
      24x7 quality software
      Special Hardware: FPGA? Graphics/CUDA? SIMD/SSE?
      New algorithms?
      Back to supercomputers?
      Sharing data and computes
      Standardization of data, metadata, and computes
      Folker Meyer, ANL
    • 40. Science as a Service (ScaaS)
      Standard tools as services
      Service-Oriented Architecture
      Supported by HPC as necessary
      Grid workflow for integration
      Maintain tools & data in scalable compute environment
      Celera Assembler in the clouds
    • 41. Vision for High Throughput Science
      Construction of the Ark. Nuremberg Chronicle (1493).
    • 42. Vision for High Throughput Science
      Rodin’s Thinker
    • 43. Credits
      JCVI Informatics Team
      Gordon and Betty Moore Foundation