Trends In Genomics

The flood of next-gen sequencing data is changing the landscape of computational biology, pushing the need for more robust infrastructures, tools, and visualization techniques.

  • With the publication of the genomes of Craig Venter and Jim Watson, and with many additional human genomes being sequenced, the era of personal genomics is here. We are going to need really good tools to take advantage of this flood of data. My goal today is to share our experience building tools to understand the variation within a single individual’s genome, and to try to extrapolate forward to what we will need to understand larger collections of genomes.
  • Search syntax:
    * A chromosome or sequence id followed by a start position and region length, e.g., "chr19:450000+100000" to display the region from 450000-550000 on chromosome 19.
    * A dbSNP id, e.g., "rs2691286".
    * An Ensembl annotation identifier, e.g., "ENSG00000104783".
    * A gene name, e.g., "KLKB1", optionally followed by the amount of flanking sequence to display, e.g., "KLKB1^2000".
  • Zinc Finger example whole transcript ENST00000334564
  • Insert is 467 bp and truncates the protein. VNTR, prob. heterozygous. Pink = non-synonymous, yellow = synonymous.
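The four query forms listed in the search-syntax note above can be told apart with a few regular expressions. The following is a minimal sketch, not the HuRef Browser's actual parser: the `Query` class, the `parse_query` function, and the exact patterns are all assumptions made for illustration, based only on the examples quoted in the note.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    """Hypothetical parsed form of a browser search string."""
    kind: str                    # "region", "dbsnp", "ensembl", or "gene"
    name: str                    # sequence id, rs id, Ensembl id, or gene symbol
    start: Optional[int] = None  # region start (region queries only)
    end: Optional[int] = None    # region end = start + length
    flank: Optional[int] = None  # flanking bases (gene queries only)


def parse_query(text: str) -> Query:
    """Classify a search string using patterns inferred from the examples."""
    text = text.strip()

    # "chr19:450000+100000": sequence id, start position, region length.
    m = re.fullmatch(r"([^:\s]+):(\d+)\+(\d+)", text)
    if m:
        start = int(m.group(2))
        return Query("region", m.group(1), start=start, end=start + int(m.group(3)))

    # "rs2691286": dbSNP identifier.
    if re.fullmatch(r"rs\d+", text):
        return Query("dbsnp", text)

    # "ENSG00000104783": Ensembl identifier (assumed shape: ENS, letters, digits).
    if re.fullmatch(r"ENS[A-Z]*\d+", text):
        return Query("ensembl", text)

    # "KLKB1" or "KLKB1^2000": gene name with optional flanking length.
    m = re.fullmatch(r"([A-Za-z0-9_.-]+)(?:\^(\d+))?", text)
    if m:
        flank = int(m.group(2)) if m.group(2) else None
        return Query("gene", m.group(1), flank=flank)

    raise ValueError(f"unrecognized query: {text!r}")
```

Under these assumed patterns, `parse_query("chr19:450000+100000")` yields a region from 450000 to 550000 on chr19, matching the example in the note. The order of the checks matters: identifier patterns are tried before the more permissive gene-name pattern.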

Transcript

  • 1. Trends in Genomics: An Engineer’s Perspective
    Saul A. Kravitz, PhD
    December 2009
  • 2. Biggest Change: Sequencing is free
    2000: Factory, AB3700 @ Celera
    - 1k 500bp reads/day/sequencer = 0.5Mbp/day
    - Human Genome = ~ 190 sequencer yr, ~200M$
    2002: Factory, AB3730 @ JCVI
    - 10k 500bp reads/sequencer/day = 5Mbp/day
    - Human Genome = ~ 19 sequencer yr, ~10M$
    2010: Benchtop, 454 GS Junior
    - 70M 500bp reads/day = 35Gbp/day
    - Human genome = ~ 1 sequencer day, ~10k$
    2010: Service, Complete Genomics
    - Human genome = ~ 1 day, ~1k$
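The throughput figures on slide 2 can be checked with a few lines of arithmetic. This is a small sketch using only the numbers quoted on the slide; the 3 Gbp genome size and the 365-day sequencer year are assumptions added here to turn raw yield into approximate coverage, and the function names are illustrative.

```python
BP_PER_READ = 500                 # read length quoted on the slide
HUMAN_GENOME_BP = 3_000_000_000   # assumption: ~3 Gbp haploid human genome
DAYS_PER_YEAR = 365               # assumption: a sequencer year is 365 days

# (platform, reads per sequencer per day, sequencer-days per human genome)
PLATFORMS = [
    ("2000 AB3700", 1_000, 190 * DAYS_PER_YEAR),
    ("2002 AB3730", 10_000, 19 * DAYS_PER_YEAR),
    ("2010 454 GS Junior", 70_000_000, 1),
]


def daily_yield_bp(reads_per_day: int) -> int:
    """Bases produced by one sequencer in one day."""
    return reads_per_day * BP_PER_READ


def genome_project_bp(reads_per_day: int, sequencer_days: int) -> int:
    """Total bases produced over the quoted sequencer time for one genome."""
    return daily_yield_bp(reads_per_day) * sequencer_days


for name, reads, days in PLATFORMS:
    total = genome_project_bp(reads, days)
    print(
        f"{name}: {daily_yield_bp(reads) / 1e6:g} Mbp/day, "
        f"{total / 1e9:.1f} Gbp per genome, "
        f"~{total / HUMAN_GENOME_BP:.1f}x of a 3 Gbp genome"
    )
```

All three rows come out to roughly 35 Gbp of raw sequence per genome, so the slide's sequencer-time estimates are mutually consistent: the raw data per genome stays about the same while the time to produce it shrinks from 190 sequencer years to one sequencer day.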
  • 3. New Bottlenecks
    Generating sequence data – free
    Data Management
    Data Query
    Data Analysis
    Breadth: Communities
    Depth: Populations (e.g., flu, human)
    Thinking is very pricy!
  • 4. Same Thinking $, More Data
    Project Cost
  • 5. The Crux of the Problem
    Genomic data interpreted in context
    How does my genome compare to all others
    Which other proteins are similar to mine
    Size of context is growing exponentially
    Growth is faster than Moore’s law
    Hard to fight an exponential
    BLASTP against NCBI NR
    All against all BLASTP of microbial proteins
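Slide 5's point that it is "hard to fight an exponential" can be made concrete with a toy model. The sketch below assumes that search work scales linearly with database size for a one-against-all search and quadratically for an all-against-all comparison, and that compute capacity doubles at a fixed interval. The doubling periods passed in are hypothetical inputs, not figures from the talk, and `relative_cost` is an illustrative name.

```python
def relative_cost(
    months: float,
    data_doubling_months: float,
    compute_doubling_months: float,
    all_vs_all: bool = False,
) -> float:
    """Wall-clock cost of a search after `months`, relative to today.

    Assumes one-vs-all work grows linearly with database size and
    all-vs-all work grows with its square (a simplification).
    """
    data_doublings = months / data_doubling_months
    compute_doublings = months / compute_doubling_months
    work_doublings = 2 * data_doublings if all_vs_all else data_doublings
    return 2.0 ** (work_doublings - compute_doublings)


# Hypothetical scenario: data doubles every 12 months, compute every 24.
for years in (1, 2, 4):
    months = 12 * years
    one = relative_cost(months, 12, 24)
    pairwise = relative_cost(months, 12, 24, all_vs_all=True)
    print(f"after {years} yr: one-vs-all x{one:g}, all-vs-all x{pairwise:g}")
```

Whenever data doubles faster than compute, the exponent stays positive and the relative cost grows without bound. The all-against-all case is worse because its work term doubles twice for every doubling of the data.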
  • 6. Bioinformatics Isn’t High Energy Physics
    Data inputs are changing rapidly
    CE Chromatograms, 454 Flowgrams, Color Space
    Error models and read lengths are changing rapidly
    Tools evolving rapidly
    Difficult to track many academic tools
    High quality commercial platforms emerge
    Even when “cooks” use shared “ingredients”, “recipes” vary widely
    Faith-based science
    My dataset alone has limited value
    Computations are (relatively) IO Intensive
  • 7. Some Solutions and Directions
    Repeated processes must be automated
    Even if labor is free, deviations from SOP are costly
    Commercial Tools
    Market has expanded, quality improved
    Tools for exploring Human Variation
    The HuRef Browser
    Metagenomics Tools and Challenges
    Global Ocean Sampling Expedition
    Visualization tools
    Metagenomic Annotation
    Genome Standards Consortium and M5
    Clouds and Grids
    ScaaS: Science as a Service
  • 8. Personal Genomics:
    The future is now (ca 2008)
  • 9. HuRef Browser: Accelerate thinking
    Compare 2 published genomes
    Craig Venter’s Diploid Genome
    Composite NCBI-36
    Are differences real?
    Noisy data?
    Assembly errors?
    Analysis errors?
    Methods development requires curation by biologists
    As genomes accumulate, more acute challenge
  • 10. HuRef Browser: http://huref.jcvi.org
  • 11. Zinc Finger Protein, Chr19:57564487-57581356
    Transcript
    Gene
    Haplotype Blocks
    Variations
    NCBI-36
    Assembly-Assembly Mapping
    HuRef
    Assembly Structure
  • 12. Protein Truncated by 476 bp Insertion
    Heterozygous SNP
    Homozygous SNP
    Insertion
  • 13. Assembly Structure
    Insertion
  • 14. Genomics vs Metagenomics
    Genomics – ‘Old School’
    Study of a single organism's genome
    Genome sequence determined using shotgun sequencing and assembly
    >1300 microbes sequenced, first in 1995 (at TIGR)
    DNA usually obtained from pure cultures (<1%) or amplification of DNA from single cells
    Metagenomics
    Use genomics tricks on communities – no culturing
    Environmental shotgun sequencing of DNA or RNA
    Metadata provides context
  • 15. Metagenomic Questions
    Within an environment
    What biological functions are present (absent)?
    What organisms are present (absent)?
    Compare data from (dis)similar environments
    What are the fundamental rules of microbial ecology
    Adapting to environmental conditions?
    How do communities respond to stimuli?
    How does community structure change?
    Search for novel proteins and protein families
    And diversity within known families
  • 16. Global Ocean Sampling Expedition
  • 17. Global Ocean Sampling Expedition
    • 178 Total Sampling Locations
    • Pilot: 2.0M reads, 4/04
    • Phase 1: 7.7M reads, >6M proteins, 3/07
    • Phase 2-IO: 2.2M reads, 3/08
    • Phase 2: ~30M reads, 2010?
    • Diverse Environments
    • Open ocean, estuary, embayment, upwelling, fringing reef, atoll…
  • 24. GOS: Sequence Diversity in the Ocean, Rusch et al (PLoS Biology 2007)
    Most sequence reads are unique
    Very limited assembly
    Most sequences not taxonomically anchored
    Reference genomes a basis set? Not really.
    Several hundred isolates
    Challenges
    Relating shotgun data to reference genomes
    Structural and Functional Annotation
  • 25. Browsing Large Data Collections: Fragment Recruitment Viewer
    Microbial Communities vs Reference Genomes
    Millions of sequence reads vs Thousands of genomes
    Definition: A read is recruited to a sequence if:
    End-to-end blastN alignment exists
    Rapid Hypothesis Generation and Exploration
    How do cultured and wildtype genomes differ?
    Insertions, deletions, translocations
    Correlation with environmental factors
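The recruitment rule on slide 25 (a read is recruited if an end-to-end blastN alignment exists) can be sketched as a filter over alignment records. This is an illustration under stated assumptions, not the Fragment Recruitment Viewer's code: the `Hit` record, the `max_unaligned` tolerance, and the choice to keep the highest-identity hit per read are all hypothetical. The output pairs mirror the viewer's two axes, genomic position and sequence similarity.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Hit:
    """Hypothetical alignment record (coordinates 1-based, inclusive)."""
    read_id: str
    read_len: int
    q_start: int         # first aligned base of the read
    q_end: int           # last aligned base of the read
    s_start: int         # alignment start on the reference genome
    s_end: int           # alignment end (smaller than s_start on reverse strand)
    pct_identity: float


def is_end_to_end(hit: Hit, max_unaligned: int = 0) -> bool:
    """True if at most `max_unaligned` read bases fall outside the alignment."""
    unaligned = (hit.q_start - 1) + (hit.read_len - hit.q_end)
    return unaligned <= max_unaligned


def recruit(hits: Iterable[Hit], max_unaligned: int = 0) -> List[Tuple[int, float]]:
    """Return (genomic position, percent identity) for each recruited read.

    Keeps the highest-identity end-to-end hit per read (an assumption).
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        if not is_end_to_end(hit, max_unaligned):
            continue
        current = best.get(hit.read_id)
        if current is None or hit.pct_identity > current.pct_identity:
            best[hit.read_id] = hit
    # Use the leftmost reference coordinate so both strands plot consistently.
    return sorted((min(h.s_start, h.s_end), h.pct_identity) for h in best.values())
```

Plotting the returned pairs with position on the x-axis and identity on the y-axis gives the kind of scatter the viewer displays. Gaps along the x-axis would then suggest reference regions that are absent from the sampled community.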
  • 26. Fragment Recruitment Viewer
    Sequence Similarity
    Genomic Position
    Doug Rusch, JCVI
  • 27. Doug Rusch and Michael Press
  • 28. Doug Rusch and Michael Press
  • 29. GOS Protein Analysis: Yooseph et al (PLoS Biology 2007)
    Novel clustering process
    • Sequence similarity based
    • Predict putative proteins and group into related clusters
    • Include GOS and all known proteins
    Findings
    • GOS proteins
    • cover ~all existing prokaryotic families
    • expand diversity of known protein families
    • ~10% of large clusters are novel
    • Many are of viral origin
    • No saturation in the rate of novel protein family discovery
  • Added Protein Family Diversity
    Yooseph et al (PLoS 2007)
    Rubisco homologs
    Known eukaryotes
    Known prokaryotes
    GOS prokaryotes
    New Groups
  • 37. Annotation of Environmental Shotgun Data
    Challenges:
    Lack of context
    Protein fragments
    Gene Finding
    Yooseph’s Protein Clusters + Metagene
    Functional Assignment
    Variation of JCVI prok annotation pipeline*
    Leverages protein cluster annotation -- soon
    Result:
    Quality Nearly Comparable to Prokaryotic Genomic Annotation
  • 38. Protein Clusters: Advantages and Disadvantages
    Weaknesses
    Homology-based
    Stateful (also a strength)
    Less sensitive (for now)
    Strengths
    Exponential  Linear?
    Learns over time
    Easy to maintain
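The "stateful" and "learns over time" properties on slide 38 suggest an incremental scheme: each new protein is compared against existing cluster representatives instead of against every known sequence. The sketch below illustrates that idea only. It is not the clustering procedure of Yooseph et al: k-mer Jaccard similarity stands in for real sequence alignment, and `ClusterStore`, the threshold, and the first-match rule are assumptions.

```python
from typing import List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Set of overlapping k-mers in a sequence (empty if shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ClusterStore:
    """Toy stateful cluster collection with one representative per cluster."""

    def __init__(self, threshold: float = 0.5, k: int = 3) -> None:
        self.threshold = threshold
        self.k = k
        self.representatives: List[Set[str]] = []  # k-mer profile of each founder
        self.members: List[List[str]] = []         # protein names per cluster

    def add(self, name: str, seq: str) -> int:
        """Assign a protein to the first similar cluster, or found a new one."""
        profile = kmers(seq, self.k)
        for idx, rep in enumerate(self.representatives):
            if jaccard(profile, rep) >= self.threshold:
                self.members[idx].append(name)
                return idx
        self.representatives.append(profile)
        self.members.append([name])
        return len(self.representatives) - 1
```

Each call to `add` costs one comparison per existing cluster, so the work for a new batch scales with the number of clusters and not with the total number of sequences seen so far. That is one way to read the slide's "Exponential → Linear?" hope. The trade-off, also noted on the slide, is that results depend on the accumulated state, including the order in which sequences arrived.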
  • 39. Increasing the pressure
    Nextgen + Metagenomics
    Deeper collections
    Short sequences  less informative
    How should we annotate?
    When in doubt, use BLAST against NRAA, and other large and fast-growing collections
    Annotation needs growing dramatically
    24x7 quality software
    Special Hardware: FPGA? Graphics/CUDA? SIMD/SSE?
    New algorithms?
    Back to supercomputers?
    Sharing data and computes
    Standardization of data, metadata, and computes
    Folker Meyer, ANL
  • 40. Science as a Service (ScaaS)
    Standard tools as services
    Service-Oriented Architecture
    Supported by HPC as necessary
    Grid workflow for integration
    Maintain tools & data in scalable compute environment
    Celera Assembler in the clouds
  • 41. Vision for High Throughput Science
    Today:
    Scientist
    Construction of the Ark. Nuremberg Chronicle (1493).
  • 42. Vision for High Throughput Science
    Engineers
    Scientist
    +
    http://freepages.genealogy.rootsweb.ancestry.com/~thegrove/gec2a.html
    Rodin’s Thinker
  • 43. Credits
    JCVI Informatics Team
    Support
    DOE
    Gordon and Betty Moore Foundation
    NIAID