Trends In Genomics

Trends in Genomics: An Engineer’s Perspective Saul A. Kravitz, PhD December 2009

Biggest Change: Sequencing is free
2000: Factory, AB3700 @ Celera
- 1k 500bp reads/day/sequencer = 0.5Mbp/day
- Human Genome = ~190 sequencer yr, ~200M$
2002: Factory, AB3730 @ JCVI
- 10k 500bp reads/sequencer/day = 5Mbp/day
- Human Genome = ~19 sequencer yr, ~10M$
2010: Benchtop, 454 GS Junior
- 70M 500bp reads/day = 35Gbp/day
- Human genome = ~1 sequencer day, ~10k$
2010: Service, Complete Genomics
- Human genome = ~1 day, ~1k$
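
The sequencer-time figures on this slide are mutually consistent if each human genome project needs about 35 Gbp of raw sequence (roughly 11 to 12x coverage of a ~3 Gbp genome). That 35 Gbp target is an inference from the 2010 line (one sequencer day at 35 Gbp/day), not a number stated on the slide. A minimal Python sketch of the arithmetic under that assumption:

```python
# Assumption: total raw sequence needed per human genome, inferred from the
# 2010 figure (1 sequencer-day at 35 Gbp/day). Not stated on the slide.
TARGET_BP = 35e9

# Throughput per sequencer per day, as given on the slide.
PLATFORMS = {
    "2000 AB3700": 1_000 * 500,               # 1k reads x 500 bp = 0.5 Mbp/day
    "2002 AB3730": 10_000 * 500,              # 10k reads x 500 bp = 5 Mbp/day
    "2010 454 GS Junior": 70_000_000 * 500,   # 70M reads x 500 bp = 35 Gbp/day
}


def sequencer_days(target_bp, bp_per_day):
    """Days one sequencer needs to produce target_bp of raw sequence."""
    return target_bp / bp_per_day


for name, rate in PLATFORMS.items():
    days = sequencer_days(TARGET_BP, rate)
    print(f"{name}: {days:,.0f} sequencer-days ({days / 365:.1f} sequencer-years)")
```

Under this assumption the three platforms come out at about 192 sequencer-years, 19 sequencer-years, and one sequencer-day, which matches the rounded values on the slide.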

New Bottlenecks
- Generating sequence data – free
- Data Management
- Data Query
- Data Analysis
  - Breadth: Communities
  - Depth: Populations (e.g., flu, human)
- Thinking is very pricy!

Same Thinking $, More Data
[Figure: Project Cost]

The Crux of the Problem
- Genomic data interpreted in context
  - How does my genome compare to all others?
  - Which other proteins are similar to mine?
- Size of context is growing exponentially
  - Growth is faster than Moore’s law
- Hard to fight an exponential
  - BLASTP against NCBI NR
  - All against all BLASTP of microbial proteins
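
To make "hard to fight an exponential" concrete: to a first approximation, searching one query against a database costs work proportional to the database size N, while an all-against-all comparison costs work proportional to N squared. If N doubles faster than compute does, run time grows even on new hardware. The sketch below is illustrative only; the doubling times are assumptions chosen for the example, not figures from the slides.

```python
def relative_runtime(years, data_doubling_yr, compute_doubling_yr, all_vs_all=False):
    """Run time after `years`, relative to today (today = 1.0).

    Work grows with database size N for a query-vs-database search and
    with N**2 for an all-against-all comparison. Compute speed grows on
    its own doubling schedule.
    """
    data_growth = 2 ** (years / data_doubling_yr)
    compute_growth = 2 ** (years / compute_doubling_yr)
    work = data_growth ** 2 if all_vs_all else data_growth
    return work / compute_growth


# Illustrative assumptions only: data doubles every year,
# compute doubles every two years.
for years in (2, 4, 6):
    single = relative_runtime(years, 1.0, 2.0)
    pairwise = relative_runtime(years, 1.0, 2.0, all_vs_all=True)
    print(f"after {years} yr: query-vs-db x{single:.0f}, all-vs-all x{pairwise:.0f}")
```

With these assumed rates, the query-vs-database search slows down steadily, and the all-against-all job slows down far faster, which is why the two BLASTP workloads named on the slide are the painful ones.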

Bioinformatics Isn’t High Energy Physics
- Data inputs are changing rapidly
  - CE Chromatograms, 454 Flowgrams, Color Space
  - Error models and read lengths are changing rapidly
- Tools evolving rapidly
  - Difficult to track many academic tools
  - High quality commercial platforms emerge
- Even when “cooks” use shared “ingredients”, “recipes” vary widely
  - Faith based science
- My dataset alone has limited value
- Computations are (relatively) IO Intensive

Some Solutions and Directions
- Repeated process must be automated
  - Even if labor is free, deviations from SOP are costly
- Commercial Tools
  - Market has expanded, quality improved
- Tools for exploring Human Variation
  - The HuRef Browser
- Metagenomics Tools and Challenges
  - Global Ocean Sampling Expedition
  - Visualization tools
  - Metagenomic Annotation
  - Genomic Standards Consortium and M5
- Clouds and Grids
  - ScaaS: Science as a Service

Personal Genomics: The future is now (ca 2008)

HuRef Browser: Accelerate thinking
- Compare 2 published genomes
  - Craig Venter’s Diploid Genome
  - Composite NCBI-36
- Are differences real?
  - Noisy data? Assembly errors? Analysis errors?
- Methods development requires curation by biologists
- As genomes accumulate, more acute challenge

HuRef Browser: http://huref.jcvi.org

Zinc Finger Protein, Chr19:57564487-57581356
[Figure: browser screenshot; tracks: Transcript, Gene, Haplotype Blocks, Variations, NCBI-36, Assembly-Assembly Mapping, HuRef Assembly Structure]

Protein Truncated by 476 bp Insertion
[Figure: browser screenshot; labels: Heterozygous SNP, Homozygous SNP, Insertion]

Genomics vs Metagenomics
- Genomics – ‘Old School’
  - Study of a single organism's genome
  - Genome sequence determined using shotgun sequencing and assembly
  - >1300 microbes sequenced, first in 1995 (at TIGR)
  - DNA usually obtained from pure cultures (<1%) or amplification of DNA from single cells
- Metagenomics
  - Use genomics tricks on communities – no culturing
  - Environmental shotgun sequencing of DNA or RNA
  - Metadata provides context

Metagenomic Questions
- Within an environment
  - What biological functions are present (absent)?
  - What organisms are present (absent)?
- Compare data from (dis)similar environments
  - What are the fundamental rules of microbial ecology?
  - Adapting to environmental conditions?
  - How do communities respond to stimuli?
  - How does community structure change?
- Search for novel proteins and protein families
  - And diversity within known families

Global Ocean Sampling Expedition

Global Ocean Sampling Expedition
- Pilot: 2.0M reads (4/04)
- Phase 1: 7.7M reads, >6M proteins (3/07)
- Phase 2-IO: 2.2M reads (3/08)
- Phase 2: ~30M reads (2010?)
- Open ocean, estuary, embayment, upwelling, fringing reef, atoll…

GOS: Sequence Diversity in the Ocean
Rusch et al (PLoS Biology 2007)
- Most sequence reads are unique
  - Very limited assembly
  - Most sequences not taxonomically anchored
- Reference genomes a basis set? Not really.
  - Several hundred isolates
- Challenges
  - Relating shotgun data to reference genomes
  - Structural and Functional Annotation

Browsing Large Data Collections: Fragment Recruitment Viewer
- Microbial Communities vs Reference Genomes
  - Millions of sequence reads vs Thousands of genomes
- Definition: A read is recruited to a sequence if:
  - End-to-end blastN alignment exists
- Rapid Hypothesis Generation and Exploration
  - How do cultured and wildtype genomes differ?
  - Insertions, deletions, translocations
  - Correlation with environmental factors
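
The recruitment definition above can be sketched as a filter over alignment hits: keep a read only if its alignment to the reference covers the read from end to end, then record where it landed and how similar it was. This is a sketch under stated assumptions, not the viewer's actual implementation: the `Hit` record layout, the helper names, and the `end_slop` tolerance are all hypothetical, and the published GOS criteria may differ in detail.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One read-vs-reference alignment (hypothetical record layout)."""
    read_id: str
    ref_id: str
    pct_identity: float
    q_start: int  # 1-based, inclusive, on the read
    q_end: int
    s_start: int  # 1-based position on the reference
    s_end: int


def is_recruited(hit, read_len, end_slop=0):
    """True if the alignment spans the read end to end.

    `end_slop` is an assumed tolerance: the number of unaligned bases
    allowed at each end of the read.
    """
    lo, hi = sorted((hit.q_start, hit.q_end))
    return lo <= 1 + end_slop and hi >= read_len - end_slop


def recruitment_points(hits, read_lengths, end_slop=0):
    """Yield (genomic position, percent identity) for each recruited read."""
    for hit in hits:
        if is_recruited(hit, read_lengths[hit.read_id], end_slop):
            yield (min(hit.s_start, hit.s_end), hit.pct_identity)


# Made-up example data.
hits = [
    Hit("r1", "genomeA", 98.5, 1, 500, 10_001, 10_500),   # full length
    Hit("r2", "genomeA", 91.0, 40, 480, 52_300, 52_740),  # partial alignment
    Hit("r3", "genomeA", 76.2, 3, 498, 80_497, 80_002),   # reverse strand
]
read_lengths = {"r1": 500, "r2": 500, "r3": 500}
points = list(recruitment_points(hits, read_lengths, end_slop=5))
print(points)
```

Each resulting pair is one point for a plot of sequence similarity against genomic position; gaps or bands in such a plot are what prompt the questions about insertions, deletions, and translocations.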

Fragment Recruitment Viewer
[Figure: recruitment plot; axes: Sequence Similarity, Genomic Position. Credit: Doug Rusch, JCVI]

GOS Protein Analysis
Yooseph et al (PLoS Biology 2007)
- Novel clustering process
  - Predict putative proteins and group into related clusters
  - Include GOS and all known proteins
- Findings
