The flood of next-gen sequencing data is changing the landscape of computational biology, increasing the need for more robust infrastructure, tools, and visualization techniques.
3. New Bottlenecks: Generating sequence data: free. Data Management. Data Query. Data Analysis. Breadth: Communities. Depth: Populations (e.g., flu, human). Thinking is very pricy!
5. The Crux of the Problem: Genomic data interpreted in context. How does my genome compare to all others? Which other proteins are similar to mine? Size of context is growing exponentially. Growth is faster than Moore’s law. Hard to fight an exponential. BLASTP against NCBI NR. All-against-all BLASTP of microbial proteins.
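One way to see why an all-against-all comparison is hard to keep up with: the number of protein pairs grows roughly with the square of the collection size. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and counting unordered pairs is an assumed proxy for search cost, not a measurement from the talk.

```python
def all_against_all_pairs(n_proteins: int) -> int:
    """Number of unordered protein pairs in an all-against-all comparison."""
    return n_proteins * (n_proteins - 1) // 2


def work_growth(n_before: int, n_after: int) -> float:
    """Factor by which the pair count grows when the collection grows."""
    return all_against_all_pairs(n_after) / all_against_all_pairs(n_before)


if __name__ == "__main__":
    # Doubling the collection roughly quadruples the number of pairs.
    print(work_growth(1000, 2000))
```

Under this assumption, each doubling of the collection costs about four times as much work, so hardware that merely doubles in speed falls further behind.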
6. Bioinformatics Isn’t High Energy Physics: Data inputs are changing rapidly: CE chromatograms, 454 flowgrams, color space. Error models and read lengths are changing rapidly. Tools evolving rapidly. Difficult to track many academic tools. High quality commercial platforms emerge. Even when “cooks” use shared “ingredients”, “recipes” vary widely. Faith based science. My dataset alone has limited value. Computations are (relatively) IO intensive.
7. Some Solutions and Directions: Repeated processes must be automated. Even if labor is free, deviations from SOP are costly. Commercial tools: market has expanded, quality improved. Tools for exploring human variation: the HuRef Browser. Metagenomics tools and challenges: Global Ocean Sampling Expedition, visualization tools, metagenomic annotation, Genomic Standards Consortium and M5. Clouds and grids. ScaaS: Science as a Service.
14. Genomics vs Metagenomics. Genomics (‘Old School’): Study of a single organism's genome. Genome sequence determined using shotgun sequencing and assembly. >1300 microbes sequenced, first in 1995 (at TIGR). DNA usually obtained from pure cultures (<1%) or amplification of DNA from single cells. Metagenomics: Use genomics tricks on communities, with no culturing. Environmental shotgun sequencing of DNA or RNA. Metadata provides context.
15. Metagenomic Questions. Within an environment: What biological functions are present (absent)? What organisms are present (absent)? Compare data from (dis)similar environments: What are the fundamental rules of microbial ecology? Adapting to environmental conditions? How do communities respond to stimuli? How does community structure change? Search for novel proteins and protein families, and diversity within known families.
24. GOS: Sequence Diversity in the Ocean. Rusch et al. (PLoS Biology 2007). Most sequence reads are unique. Very limited assembly. Most sequences not taxonomically anchored. Reference genomes a basis set? Not really. Several hundred isolates. Challenges: Relating shotgun data to reference genomes. Structural and functional annotation.
25. Browsing Large Data Collections: Fragment Recruitment Viewer. Microbial communities vs reference genomes. Millions of sequence reads vs thousands of genomes. Definition: a read is recruited to a sequence if an end-to-end BLASTN alignment exists. Rapid hypothesis generation and exploration. How do cultured and wild-type genomes differ? Insertions, deletions, translocations. Correlation with environmental factors.
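The recruitment definition above can be sketched as a filter over BLASTN hits. This is a minimal illustration, not the viewer's actual implementation: it assumes hits arrive in BLAST's standard 12-column tabular format, that read lengths are supplied separately, and that "end-to-end" means the alignment spans the whole read to within a hypothetical `slack` parameter. The function name `recruited_reads` is invented for this example.

```python
from typing import Dict, Iterable, Set, Tuple


def recruited_reads(
    blast_lines: Iterable[str],
    read_lengths: Dict[str, int],
    slack: int = 0,
) -> Set[Tuple[str, str]]:
    """Return (read_id, reference_id) pairs whose alignment spans the read.

    Columns assumed (standard tabular BLAST output): qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    recruited = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        read_id, ref_id = fields[0], fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        first, last = min(qstart, qend), max(qstart, qend)
        # End-to-end: alignment starts at the first base and reaches the last,
        # allowing `slack` unaligned bases at either end (an assumption).
        if first <= 1 + slack and last >= read_lengths[read_id] - slack:
            recruited.add((read_id, ref_id))
    return recruited


if __name__ == "__main__":
    hits = [
        "r1\tg1\t95.0\t100\t5\t0\t1\t100\t500\t599\t1e-40\t170",
        "r2\tg1\t98.0\t60\t1\t0\t20\t79\t900\t959\t1e-20\t100",
    ]
    print(recruited_reads(hits, {"r1": 100, "r2": 100}))
```

Plotting each recruited read at its reference coordinate against its percent identity is one way to then browse where a community matches or diverges from a reference genome.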
37. Annotation of Environmental Shotgun Data. Challenges: Lack of context. Protein fragments. Gene finding: Yooseph’s Protein Clusters + Metagene. Functional assignment: Variation of JCVI prok annotation pipeline*. Leverages protein cluster annotation (soon). Result: Quality nearly comparable to prokaryotic genomic annotation.
38. Protein Clusters: Advantages and Disadvantages. Weaknesses: Homology-based. Stateful (also a strength). Less sensitive (for now). Strengths: Exponential Linear? Learns over time. Easy to maintain.
39. Increasing the Pressure: Nextgen + Metagenomics. Deeper collections. Short sequences less informative. How should we annotate? When in doubt, use BLAST against NRAA and other large and fast-growing collections. Annotation needs growing dramatically. 24x7 quality software. Special hardware: FPGA? Graphics/CUDA? SIMD/SSE? New algorithms? Back to supercomputers? Sharing data and computes. Standardization of data, metadata, and computes. Folker Meyer, ANL.
40. Science as a Service (ScaaS): Standard tools as services. Service-Oriented Architecture. Supported by HPC as necessary. Grid workflow for integration. Maintain tools & data in scalable compute environment. Celera Assembler in the clouds.
41. Vision for High Throughput Science. Today: Scientist. Construction of the Ark, Nuremberg Chronicle (1493).
42. Vision for High Throughput Science. Engineers + Scientist. http://freepages.genealogy.rootsweb.ancestry.com/~thegrove/gec2a.html Rodin’s Thinker.
With the publication of the genomes of Craig Venter and Jim Watson, and with many additional human genomes being sequenced, the era of personal genomics is here. We are going to need really good tools to take advantage of this flood of data. My goal today is to share our experience building tools to understand the variation within a single individual’s genome, and to try to extrapolate forward to what we will need to understand larger collections of genomes.
* A chromosome or sequence id followed by a start position and region length, e.g., "chr19:450000+100000" to display the region from 450000-550000 on chromosome 19.
* A dbSNP id, e.g., "rs2691286"
* An Ensembl annotation identifier, e.g., "ENSG00000104783"
* A gene name, e.g., "KLKB1", optionally followed by the amount of flanking sequence to display, e.g., "KLKB1^2000"
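The four query forms listed above could be told apart with a few regular expressions. The sketch below is an assumption-laden illustration, not the browser's actual parser: the `Query` class, `parse_query` function, and the exact patterns (for example, treating Ensembl ids as "ENS", optional letters, then 11 digits) are hypothetical choices made to fit the examples given.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    kind: str                    # "region", "dbsnp", "ensembl", or "gene"
    name: str                    # sequence id, rs id, Ensembl id, or gene name
    start: Optional[int] = None  # region start (region queries only)
    end: Optional[int] = None    # region end = start + length
    flank: int = 0               # flanking bases requested (gene queries only)


# Patterns are assumptions inferred from the examples in the list above.
REGION = re.compile(r"^([^:\s]+):(\d+)\+(\d+)$")
DBSNP = re.compile(r"^rs\d+$")
ENSEMBL = re.compile(r"^ENS[A-Z]*\d{11}$")
GENE = re.compile(r"^([A-Za-z0-9_.-]+)(?:\^(\d+))?$")


def parse_query(text: str) -> Query:
    """Classify a search string; more specific forms are tried first."""
    text = text.strip()
    m = REGION.match(text)
    if m:
        start = int(m.group(2))
        return Query("region", m.group(1), start, start + int(m.group(3)))
    if DBSNP.match(text):
        return Query("dbsnp", text)
    if ENSEMBL.match(text):
        return Query("ensembl", text)
    m = GENE.match(text)
    if m:
        return Query("gene", m.group(1), flank=int(m.group(2) or 0))
    raise ValueError(f"unrecognized query: {text!r}")


if __name__ == "__main__":
    for q in ["chr19:450000+100000", "rs2691286", "ENSG00000104783", "KLKB1^2000"]:
        print(parse_query(q))
```

The gene-name pattern is tried last because it is the most permissive; a dbSNP or Ensembl id would otherwise be misread as a gene name.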
Zinc finger example: whole transcript ENST00000334564.
Insert is 467 bp; truncates the protein. VNTR, probably heterozygous. Pink = non-synonymous; yellow = synonymous.