Chambwe bosc2010
Usage Rights: CC Attribution-ShareAlike License
  • Applications of NGS. Explosion of NGS: a gap exists between current sequence-generation capabilities and the data-analysis capabilities needed to extract relevant biological insights.
  • Several sequencing platforms are available on the market, each with unique chemistry and each producing data with different characteristics. Throughput varies and can be very large.
  • A preponderance of NGS data file formats exists to represent these data.
  • Here is a list of characteristics we find desirable in an NGS file format. Transition: we developed file formats that meet these requirements. File formats alone are not sufficient, so we have developed a framework to use these formats and create analysis tools.
  • This is an outline of the Goby Software Framework
  • Now I will discuss File formats
  • PB: think XML, but better.
  • Brief overview of how schemas are written using PBs: a collection of messages of type readEntries.
  • Transition: to achieve compression, we gzip collections of messages.
  • Protocol Buffers do not support messages larger than a few megabytes. Goby's contribution is implementing Protocol Buffers in a way that removes the collection size limitation and scales to very large collections of messages. The limitation is overcome by splitting collections into chunks: each chunk of a compact reads file represents 10,000 or fewer ReadEntry messages. Chunking supports semi-random access and is leveraged for parallel processing, because different servers can access chunks independently.
  • Gzip and chunking meet the requirements for random access and streaming. Transition: how well do we do with respect to file sizes?
  • Apples-to-apples comparison. Multiple alignments.
  • The formats are compact. How can YOU use them? Low-level APIs.
  • One practical example: printing entries in an alignment file. Goby makes it easy to write code that iterates over the contents of multiple compact alignment files.
  • Goby provides utilities to help build analysis pipelines.
  • MAQC sample B = Ambion Human Brain Reference RNA (HBRR or HBR, Catalog #6050), sequenced on multiple platforms. Normalized gene expression counts (RPKM). Random hexamer priming results in a bias in nucleotide composition at the start of sequence reads (Hansen KD et al. Nucleic Acids Res. 2010 Jul 1;38(12):e131. Epub 2010 Apr 14). The Hansen reweighting scheme that corrects for this bias is implemented in Goby for genes.
  • Ambion Human Brain Reference RNA (MAQCII sample B): different brain regions from 23 donors.

Chambwe bosc2010 Presentation Transcript

  • THE GOBY FRAMEWORK: TOWARDS EFFICIENT NEXT-GENERATION SEQUENCING DATA ANALYSIS
    • Nyasha Chambwe, Kevin C. Dorff, Marko Srdanovic, Xutao Deng, Stuart J.D. Andrews, Fabien Campagne
    • The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine;
    • Department of Physiology and Biophysics
    • Weill Medical College of Cornell University
    http://goby.campagnelab.org
  • Applications of Next Generation Sequencing (McPherson J.D. Nat Methods. 2009)
  • Next Generation Sequencers (Metzker, M.L. Nat Rev Genet. 2010)
    • Roche/454 GS FLX Titanium: pyrosequencing; average read length 330 bp; run time 0.35 days; 0.45 gigabases/run; 1.36 million reads/run
    • Illumina/Solexa GA IIe: reversible terminators; average read length 75 bp; run time 4 days; 18 gigabases/run; 240 million reads/run
    • Life Technologies SOLiD 3: sequencing by ligation; average read length 50 bp; run time 7 days; 30 gigabases/run; 600 million reads/run
    • Helicos BioSciences Heliscope: reversible terminators; average read length 32 bp; run time 8 days; 37 gigabases/run; 1156 million reads/run
  • Next Generation Sequence Data Formats
    • Key Limitations
    • Text based formats do not scale well to handle large amounts of data
    • Naïve compression prevents semi-random access
  • File Format Wish List
    • Structured schema/data representation
    • Well specified and documented (not ambiguous)
    • Fast parsing speed
    • Language and operating system portability
    • Backward and forward compatibility
    • Compression
    • Random access
    • Streaming
  • The Goby Software Framework
    • File Formats: reads, alignments, histograms
    • Low Level APIs (Java, C++, Python): Readers, Writers, Iterators
    • Tools/Utilities: File Format Conversions, Alignment Processing, Visualization
    • Applications: RNA-Seq Pipeline, IGV Plug-in
  • Structured non-ambiguous representation
    • Goby uses Protocol Buffers (PB) to provide “a flexible, efficient, automated mechanism for serializing structured data” (PB website)
    • PB generates parsers in different languages, e.g., Java, C++, Python, Perl, R, C, C#, Visual Basic, PHP, Objective C, Ruby, Common Lisp
    • Provide forward and backward compatibility
  • Goby compact formats
    • Data is represented by Protocol Buffers as a message defined by a .proto file
  • Goby compact formats
    • Chunking:
    • Semi-random access
    • Efficient parallel processing
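
The chunking idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Goby's actual on-disk format: JSON stands in for the Protocol Buffers encoding, the 4-byte length prefix and the in-memory offset list are invented for the example, and the names `write_chunked`, `read_chunk`, and `iter_entries` are hypothetical. The chunk size of 10,000 entries is the figure quoted in the speaker notes.

```python
import gzip
import io
import json
import struct

CHUNK_SIZE = 10_000  # entries per chunk (figure taken from the speaker notes)


def write_chunked(entries, stream, chunk_size=CHUNK_SIZE):
    """Write entries as independently gzipped chunks; return each chunk's byte offset."""
    offsets = []
    for start in range(0, len(entries), chunk_size):
        chunk = entries[start:start + chunk_size]
        # JSON is a stand-in for a serialized collection of messages (assumption).
        payload = gzip.compress(json.dumps(chunk).encode("utf-8"))
        offsets.append(stream.tell())
        stream.write(struct.pack(">I", len(payload)))  # 4-byte big-endian length prefix
        stream.write(payload)
    return offsets


def read_chunk(stream, offset):
    """Semi-random access: seek to one chunk and decompress only that chunk."""
    stream.seek(offset)
    (length,) = struct.unpack(">I", stream.read(4))
    return json.loads(gzip.decompress(stream.read(length)).decode("utf-8"))


def iter_entries(stream, offsets):
    """Streaming: yield every entry while holding one chunk in memory at a time."""
    for offset in offsets:
        yield from read_chunk(stream, offset)


# Hypothetical data: 25,000 toy read entries written to an in-memory file.
entries = [{"read_index": i, "sequence": "ACGT"} for i in range(25_000)]
buffer = io.BytesIO()
offsets = write_chunked(entries, buffer)
last_chunk = read_chunk(buffer, offsets[-1])  # decoded without touching earlier chunks
```

Because each chunk is compressed on its own, a worker that is handed a single offset can decode its share of the file without reading the rest, which is what makes both semi-random access and parallel processing possible.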
  • Goby File Size Comparisons: MAQC sample B = Ambion Human Brain Reference RNA (HBRR or HBR, Catalog #6050), sequenced on four next-gen platforms
  • Alignment Iterator
    • Code fragment to:
    • Scan through two alignments (input1, input2)
    • Print information for each entry
    • Print information for chromosomes 1,2,X only
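
The code fragment from this slide did not survive in the transcript. The sketch below shows the same pattern in plain Python and is not the Goby API: the alignments are ordinary lists of dictionaries, and the function name `iter_alignment_entries` and the field names `query_index`, `chromosome`, and `position` are all hypothetical.

```python
from itertools import chain


def iter_alignment_entries(alignments, chromosomes=None):
    """Yield entries from several alignments in turn, optionally filtered by chromosome."""
    wanted = set(chromosomes) if chromosomes is not None else None
    for entry in chain.from_iterable(alignments):
        if wanted is None or entry["chromosome"] in wanted:
            yield entry


# Two toy alignments standing in for input1 and input2 (made-up values).
input1 = [
    {"query_index": 0, "chromosome": "1", "position": 1042},
    {"query_index": 1, "chromosome": "7", "position": 88},
]
input2 = [
    {"query_index": 2, "chromosome": "X", "position": 5310},
    {"query_index": 3, "chromosome": "2", "position": 17},
]

# Scan both alignments and print entries on chromosomes 1, 2, and X only.
for entry in iter_alignment_entries([input1, input2], chromosomes={"1", "2", "X"}):
    print(entry["query_index"], entry["chromosome"], entry["position"])
```

Writing the scan as a generator keeps memory use flat regardless of how many alignment files are chained together, which matters for datasets of the size discussed here.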
  • RNA-Seq Pipeline
    • Objective: To determine levels of expression in samples and perform differential expression analysis
    • Supports:
      • Mapping to full genome
      • Mapping to annotated cDNAs (reads match inside exons and across exon-exon boundaries)
    • Sequencing platform independent
    • Published normalization methods implemented
      • Mortazavi A et al. Nat Methods. 2008
      • Bullard JH et al. BMC Bioinformatics. 2010
    • Bias correction for platform specific biases
      • Hansen KD et al. Nucleic Acids Res. 2010
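
The normalization published by Mortazavi et al. is RPKM (reads per kilobase of exon model per million mapped reads). As a minimal sketch of the formula only, not of Goby's implementation, the function below computes it for a single gene; the function name, parameter names, and example numbers are illustrative assumptions.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the gene,
    N is the total number of mapped reads in the sample, and L is the exon-model
    length of the gene in base pairs.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2 kb gene in a sample with 10 million mapped reads.
example = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=10_000_000)
print(example)  # 25.0
```

Dividing by gene length and by sequencing depth is what lets expression values be compared across genes and across samples sequenced to different depths.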
  • Sample RNA-Seq Results
  • Conclusion
    • Goby file formats are efficient and non-ambiguous
    • Alignments are about five times smaller than BAM alignments
    • API makes it easy to write efficient code to handle large datasets
    • Framework provides utilities and analysis pipelines for common NGS data analysis tasks
  • Acknowledgements
    • Campagne Lab
    • Fabien Campagne
    • Kevin C. Dorff
    • Marko Srdanovic
    • Stuart J.D. Andrews
    • Broad Institute
    • Jim Robinson
    • FDA/NCTR
    • Leming Shi
    • Sequencing Quality Control Project (SEQC)
    • Helicos
    • Illumina
    • Life Technologies
    • Roche
    http://goby.campagnelab.org
  • cDNA Search