Applications of NGS. Explosion of NGS. A gap exists between current sequence-generation capacity and the data-analysis capabilities needed to extract relevant biological insights.
Several sequencing platforms are available on the market, each with unique chemistry and producing data with different characteristics. Throughput varies widely.
A profusion of NGS data file formats exists to represent these data.
Here is a list of characteristics we find desirable in an NGS file format. Transition: we developed file formats that meet these requirements. File formats alone are not sufficient, so we also developed a framework to use these formats and create analysis tools.
This is an outline of the Goby Software Framework
Now I will discuss file formats.
Protocol Buffers (PB): think XML, but better.
Brief overview of how schemas are written using Protocol Buffers. A collection of messages of type readEntries.
Transition: to achieve compression, we gzip collections of messages.
Protocol Buffers does not support messages larger than a few megabytes. Goby's contribution is to implement Protocol Buffers in a way that removes this collection-size limit and scales to very large collections. The limit is overcome by splitting collections into chunks. Each chunk of a compact reads file holds 10,000 or fewer ReadEntry messages. Chunking supports semi-random access and is leveraged for parallel processing: different servers can access chunks independently.
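The chunking idea can be sketched in a few lines. This is a minimal illustration, not Goby's actual on-disk format: plain byte strings stand in for serialized ReadEntry messages, and the length-prefix framing, the offset list, and the function names `write_chunked` and `read_chunk` are assumptions made for the example.

```python
import gzip
import io
import struct

CHUNK_SIZE = 10_000  # maximum entries per chunk, as stated in the notes


def write_chunked(entries, stream, chunk_size=CHUNK_SIZE):
    """Write entries (bytes objects) as independent gzip chunks.

    Returns the byte offset of each chunk (a hypothetical index).
    """
    offsets = []
    for start in range(0, len(entries), chunk_size):
        chunk = entries[start:start + chunk_size]
        # Length-prefix each entry so a chunk can be split back into entries.
        payload = b"".join(struct.pack(">I", len(e)) + e for e in chunk)
        compressed = gzip.compress(payload)
        offsets.append(stream.tell())
        # Store the compressed size first so a reader knows how much to read.
        stream.write(struct.pack(">I", len(compressed)))
        stream.write(compressed)
    return offsets


def read_chunk(stream, offset):
    """Decode one chunk at a known offset without touching other chunks."""
    stream.seek(offset)
    (size,) = struct.unpack(">I", stream.read(4))
    payload = gzip.decompress(stream.read(size))
    entries, pos = [], 0
    while pos < len(payload):
        (n,) = struct.unpack_from(">I", payload, pos)
        pos += 4
        entries.append(payload[pos:pos + n])
        pos += n
    return entries


if __name__ == "__main__":
    buffer = io.BytesIO()
    reads = [f"read-{i}".encode() for i in range(25_000)]
    offsets = write_chunked(reads, buffer)
    # Each chunk can be handed to a different worker and decoded on its own.
    print(len(offsets), len(read_chunk(buffer, offsets[-1])))
```

Because every chunk is compressed separately, a worker that knows one offset can decode that chunk alone, which is the property the parallel-processing point relies on.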
Gzip and chunking meet the requirements for random access and streaming. Transition: how well do we do with respect to file sizes?
Apples-to-apples comparison. Multiple alignments.
Formats are compact. How can YOU use them? Low-level APIs.
One practical example: printing entries in an alignment file. Goby makes it easy to write code that iterates over the contents of multiple compact alignment files.
Goby provides utilities to help build analysis pipelines.
MAQC sample B = Ambion Human Brain Reference RNA (HBRR or HBR, Catalog #6050), sequenced on multiple platforms. Normalized gene expression counts: RPKM. Random hexamer priming results in a bias in nucleotide composition at the start of sequence reads (Hansen KD et al., Nucleic Acids Res. 2010 Jul 1;38(12):e131; Epub 2010 Apr 14). The Hansen reweighting scheme to correct for that bias is implemented in Goby for genes.
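For reference, RPKM (reads per kilobase of transcript per million mapped reads) can be computed as below. This is a sketch of the standard definition only; it is not Goby's code, the function name `rpkm` and its parameter names are chosen for illustration, and the Hansen reweighting step is not shown.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the
    gene, N the total number of mapped reads in the sample, and L the gene
    length in base pairs.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a 2,000 bp gene in a sample of 10 million mapped reads.
    print(rpkm(500, 2_000, 10_000_000))
```

Dividing by both gene length and library size is what makes counts comparable across genes and across sequencing runs of different depth.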
Ambion Human Brain Reference RNA: MAQC-II sample B. Different brain regions from 23 donors.