Kitzmiller Openhelisphereproject Bosc2008
 

    Kitzmiller Openhelisphereproject Bosc2008 Presentation Transcript

    • The Open HeliSphere™ project: True open source from the inventors of True Single Molecule Sequencing (tSMS™). Aaron Kitzmiller, BOSC 2008
    • Agenda
      • Introduction to the HeliScope Single Molecule Sequencer
      • Helicos and Open Source
      • The Open HeliSphere project
      • HeliSphere code
    • Single Molecule Sequencing by Synthesis: Hybridize Primer (~1/µm²) [figure: primer and template strand sequences]
    • Single Molecule Sequencing by Synthesis: Extend ‘G’ [figure: strand sequences after G incorporation]
    • SM Sequence by Synthesis: Wash [figure: strand sequences]
    • SM Sequence by Synthesis: Image [figure: strand sequences]
    • SM Sequence by Synthesis: Cleave [figure: strand sequences]
    • Flow Cell Imaging
      • 1 run => 2 flow cells
      • 1 flow cell => 25 channels
      • 1 channel => 1000 fields of view (FOV)
      • 1 FOV => 4 images
      • 8-10 million usable strands / channel
      [figure labels: Flow Cell; 25 Channels (1.6 x 90 mm); ~12 x 12 cm; flow cell volume = 180 µL]
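The imaging hierarchy on this slide multiplies out to per-run totals. The following is a minimal arithmetic sketch using only the numbers given above; the constant and function names are illustrative, and it assumes the 4 images per FOV are counted once per imaging pass, which the slide does not state.

```python
# Figures taken from the "Flow Cell Imaging" slide; names are illustrative.
FLOW_CELLS_PER_RUN = 2
CHANNELS_PER_FLOW_CELL = 25
FOVS_PER_CHANNEL = 1000
IMAGES_PER_FOV = 4  # assumed to be per imaging pass
USABLE_STRANDS_PER_CHANNEL = (8_000_000, 10_000_000)  # low and high estimate


def run_totals():
    """Multiply the per-level counts out to whole-run totals."""
    channels = FLOW_CELLS_PER_RUN * CHANNELS_PER_FLOW_CELL
    fovs = channels * FOVS_PER_CHANNEL
    images = fovs * IMAGES_PER_FOV
    strands = tuple(channels * n for n in USABLE_STRANDS_PER_CHANNEL)
    return {"channels": channels, "fovs": fovs,
            "images_per_pass": images, "usable_strands": strands}


totals = run_totals()
print(totals)
```

Under these assumptions a run covers 50 channels and 50,000 fields of view, and yields on the order of 400 to 500 million usable strands.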
    • Raw data collection [figure: grid of per-strand base calls, with "-" for no incorporation, across repeated C, T, A, G addition cycles]
    • HeliScope and HeliSphere
    • Helicos and Open Source
      • Helicos is an instrument company
      • The diversity of bioinformatics applications is too large for us to address internally
      • Open source bioinformatics applications benefit everyone, including instrument developers
      • Helicos bioinformatics applications
        • Internal development
        • Academic and industrial partnerships
        • Tool vendor partnerships
        • Open Source
          • Including contributions by Helicos to other projects (BioPerl, Bioconductor, etc.)
    • The Open HeliSphere project
      • Pre-launch – TODAY
        • SVN trunk checkout
        • Tarball download
        • openhelisphere-announce, openhelisphere-devel
        • Wiki documentation
        • Datasets
      • Full product launch
        • Patch submission to HeliSphere core
        • Bug tracking
        • HeliSphere contrib repository
        • http://open.helicosbio.com
    • The Open HeliSphere project
      • License
        • Dual GPL + commercial
      • Infrastructure
        • MediaWiki-driven website (semantically enhanced)
        • SourceForge mail, tarballs, Subversion source code control
        • http://open.helicosbio.com
    • Bioinformatics Pipeline for Digital Gene Expression
    • SRF file processing
      • HeliScope sequencers create SRF files
        • Consortium-driven binary read container
          • Strong Sanger involvement
          • Used for submission to the NCBI Short Read Archive
        • Reads are stored in ZTR blocks
        • Instrument and run information is stored in an XML document
        • SRF processing converts reads into a smaller, Helicos-oriented format called SMS.
          • Perl scripts run the srf2sms binary
          • SMS places reads into blocks that are indexed by key Helicos data fields (flowcell, channel, position)
          • Extracted instrument and run XML are used for pipeline configuration
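To illustrate the idea of grouping reads into blocks keyed by flowcell and channel, here is a minimal in-memory sketch. It is not the SMS binary format or its API: the `Read` and `BlockIndex` classes, their fields, and the FASTA header layout are all hypothetical, and `select_channel` only borrows its name from the C++ iterator call shown elsewhere in this deck.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical read record carrying the three key fields."""
    flowcell: int
    channel: int
    position: int
    sequence: str


class BlockIndex:
    """Group reads into blocks keyed by (flowcell, channel)."""

    def __init__(self):
        self._blocks = defaultdict(list)

    def add(self, read):
        self._blocks[(read.flowcell, read.channel)].append(read)

    def select_channel(self, flowcell, channel):
        """Return the reads of one block, ordered by position."""
        block = self._blocks.get((flowcell, channel), [])
        return sorted(block, key=lambda r: r.position)


def to_fasta(read):
    """Format one read as a FASTA record (header layout is an assumption)."""
    return f">{read.flowcell}-{read.channel}-{read.position}\n{read.sequence}\n"


index = BlockIndex()
index.add(Read(1, 7, 42, "GACGTTATGG"))
index.add(Read(1, 7, 3, "TGATGGTAGT"))
index.add(Read(2, 7, 5, "AACGATGATG"))
selected = index.select_channel(1, 7)
fasta = "".join(to_fasta(r) for r in selected)
```

Keying blocks on the instrument coordinates means a per-channel query touches only one block instead of scanning every read.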
    • SMS file
      • SMS is a general binary data container
        • Manipulate with executables: smsls, sms2txt, srf2sms, filterSMS, extractSMS, mergeSMS
        • Access data directly via C++ iterator API
      read_iterator<read_record> rit(smsfile);
      read_record read;

      // query the SMS file for the desired flowcell/channel
      rit.select_channel(flowcell, channel);

      // iterate over the result set; the default output format to an ostream is FASTA
      while (!rit.end()) {
          read = *rit;
          outf << read;
          rit++;
      }
      outf.close();
    • Pipeline configuration
      • Pipeline is a combination of Perl modules and scripts driven by XML configuration
        • Analysis combines a Protocol and parameters with a Reference Set
        • Reference Set is a pointer to one or more FASTA files
        • Protocol is a pointer to one or more executables and parameters
        • Instrument and Run XML are extracted from the SRF file.
        • analysis_controller converts XML documents into an MLDBM database
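The relationship between an Analysis, its Protocol, and its Reference Set can be sketched as a small XML document loaded into nested dictionaries. Everything about the schema below is an assumption made for illustration: the element and attribute names, the parameter names, the `indexDP` executable name, and the FASTA path are hypothetical and are not the actual HeliSphere configuration format.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; the real HeliSphere schema is not shown in the slides.
CONFIG_XML = """
<analysis name="dge_example">
  <protocol name="dge">
    <step executable="filterSMS"><param name="min_length" value="20"/></step>
    <step executable="indexDP"><param name="min_norm_score" value="4"/></step>
  </protocol>
  <reference_set name="yeast_transcripts">
    <fasta path="refs/yeast_transcripts.fa"/>
  </reference_set>
</analysis>
"""


def load_analysis(xml_text):
    """Turn the XML into plain dicts: a protocol (steps) plus a reference set."""
    root = ET.fromstring(xml_text.strip())
    protocol = root.find("protocol")
    steps = [
        {
            "executable": step.get("executable"),
            "params": {p.get("name"): p.get("value") for p in step.findall("param")},
        }
        for step in protocol.findall("step")
    ]
    ref = root.find("reference_set")
    return {
        "name": root.get("name"),
        "protocol": {"name": protocol.get("name"), "steps": steps},
        "reference_set": {
            "name": ref.get("name"),
            "fasta": [f.get("path") for f in ref.findall("fasta")],
        },
    }


analysis = load_analysis(CONFIG_XML)
```

The resulting nested dictionary plays roughly the role that a keyed database would play for the pipeline scripts: each step looks up its executable and parameters by name.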
    • DGE analysis
      • DGE pipeline features common processing steps
        • Counting of aligned transcripts
        • extractSMS, filterSMS remove poor quality sequences
          • Base addition order (CTAG) sequences
          • Quality score
          • Read length
          • Normalized alignment score
        • IndexDP alignment
          • Helicos-developed aligner
          • Mismatch-tolerant seeded alignment with multiple alignment modes
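The read filtering step can be sketched as a few sequence-level checks. The thresholds below (AT < 0.9, BAO dinuc < 0.7, trim leading Ts, length >= 20) are the ones listed on the length-distribution slide of this deck; how filterSMS actually computes them is not shown, so the definition of the base-addition-order dinucleotide fraction here is an assumption, and all function names are hypothetical. The quality-score filter is omitted because the slides do not describe it.

```python
BAO = "CTAG"  # base addition order from the slides
# Dinucleotides that follow the addition order cyclically: CT, TA, AG, GC (assumed definition).
BAO_DINUCS = {BAO[i] + BAO[(i + 1) % 4] for i in range(4)}


def at_fraction(seq):
    """Fraction of bases that are A or T."""
    return sum(base in "AT" for base in seq) / len(seq) if seq else 0.0


def bao_dinuc_fraction(seq):
    """Fraction of adjacent base pairs that follow the base addition order."""
    if len(seq) < 2:
        return 0.0
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return sum(pair in BAO_DINUCS for pair in pairs) / len(pairs)


def filter_read(seq, min_len=20, max_at=0.9, max_bao=0.7):
    """Return the trimmed read if it passes every check, otherwise None."""
    trimmed = seq.lstrip("T")  # trim leading Ts
    if len(trimmed) < min_len:
        return None
    if at_fraction(trimmed) >= max_at:
        return None
    if bao_dinuc_fraction(trimmed) >= max_bao:
        return None
    return trimmed


kept = filter_read("TTTGACGTTATGGGTGATGGTAGTAACG")
bao_artifact = filter_read("CTAG" * 6)
too_short = filter_read("ACGT")
low_complexity = filter_read("AT" * 15)
```

A read that simply echoes the addition order (CTAGCTAG...) is rejected by the dinucleotide check, which is the kind of artifact the base addition order filter appears to target.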
    • IndexDP [figure labels: 10mer word; template length 15, weight 10 w/sub]
      • On-the-fly indexes are constructed using template families
        • Families are arrangements of positions that accommodate a given template length, weight, and mismatch number (e.g. 20:16:2)
      • BLAST, et al. match on contiguous words and then extend to support fast, gapped alignments
      • IndexDP uses templates to accommodate mismatches in the words
      [figure: sequence pairs ACGTACGTACCCGTAAAG / ACGTACATACCCGTATTTACTTTACGT (one G/A substitution) and ACGTACATACCCGTAAAG / ACGTACATACCCGTATTTACTTTACGT (exact match)]
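The difference between a contiguous word and a template can be shown with a small spaced-seed lookup. This is a sketch of the general technique only, not IndexDP's implementation: the particular template string (length 15, weight 10) is a hypothetical choice, template families are not modeled, and the function names are illustrative. The two sequences are the ones that appear on this slide.

```python
from collections import defaultdict

# Hypothetical template: length 15, weight 10 ('1' = must match, '0' = ignored).
TEMPLATE = "111011010110011"


def masked_key(seq, start, template):
    """Concatenate the bases of seq that fall under the '1' positions."""
    return "".join(seq[start + i] for i, c in enumerate(template) if c == "1")


def build_index(reference, template):
    """Map every masked key in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(template) + 1):
        index[masked_key(reference, pos, template)].append(pos)
    return index


def seed_hits(read, reference, template):
    """Return (read_offset, reference_position) pairs whose masked keys agree."""
    index = build_index(reference, template)
    hits = []
    for offset in range(len(read) - len(template) + 1):
        for ref_pos in index.get(masked_key(read, offset, template), []):
            hits.append((offset, ref_pos))
    return hits


REFERENCE = "ACGTACATACCCGTATTTACTTTACGT"
READ = "ACGTACGTACCCGTAAAG"  # differs from the reference at one base (G vs. A)

contiguous = seed_hits(READ, REFERENCE, "1" * 10)  # plain 10mer words
spaced = seed_hits(READ, REFERENCE, TEMPLATE)      # template with ignored positions
print(contiguous, spaced)
```

With these inputs no contiguous 10mer of the read occurs in the reference, while the template still seeds the read at reference position 0 because the substituted base falls on an ignored position.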
    • IndexDP
      • After template matching, the bitHPDP core performs a dynamic programming algorithm. Supported alignment flavors:
        • Smith-Waterman
        • Global-Local. Full length of the read against a region of the reference. End gaps against the reference have zero penalty
        • Local-Local. End gaps have no penalty
        • Global-Global. Needleman-Wunsch
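As an illustration of the Global-Local flavor, the sketch below aligns the full read against a region of the reference with free end gaps on the reference side, a scheme often called semi-global alignment. It is a textbook dynamic program, not the bitHPDP core; the scoring values and the function name are assumptions, and it returns only the score and end position rather than a traceback.

```python
def global_local_score(read, ref, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of `read` to some region of `ref`.

    Returns (score, end), where `end` is the reference index just past the
    aligned region. Skipping a reference prefix or suffix costs nothing.
    """
    m = len(ref)
    prev = [0] * (m + 1)  # row 0: any reference prefix may be skipped for free
    for i in range(1, len(read) + 1):
        # Read bases placed before the reference starts must pay gap penalties.
        cur = [prev[0] + gap] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # Last row: the read is fully consumed; the reference suffix is free.
    end = max(range(m + 1), key=lambda j: prev[j])
    return prev[end], end


exact = global_local_score("ACGT", "TTACGTTT")
with_sub = global_local_score("ACAT", "TTACGTTT")
```

Changing which borders of the matrix are initialized to zero and where the maximum is taken is what separates this flavor from Needleman-Wunsch (no free borders) and Smith-Waterman (scores floored at zero, maximum taken anywhere).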
    • QC analysis
      • errorTool uses sample alignments to reference to calculate error rates
        • Uses bitHPDP core
        • Breaks down error rates on a number of dimensions (by nucleotide, by substitution type, by reference position, by image (X,Y), by incorporation cycle, etc.)
        • Error rates of < 1% are seen with Two Pass Sequencing; single pass is 7% or less
      • lengthTool calculates length distribution and termination and loss statistics
        • Can provide length as aligned
        • Termination and loss indicate strands that stop incorporating bases
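Error rates of the kind errorTool reports can be tallied from gapped pairwise alignments. The sketch below is a simplified stand-in rather than errorTool itself: it assumes each alignment is a pair of equal-length strings with "-" for gaps, and it normalizes by the number of aligned reference bases, which is one possible convention that the slides do not confirm.

```python
def error_counts(ref_aln, read_aln):
    """Count substitutions, deletions and insertions in one gapped alignment."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    counts = {"sub": 0, "del": 0, "ins": 0, "ref_bases": 0}
    for r, q in zip(ref_aln, read_aln):
        if r == "-" and q == "-":
            continue                 # padding column, carries no information
        if r == "-":
            counts["ins"] += 1       # base present in the read only
            continue
        counts["ref_bases"] += 1
        if q == "-":
            counts["del"] += 1       # reference base missing from the read
        elif q != r:
            counts["sub"] += 1
    return counts


def error_rates(alignments):
    """Pool counts over many alignments and convert them to rates."""
    total = {"sub": 0, "del": 0, "ins": 0, "ref_bases": 0}
    for ref_aln, read_aln in alignments:
        for key, value in error_counts(ref_aln, read_aln).items():
            total[key] += value
    n = total["ref_bases"]
    rates = {k: total[k] / n for k in ("sub", "del", "ins")}
    rates["total"] = sum(rates.values())
    return rates


rates = error_rates([("ACGT-ACGT", "ACTTAAC-T")])
```

Breaking the same counts down by nucleotide, reference position, or incorporation cycle would amount to keying the counters on those extra fields.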
    • Length distributions (yeast DGE experiment)
      • Raw: unfiltered reads, 6mer and above
      • Filtered: quality score filter, AT < 0.9, BAO dinuc < 0.7, trim leading Ts, length >= 20, alignment against BAO, P102
      • Aligned: normalized score >= 4
    • Error rates and alignments (yeast DGE experiment)
      • Error rates were assessed using samples of alignments with normalized alignment score ≥ 4 to a high-expresser (YLR110C/CCW12)
      • Total: 6.55%; Sub: 0.44%; Del: 4.72%; Ins: 1.39%
      • Reference: GACGT-TATGGGTGATGGTAGTAACGATGATGACGAAGA-TAATGTAGACCCGCTGC-ACCGTGCTAAACAATCC
      • Consensus: GACGT-TATGAGTGATGGTAGTAACGATGATGACGAAGA-TAATGTAGACCCGCTGC-ATCGTGCTAAACAATCC
      • [figure: individual reads aligned beneath the reference and consensus]
    • Acknowledgments
      • Ed Thayer
      • Eldar Giladi
      • John Healy
      • Doron Lipson
      • Keith Moulton
      • Steve Roels
      Original research shouldn’t start with copies
    • Hybrid development model [figure labels: source code repository; read-only source code subset; user-owned packages; secure sync; company firewall]
    • Typical closed source development [figure labels: source code repository; company firewall]
    • Typical open source project [figure labels: source code repository; direct commit; checkout; submit patch via email]
    • HeliScope and HeliSphere