Kitzmiller Openhelisphereproject Bosc2008
Published in: Technology, News & Politics

Transcript

  • 1. The Open HeliSphere™ project: true open source from the inventors of True Single Molecule Sequencing (tSMS™). Aaron Kitzmiller, BOSC 2008
  • 2. Agenda
    • Introduction to the HeliScope Single Molecule Sequencer
    • Helicos and Open Source
    • The Open HeliSphere project
    • HeliSphere code
  • 3. Single Molecule Sequencing by Synthesis: Hybridize Primer (~1/µm²) [Figure: strand diagram]
  • 4. Single Molecule Sequencing by Synthesis: Extend 'G' [Figure: strand diagram]
  • 5. SM Sequence by Synthesis: Wash [Figure: strand diagram]
  • 6. SM Sequence by Synthesis: Image [Figure: strand diagram]
  • 7. SM Sequence by Synthesis: Cleave [Figure: strand diagram]
  • 8. Flow Cell Imaging
    • 1 run => 2 flow cells
    • 1 flow cell => 25 channels
    • 1 channel => 1000 fields of view (FOV)
    • 1 FOV => 4 images
    • 8-10 million usable strands / channel
    • [Figure labels: Flow Cell; 25 Channels (1.6 x 90 mm); ~12 x 12 cm; Flow cell volume = 180 µL]
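The hierarchy on this slide multiplies out to per-run totals. The following is a small arithmetic sketch only: the struct and function names are hypothetical, and the counts are the ones quoted on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Counts taken from the slide; the type and field names are illustrative only.
struct RunGeometry {
    std::int64_t flow_cells_per_run = 2;
    std::int64_t channels_per_flow_cell = 25;
    std::int64_t fovs_per_channel = 1000;
    std::int64_t images_per_fov = 4;
};

// Number of images implied by the run -> flow cell -> channel -> FOV hierarchy.
std::int64_t images_per_run(const RunGeometry& g) {
    return g.flow_cells_per_run * g.channels_per_flow_cell *
           g.fovs_per_channel * g.images_per_fov;
}

// Usable strands per run for a given per-channel yield.
std::int64_t strands_per_run(const RunGeometry& g, std::int64_t strands_per_channel) {
    assert(strands_per_channel >= 0);
    return g.flow_cells_per_run * g.channels_per_flow_cell * strands_per_channel;
}
```

With the default geometry, `images_per_run` gives 200,000 images, and `strands_per_run` gives 400 to 500 million usable strands for the quoted 8-10 million strands per channel.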
  • 9. Raw data collection [Figure: rows of per-strand base calls (C, T, A, G or "-") under a repeating C T A G cycle header]
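One way to read the raw data figure is as a trace of cyclic base additions. The sketch below simulates that under two assumptions that are not stated on the slide: each addition incorporates at most one base per strand, and "-" marks an addition with no incorporation. The function names are hypothetical; the C, T, A, G order is the base addition order named elsewhere in the deck.

```cpp
#include <cassert>
#include <string>

// Simulate cyclic base addition for one strand.
// `synthesized` is the sequence of the strand being built (the read).
// Assumption: each addition incorporates at most one base per strand.
std::string flow_trace(const std::string& synthesized,
                       const std::string& order = "CTAG",
                       std::size_t max_additions = 1000) {
    assert(!order.empty());
    std::string trace;
    std::size_t next = 0;  // index of the next base still to be incorporated
    for (std::size_t i = 0; i < max_additions && next < synthesized.size(); ++i) {
        char offered = order[i % order.size()];
        if (synthesized[next] == offered) {
            trace += offered;  // incorporation observed in this addition
            ++next;
        } else {
            trace += '-';      // nothing incorporated in this addition
        }
    }
    return trace;
}

// Recover the read by dropping the additions without incorporation.
std::string read_from_trace(const std::string& trace) {
    std::string read;
    for (char c : trace) {
        if (c != '-') read += c;
    }
    return read;
}
```

For example, `flow_trace("CAGT")` yields `C-AG-T`, and `read_from_trace` turns that back into `CAGT`.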
  • 10. HeliScope and HeliSphere
  • 11. Helicos and Open Source
    • Helicos is an instrument company
    • The diversity of bioinformatics applications is too large for us to address internally
    • Open source bioinformatics applications benefit everyone, including instrument developers
    • Helicos bioinformatics applications
      • Internal development
      • Academic and industrial partnerships
      • Tool vendor partnerships
      • Open Source
        • Including contributions by Helicos to other projects (BioPerl, Bioconductor, etc.)
  • 12. The Open HeliSphere project
    • Pre-launch – TODAY
      • SVN trunk checkout
      • Tarball download
      • openhelisphere-announce, openhelisphere-devel
      • Wiki documentation
      • Datasets
    • Full product launch
      • Patch submission to HeliSphere core
      • Bug tracking
      • HeliSphere contrib repository
      • http://open.helicosbio.com
  • 13. The Open HeliSphere project
    • License
      • Dual GPL + commercial
    • Infrastructure
      • Mediawiki-driven website (semantically enhanced)
      • SourceForge mail, tarballs, Subversion source code control
      • http://open.helicosbio.com
  • 14. Bioinformatics Pipeline for Digital Gene Expression
  • 15. SRF file processing
    • HeliScope sequencers create SRF files
      • Consortium-driven binary read container
        • Strong Sanger involvement
        • Used for submission to the NCBI Short Read Archive
      • Reads are stored in ZTR blocks
      • Instrument and run information is stored in an XML document
      • SRF processing converts reads into a smaller, Helicos-oriented format called SMS.
        • Perl scripts run the srf2sms binary
        • SMS places reads into blocks that are indexed by key Helicos data fields (flowcell, channel, position)
        • Extracted instrument and run XML are used for pipeline configuration
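The idea of blocks indexed by (flowcell, channel, position) can be pictured with an ordered map. This is a hypothetical in-memory analogue for illustration; it says nothing about the actual SMS on-disk layout, and `ReadBlocks` and `reads_for_channel` are invented names.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical in-memory analogue of a block index; not the SMS file format.
using BlockKey = std::tuple<int, int, int>;  // (flowcell, channel, position)

class ReadBlocks {
public:
    void add(int flowcell, int channel, int position, const std::string& read) {
        assert(flowcell >= 0 && channel >= 0);
        blocks_[BlockKey{flowcell, channel, position}].push_back(read);
    }

    // Collect all reads for one flowcell/channel, ordered by position.
    std::vector<std::string> reads_for_channel(int flowcell, int channel) const {
        std::vector<std::string> out;
        // Tuples compare lexicographically, so one channel is a contiguous key range.
        auto it = blocks_.lower_bound(
            BlockKey{flowcell, channel, std::numeric_limits<int>::min()});
        for (; it != blocks_.end(); ++it) {
            if (std::get<0>(it->first) != flowcell ||
                std::get<1>(it->first) != channel) {
                break;
            }
            out.insert(out.end(), it->second.begin(), it->second.end());
        }
        return out;
    }

private:
    std::map<BlockKey, std::vector<std::string>> blocks_;
};
```

Because the key is ordered, selecting one channel is a range scan rather than a pass over every read.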
  • 16. SMS file
    • SMS is a general binary data container
      • Manipulate with executables: smsls, sms2txt, srf2sms, filterSMS, extractSMS, mergeSMS
      • Access data directly via C++ iterator API
    read_iterator<read_record> rit(smsfile);
    read_record read;

    // query the SMS file for the desired flowcell/channel
    rit.select_channel(flowcell, channel);

    // iterate over the result set; the default output format to an ostream is FASTA
    while (!rit.end()) {
        read = *rit;
        outf << read;
        rit++;
    }
    outf.close();
  • 17. Pipeline configuration
    • Pipeline is a combination of Perl modules and scripts driven by XML configuration
      • Analysis combines a Protocol and parameters with a Reference Set
      • Reference Set is a pointer to one or more FASTA files
      • Protocol is a pointer to one or more executables and parameters
      • Instrument and Run XML are extracted from the SRF file.
      • analysis_controller converts XML documents into an MLDBM database
  • 18. DGE analysis
    • DGE pipeline features common processing steps
      • Counting of aligned transcripts
      • extractSMS, filterSMS remove poor quality sequences
        • Base addition order (CTAG) sequences
        • Quality score
        • Read length
        • Normalized alignment score
      • IndexDP alignment
        • Helicos-developed aligner
        • Mismatch-tolerant seeded alignment with multiple alignment modes
  • 19. IndexDP
    • On-the-fly indexes are constructed using template families
      • Families are arrangements of positions that accommodate a given template length, weight, and mismatch number (e.g. 20:16:2)
    • BLAST, et al. match on contiguous words and then extend to support fast, gapped alignments
    • IndexDP uses templates to accommodate mismatches in the words
    • [Figure: example sequences contrasting a 10mer word match with a template match (template length 15, weight 10, w/sub)]
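The difference between a contiguous word and a template can be sketched with a spaced-seed lookup. The template string below (length 15, weight 10) is a made-up example rather than one of the IndexDP template families, and all function names are hypothetical. Only positions marked '1' must match, so a substitution at a '0' position still produces a seed hit.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical template: length 15, weight 10 ('1' = position must match).
// Illustration only, not one of the IndexDP template families.
const std::string kTemplate = "111011011011001";

// Build the lookup key by keeping only the bases at '1' positions.
std::string spaced_key(const std::string& seq, std::size_t pos,
                       const std::string& tmpl) {
    assert(pos + tmpl.size() <= seq.size());
    std::string key;
    for (std::size_t i = 0; i < tmpl.size(); ++i) {
        if (tmpl[i] == '1') key += seq[pos + i];
    }
    return key;
}

using SeedIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

// Index every window of the reference under its spaced key.
SeedIndex build_index(const std::string& ref, const std::string& tmpl) {
    SeedIndex index;
    for (std::size_t p = 0; p + tmpl.size() <= ref.size(); ++p) {
        index[spaced_key(ref, p, tmpl)].push_back(p);
    }
    return index;
}

// Reference positions whose window agrees with the start of the read
// at every '1' position of the template.
std::vector<std::size_t> seed_hits(const SeedIndex& index, const std::string& read,
                                   const std::string& tmpl) {
    if (read.size() < tmpl.size()) return {};
    auto it = index.find(spaced_key(read, 0, tmpl));
    return it == index.end() ? std::vector<std::size_t>{} : it->second;
}
```

With reference `TTACGTACGTACCCGTAAAG` and read `ACGTACATACCCGTA` (one substitution, at a '0' position of the template), the read still seeds at reference position 2, although no contiguous 10mer of the read occurs in the reference.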
  • 20. IndexDP
    • After template matching, the bitHPDP core performs a dynamic programming algorithm. Supported alignment flavors:
      • Smith-Waterman
      • Global-Local. Full length of the read against a region of the reference. End gaps against the reference have zero penalty
      • Local-Local. End gaps have no penalty
      • Global-Global. Needleman-Wunsch
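The alignment flavors differ mainly in which end gaps are free. The sketch below uses a textbook dynamic programming recurrence with assumed scores (match +1, mismatch -1, gap -1) to contrast Global-Global with Global-Local. It is not the bitHPDP implementation, and `align_score` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Textbook DP with linear gap costs; scores are assumptions for illustration.
// free_ref_end_gaps = false: Global-Global (Needleman-Wunsch).
// free_ref_end_gaps = true : Global-Local (whole read vs. a region of the reference).
int align_score(const std::string& read, const std::string& ref,
                bool free_ref_end_gaps) {
    const int match = 1, mismatch = -1, gap = -1;
    const std::size_t m = read.size(), n = ref.size();
    assert(m > 0 && n > 0);

    std::vector<std::vector<int>> h(m + 1, std::vector<int>(n + 1, 0));
    // Read bases before the alignment starts always cost a gap.
    for (std::size_t i = 1; i <= m; ++i) h[i][0] = static_cast<int>(i) * gap;
    // Skipping leading reference bases is free only in Global-Local mode.
    for (std::size_t j = 1; j <= n; ++j) {
        h[0][j] = free_ref_end_gaps ? 0 : static_cast<int>(j) * gap;
    }

    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            int diag = h[i - 1][j - 1] + (read[i - 1] == ref[j - 1] ? match : mismatch);
            int up = h[i - 1][j] + gap;    // read base aligned to a gap
            int left = h[i][j - 1] + gap;  // reference base aligned to a gap
            h[i][j] = std::max({diag, up, left});
        }
    }

    if (!free_ref_end_gaps) return h[m][n];
    // Trailing reference bases are free: take the best cell in the last row.
    return *std::max_element(h[m].begin(), h[m].end());
}
```

Aligning read `ACGT` to reference `TTACGTTT` scores 4 in Global-Local mode and 0 in Global-Global mode, where the four unmatched reference bases each cost a gap.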
  • 21. QC analysis
    • errorTool uses sample alignments to reference to calculate error rates
      • Uses bitHPDP core
      • Breaks down error rates on a number of dimensions (by nucleotide, by substitution type, by reference position, by image (X,Y), by incorporation cycle, etc.)
      • Error rates of < 1% are seen with Two Pass Sequencing; single pass is 7% or less
    • lengthTool calculates the length distribution and termination and loss (term+loss) statistics
      • Can provide length as aligned
      • Termination and loss indicate strands that stop incorporating bases
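Counting errors from an alignment can be sketched as a column-by-column tally over a gapped reference row and read row. This is not errorTool itself: the names are hypothetical, and normalizing by the number of reference bases is an assumption about the denominator.

```cpp
#include <cassert>
#include <string>

struct ErrorCounts {
    int ref_bases = 0;  // alignment columns that contain a reference base
    int sub = 0;        // read base differs from the reference base
    int del = 0;        // reference base aligned to a gap in the read
    int ins = 0;        // read base aligned to a gap in the reference

    int total() const { return sub + del + ins; }

    // Assumed normalization: errors per aligned reference base.
    double rate(int n) const {
        return ref_bases ? static_cast<double>(n) / ref_bases : 0.0;
    }
};

// Tally errors from a gapped pairwise alignment ('-' marks a gap).
ErrorCounts count_errors(const std::string& ref_row, const std::string& read_row) {
    assert(ref_row.size() == read_row.size());
    ErrorCounts c;
    for (std::size_t i = 0; i < ref_row.size(); ++i) {
        const char r = ref_row[i];
        const char q = read_row[i];
        if (r == '-' && q == '-') continue;  // padding column, nothing aligned
        if (r != '-') ++c.ref_bases;
        if (r == '-') {
            ++c.ins;
        } else if (q == '-') {
            ++c.del;
        } else if (r != q) {
            ++c.sub;
        }
    }
    return c;
}
```

For the rows `ACGT-ACGTA` and `AC-TTACCTA` this counts one substitution, one deletion, and one insertion over nine reference bases.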
  • 22. Length distributions (yeast DGE experiment)
    • Raw: unfiltered reads, 6mer and above
    • Filtered: quality score filter, AT < 0.9, BAO dinuc < 0.7, trim leading Ts, length >= 20, alignment against BAO, P102
    • Aligned: normalized score >= 4
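Part of the "Filtered" step can be sketched as a read-level predicate. The thresholds (AT < 0.9, length >= 20) come from the slide, but their exact definitions are not given. Here "AT" is assumed to mean the fraction of A and T bases, and trimming is assumed to happen before the length and AT checks. The quality score, BAO, and P102 filters are left out, and all names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Thresholds quoted on the slide; their interpretation here is an assumption.
struct FilterParams {
    double max_at_fraction = 0.9;  // keep reads with AT fraction below this
    std::size_t min_length = 20;   // keep reads at least this long after trimming
};

// Remove the run of T bases at the start of the read.
std::string trim_leading_ts(const std::string& read) {
    const std::size_t first = read.find_first_not_of('T');
    return first == std::string::npos ? std::string() : read.substr(first);
}

// Fraction of bases that are A or T (assumed meaning of "AT").
double at_fraction(const std::string& read) {
    if (read.empty()) return 0.0;
    std::size_t at = 0;
    for (char c : read) {
        if (c == 'A' || c == 'T') ++at;
    }
    return static_cast<double>(at) / read.size();
}

// Returns true and writes the trimmed read to `out` if the read survives.
bool passes_filter(const std::string& read, const FilterParams& p, std::string& out) {
    assert(p.max_at_fraction >= 0.0 && p.max_at_fraction <= 1.0);
    const std::string trimmed = trim_leading_ts(read);
    if (trimmed.size() < p.min_length) return false;
    if (at_fraction(trimmed) >= p.max_at_fraction) return false;
    out = trimmed;
    return true;
}
```

A read of three leading Ts followed by 22 mixed bases passes and comes back trimmed; a 30-base poly-A read is rejected by the AT check, and an 8-base read by the length check.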
  • 23. Error rates and alignments (yeast DGE experiment)
    • Error rates were assessed using samples of alignments with normalized alignment score ≥ 4 to a high-expresser (YLR110C/CCW12)
    • Error rates: Total 6.55%, Sub 0.44%, Del 4.72%, Ins 1.39%
    • Reference: GACGT-TATG G GTGATGGTAGTAACGATGATGACGAAGA-TAATGTAGACCCGCTGC-A C CGTGCTAAACAATCC
    • Consensus: GACGT-TATG A GTGATGGTAGTAACGATGATGACGAAGA-TAATGTAGACCCGCTGC-A T CGTGCTAAACAATCC
    • [Figure: individual reads aligned beneath the consensus]
  • 24. Acknowledgments
    • Ed Thayer
    • Eldar Giladi
    • John Healy
    • Doron Lipson
    • Keith Moulton
    • Steve Roels
    Original research shouldn’t start with copies
  • 25. Hybrid development model [Diagram labels: Source code repository; Read-only source code subset; User-owned packages; Secure sync; Company firewall]
  • 26. Typical closed source development [Diagram labels: Source code repository; Company firewall]
  • 27. Typical open source project [Diagram labels: Source code repository; Direct commit; Checkout; Submit patch via email]
  • 28. HeliScope and HeliSphere
