
Introduction to second generation sequencing

  • 1.
    [MIT] Introduction to 2GS data analysis
    Drink faster!
    June 23, 2011
  • 2.
    Production Informatics and Bioinformatics
    Basic Production Informatics: produce raw sequence reads
    Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs)
    Bioinformatics Research: analyze the data; uncover the biological meaning
    Per one-flowcell project
  • 3.
    Brief history of sequencing
    First Generation: Sanger sequencing
    Second Generation: amplified molecule sequencing
    Third Generation: single molecule sequencing
    *** Discussion about category
  • 4.
    What steps are involved in sequencing?
    Sequencing by synthesis (SBS) technology
    Fragmentation
    Library generation
    Amplification
    Sequencing
    Analysis
    Illumina Marketing: “3h 10 minutes wet-lab, 30 minutes dry lab”
  • 5.
    Illumina sequencing: Library + Amplification
    Source: “Illumina Sequencing Technology” booklet
  • 6.
    Illumina Sequencing: Synthesis + Imaging
    Source: “Illumina Sequencing Technology” booklet
  • 7.
    Output: 1.5 Terabyte of data
    Inspired by anzska information booklet
  • 8.
    Sequencer Output Conversion: Production Informatics
    1.5 TB data: 6 billion clusters with 100 bp reads = 600 billion data points
    HiSeq, CASAVA, … (× read length)
    For HiSeq: images are converted to flat files (*.bcl or *.cif)
    Image credits: visualpharm.com, Maysoft
  • 9.
    Multiplexing
    6 billion reads: 750 million reads per lane
    Currently 12-plex (soon 96-plex):
    One run
    Image credit: Oliver Twardowski
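The run figures quoted on slides 8 and 9 can be checked with a few lines of arithmetic. The sketch below assumes an 8-lane flowcell (implied by 6 billion reads at 750 million reads per lane) and a perfectly even split of reads across multiplexed samples, which real runs only approximate. All variable and function names are illustrative.

```python
# Back-of-the-envelope throughput for one run, using the slide figures.
CLUSTERS_PER_RUN = 6_000_000_000   # 6 billion clusters (reads) per run
READ_LENGTH_BP = 100               # 100 bp reads
READS_PER_LANE = 750_000_000       # 750 million reads per lane

# Assumption: the lane count is inferred from the two figures above.
lanes = CLUSTERS_PER_RUN // READS_PER_LANE

# One data point per base call.
data_points = CLUSTERS_PER_RUN * READ_LENGTH_BP


def reads_per_sample(plex: int) -> float:
    """Reads per sample in one lane, assuming a perfectly even split."""
    return READS_PER_LANE / plex


if __name__ == "__main__":
    print(f"lanes per run: {lanes}")
    print(f"data points per run: {data_points:,}")
    for plex in (12, 96):
        print(f"{plex}-plex: {reads_per_sample(plex):,.0f} reads per sample per lane")
```

Under these assumptions, moving from 12-plex to 96-plex cuts the reads available per sample by a factor of eight, which is the trade-off to weigh against the coverage a project needs.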
  • 10.
    Demultiplexing
    CASAVA, … (× samples × read length)
    Image credit: visualpharm.com
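Conceptually, demultiplexing assigns each read to a sample according to its index (barcode) read. The sketch below is not how CASAVA implements it; the sample sheet, sample names, and the one-mismatch tolerance are hypothetical and chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical sample sheet: 6-base index sequence -> sample name.
SAMPLE_SHEET = {
    "ATCACG": "sample_A",
    "CGATGT": "sample_B",
    "TTAGGC": "sample_C",
}


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read: str, max_mismatches: int = 1) -> str:
    """Return the sample whose barcode matches within max_mismatches.

    Index reads that match no barcode, or more than one, are
    reported as 'Undetermined'.
    """
    hits = [
        sample
        for barcode, sample in SAMPLE_SHEET.items()
        if len(barcode) == len(index_read)
        and hamming(barcode, index_read) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else "Undetermined"


def demultiplex(reads):
    """Group (index_read, sequence) pairs by their assigned sample."""
    bins = defaultdict(list)
    for index_read, sequence in reads:
        bins[assign_sample(index_read)].append(sequence)
    return dict(bins)


if __name__ == "__main__":
    demo = [("ATCACG", "ACGT"), ("CGATGT", "TTTT"), ("NNNNNN", "GGGG")]
    for sample, sequences in demultiplex(demo).items():
        print(sample, len(sequences))
```

Allowing a mismatch rescues reads with a single sequencing error in the index, but it is only safe when the barcodes in a lane differ from each other at enough positions.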
  • 11.
    CASAVA 1.8.0 program call
    configureBclToFastq.pl \
        --input-dir Data/Intensities/BaseCalls/ \
        --output-dir Data/Unaligned \
        --sample-sheet SampleSheet.csv \
        --use-bases-mask y100,I6nn,Y100 > file.log 2>&1
    cd Data/Unaligned
    qsub -pe make 16 -j y -v $MYPATH -o qsub.out -cwd -N fastq -b y \
        make -j 16
    Runtime: ~6 h
  • 12.
    Fastq files
    @HWI-ST301_0112:1:1:1169:2044#0/1
    CCATAAGGCCACGTATTTTGCAAGCTATTTAACTGGCGGCGAT
    +HWI-ST301_0112:1:1:1169:2044#0/1
    dddc\dd^dd`acacdacd`ecdedabdcdddcc\``\`bTa\
    36 36 36 35 28 …
    ASCII   @ .. ~
    DEC     64 .. 126
    PHRED   0 .. 62
    Phred scores are estimates only!
    Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010 Apr;38(6):1767-71. PMID: 20015970
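The quality line of the record on slide 12 can be decoded by hand: each character's ASCII code minus 64 gives the Phred score, and a score Q corresponds to an estimated error probability of 10^(-Q/10). The following is a minimal sketch, assuming the offset of 64 shown on the slide (Sanger-style FASTQ uses an offset of 33 instead, as described by Cock et al.). The function names are illustrative.

```python
PHRED_OFFSET = 64  # offset shown on the slide: ASCII 64..126 -> Phred 0..62


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> list[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Quality line of the example record (backslashes escaped for Python).
    quality = "dddc\\dd^dd`acacdacd`ecdedabdcdddcc\\``\\`bTa\\"
    scores = decode_quality(quality)
    print(scores[:5])
    print(f"lowest score: {min(scores)}, "
          f"estimated error probability: {error_probability(min(scores)):.4f}")
```

Decoding with the wrong offset shifts every score by 31, so it is worth confirming which FASTQ variant a file uses before interpreting its qualities.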
  • 13.
    Fastq – PHRED quality
    Pathological
  • 14.
    Fastq: Quality control
    Base-pair quality score
    Adapter contamination
    Uneven amplification
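Two of the checks on slide 14, per-position base quality and adapter contamination, can be approximated directly from a FASTQ file. The sketch below is a simplified illustration, not a replacement for a dedicated QC tool. It assumes four-line FASTQ records without blank lines, the offset of 64 from slide 12, and an adapter sequence passed in by the user; the default `AGATCGGAAGAGC` is only a placeholder and should be replaced with the adapter of the library kit actually used.

```python
import io
from typing import Iterator, TextIO


def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the parser
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def qc_summary(handle: TextIO, adapter: str = "AGATCGGAAGAGC", offset: int = 64):
    """Return (mean Phred score per read position, fraction of reads with adapter)."""
    sums: list[int] = []
    counts: list[int] = []
    n_reads = 0
    adapter_reads = 0
    for _name, sequence, quality in read_fastq(handle):
        n_reads += 1
        if adapter in sequence:
            adapter_reads += 1
        for i, ch in enumerate(quality):
            if i == len(sums):  # first read that reaches this position
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - offset
            counts[i] += 1
    mean_quality = [s / c for s, c in zip(sums, counts)]
    adapter_fraction = adapter_reads / n_reads if n_reads else 0.0
    return mean_quality, adapter_fraction


if __name__ == "__main__":
    # Tiny made-up FASTQ used only to exercise the functions.
    demo = "@r1\nACGT\n+\nhhhh\n@r2\nAGATCGGAAGAGCAA\n+\n" + "d" * 15 + "\n"
    mean_quality, adapter_fraction = qc_summary(io.StringIO(demo))
    print([round(q, 1) for q in mean_quality])
    print(f"reads containing the adapter: {adapter_fraction:.0%}")
```

A drop in mean quality toward the end of the read, or a noticeable adapter fraction, would suggest trimming before mapping.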
  • 15.
    Three things to remember
    Don’t be fooled by marketing
    Fastq files are not directly usable
    Basic run QC can be done from the fastq file
    “All modern genomics projects are now bottlenecked at the stage of data analysis rather than data production”
    Ewan Birney, European Bioinformatics Institute, Wellcome Trust
    David S. Roos. Bioinformatics--Trying to Swim in a Sea of Data. Science 16 February 2001: Vol. 291 no. 5507 pp. 1260-1261. DOI: 10.1126/science.291.5507.1260
  • 16.
    Next Week:
    Abstract: This session will focus on identifying SNPs from whole genome, exome capture, or targeted resequencing data. The approaches of mapping, local realignment, recalibration, SNP calling, and SNP recalibration will be introduced and quality metrics discussed.
  • 17.
  • 18.
    Brief history of sequencing
    First Generation: Sanger sequencing
    Second Generation: amplified molecule sequencing
    Third Generation: single molecule sequencing
    *** Discussion about category
  • 19.
    Helicos
    true Single Molecule Sequencing (tSMS)™ technology
    Sequencing by synthesis, but much more sensitive, so no amplification is needed
  • 20.
    Life Technologies - Ion Torrent
    A hydrogen ion is released by the incorporation of a nucleotide, which is measured by a semiconductor
    The base is identified by which nucleotide wash cycle the signal coincides with
  • 21.
    PacBio
    Immobilized polymerase at the bottom of a well
    Fluorescent nucleotides float around and, if they are incorporated, they are held still for tens of milliseconds, which is the signal that is recorded
    No upper limit on the length
    http://www.pacificbiosciences.com/smrt-biology/smrt-technology?page=4
  • 22.
    Nanopore
    The molecule is sucked through a pore, and the change in the membrane charge due to the different nucleotides is recorded.
    http://www.nanoporetech.com/sections/index/82

Editor's Notes

  • #2 http://2.bp.blogspot.com/_BPr6hpMG0tg/TSZdkYDcRvI/AAAAAAAAAjY/ReScIkWNySg/s1600/drink.jpg
  • #4 PCR in which a labeled nucleotide is incorporated at random and terminates the reaction. The resulting fragments of different lengths are then separated on a gel, and the sequence can be read manually from the labeled end nucleotides.
  • #5 Some of you have already done some library prep, so you have a feel for how realistic 3h 10 min is for this. This seminar goes through the analysis steps that are required to answer the question the data was generated for. So by the end of this seminar series you’ll also have a feel for how realistic 30 minutes is for the data analysis.
  • #19 PCR in which a labeled nucleotide is incorporated at random and terminates the reaction. The resulting fragments of different lengths are then separated on a gel, and the sequence can be read manually from the labeled end nucleotides.
  • #20 http://www.helicosbio.com/Technology/TrueSingleMoleculeSequencing/tabid/64/Default.aspx
  • #23 http://www.nanoporetech.com/sections/index/82