20110114 Next Generation Sequencing Course




Next Generation Sequencing course

2011-01-14 Nantes (By the way, I remember where I found this idea of using Star-Trek: it came from a presentation of the GATK team)




Usage Rights

CC Attribution-ShareAlike License

  • Samples consisting of longer fragments are first sheared into a random library of 100-300 base-pair long fragments. After fragmentation the ends of the obtained DNA-fragments are repaired and an A-overhang is added at the 3'-end of each strand. Afterwards, adaptors which are necessary for amplification and sequencing are ligated to both ends of the DNA-fragments. These fragments are then size selected and purified.
  • From: Next-generation DNA sequencing. Jay Shendure & Hanlee Ji. Nature Biotechnology 26, 1135-1145 (2008). Published online: 9 October 2008. doi:10.1038/nbt1486. (a) The 454, the Polonator and SOLiD platforms rely on emulsion PCR to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor-flanked shotgun library (shown as gold and turquoise adaptors flanking unique inserts) is PCR amplified (that is, multi-template PCR, not multiplex PCR, as only a single primer pair is used, corresponding to the gold and turquoise adaptors) in the context of a water-in-oil emulsion. One of the PCR primers is tethered to the surface (5'-attached) of micron-scale beads that are also included in the reaction. A low template concentration results in most bead-containing compartments having either zero or one template molecule present. In productive emulsion compartments (where both a bead and template molecule is present), PCR amplicons are captured to the surface of the bead. After breaking the emulsion, beads bearing amplification products can be selectively enriched. Each clonally amplified bead will bear on its surface PCR products corresponding to amplification of a single molecule from the template library. (b) The Solexa technology relies on bridge PCR (aka 'cluster PCR') to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor-flanked shotgun library is PCR amplified, but both primers densely coat the surface of a solid substrate, attached at their 5' ends by a flexible linker. As a consequence, amplification products originating from any given member of the template library remain locally tethered near the point of origin. At the conclusion of the PCR, each clonal cluster contains approximately 1,000 copies of a single member of the template library. Accurate measurement of the concentration of the template library is critical to maximize the cluster density while simultaneously avoiding overcrowding.
  • During sequencing, the huge number of generated clusters is sequenced simultaneously. The DNA templates are copied base by base using the four nucleotides (ACGT), which are fluorescently labeled and reversibly terminated. After each synthesis step, the clusters are excited by a laser, which causes fluorescence of the last incorporated base. After that, the fluorescence label and the blocking group are removed, allowing the addition of the next base. The fluorescence signal after each incorporation step is captured by a built-in camera, producing images of the flow cell.
  • The emPCR amplifies each fragment several million times. After amplification the emulsion shell is broken and the clonally amplified beads are ready for loading onto the fibre-optic PicoTiterDevice for sequencing.
  • The template strand is represented in red, the annealed primer is shown in black and the DNA polymerase is shown as the green oval. Incorporation of the complementary base (the blue "G") generates inorganic pyrophosphate (PPi), which is converted to ATP by the sulfurylase (blue arrow). Luciferase (red arrow) uses the ATP to convert luciferin to oxyluciferin, producing light.
  • Evaluation of next generation sequencing platforms for population targeted sequencing studies. Olivier Harismendy, Pauline C Ng, Robert L Strausberg, Xiaoyun Wang, Timothy B Stockwell, Karen Y Beeson, Nicholas J Schork, Sarah S Murray, Eric J Topol, Samuel Levy and Kelly A Frazer. Genome Biol. 2009; 10(3): R32. Published online 2009 March 27. doi: 10.1186/gb-2009-10-3-r32. PMCID: PMC2691003. Copyright © 2009 Harismendy et al.; licensee BioMed Central Ltd. Performance metrics of NGS technologies. (a-f) Error bars represent minimum and maximum values obtained from the four samples. (g-i) Venn diagram representation of false positive calls (g), false negative calls (h) and discrepant variants calls (i). The inset caption displays the color-coding of each NGS technology and overlaps: for Roche 454 (red), Illumina GA (yellow) and ABI SOLiD (blue). For each NGS platform the number of base calls with errors associated with specific sequence contexts is given (repeat = repetitive element). When two sequence contexts are present they are both listed.
  • Historical trends in storage prices versus DNA sequencing costs. The blue squares describe the historic cost of disk prices in megabytes per US dollar. The long-term trend (blue line, which is a straight line here because the plot is logarithmic) shows exponential growth in storage per dollar with a doubling time of roughly 1.5 years. The cost of DNA sequencing, expressed in base pairs per dollar, is shown by the red triangles. It follows an exponential curve (yellow line) with a doubling time slightly slower than disk storage until 2004, when next generation sequencing (NGS) causes an inflection in the curve to a doubling time of less than 6 months (red line). These curves are not corrected for inflation or for the 'fully loaded' cost of sequencing and disk storage, which would include personnel costs, depreciation and overhead.
  • Cloud computing and the DNA data race Journal name: Nature Biotechnology Volume: 28, Pages: 691–693 Year published: (2010) DOI: doi:10.1038/nbt0710-691
  • HWUSI-EAS100R the unique instrument name 6 flowcell lane 73 tile number within the flowcell lane 941 'x'-coordinate of the cluster within the tile 1973 'y'-coordinate of the cluster within the tile #0 index number for a multiplexed sample (0 for no indexing) /1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
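The field list above can be turned into a small parser. This is a minimal sketch, assuming the identifier is laid out as `instrument:lane:tile:x:y#index/pair` (the separators are an assumption; the slide only lists the fields). The names `IlluminaReadId` and `parse_read_id` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class IlluminaReadId(NamedTuple):
    instrument: str             # unique instrument name
    lane: int                   # flowcell lane
    tile: int                   # tile number within the lane
    x: int                      # x-coordinate of the cluster within the tile
    y: int                      # y-coordinate of the cluster within the tile
    index: str                  # index for a multiplexed sample ("0" for none)
    pair_member: Optional[int]  # 1 or 2 for paired reads, None otherwise


# Assumed layout: instrument:lane:tile:x:y#index/pair, with an optional leading "@".
_READ_ID = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_read_id(header: str) -> IlluminaReadId:
    """Split a read identifier into the fields listed on the slide."""
    match = _READ_ID.match(header.strip())
    if match is None:
        raise ValueError("unrecognized read identifier: %r" % header)
    pair = match.group("pair")
    return IlluminaReadId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        index=match.group("index"),
        pair_member=int(pair) if pair else None,
    )
```

Calling `parse_read_id("@HWUSI-EAS100R:6:73:941:1973#0/1")` yields the values listed in the note: lane 6, tile 73, coordinates 941 and 1973, index 0, first member of the pair.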
  • De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res. 2009. 19: 336-346, doi: 10.1101/gr.079053.108. Positional profile of base-calling errors for Illumina reads for 2 million 50-nt-long reads from a human BAC. The error rate across reads is shown (solid line) along with the error rate for reads with a fixed number of errors. The erroneous nucleotides in each read are detected by mapping the read to the reference genome. The high error rate in position 6 is due to the bias in our particular data set rather than a systematic problem with the Illumina technology.
  • Sequence reads with associated read identifiers are shown, with the regions that will be used for seed selection in capital letters and matched seeds of 0011 and 1100. Given read identifiers are associated with the seeds using a hash function (for example, a unique integer representation of each seed). Once such a hash table has been built for either the input read set or the reference genome, the corresponding data can be scanned with the same hash function, resulting in a much smaller subset of reads to more exactly align at each location in the genome.
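The hash-based lookup described above can be sketched as follows. This is a simplified illustration, not the scheme of any particular aligner: it hashes the reference with contiguous k-mers instead of the spaced seeds (0011, 1100) shown in the figure, and the function names are hypothetical.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Return reference start positions suggested by exact seed hits.

    Only these candidates would then be passed to a slower, more exact
    alignment step.
    """
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset  # where the read would begin on the reference
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)
```

With `index = build_seed_index("ACGTACGGTCA", 4)`, the read `"ACGGTC"` has three seeds that all point at the same start position, so a single candidate location is left to align exactly.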
  • Schematic representation of our implementation of the de Bruijn graph. Each node, represented by a single rectangle, represents a series of overlapping k-mers (in this case, k = 5), listed directly above or below. (Red) The last nucleotide of each k-mer. The sequence of those final nucleotides, copied in large letters in the rectangle, is the sequence of the node. The twin node, directly attached to the node, either below or above, represents the reverse series of reverse complement k-mers. Arcs are represented as arrows between nodes. The last k-mer of an arc’s origin overlaps with the first of its destination. Each arc has a symmetric arc. Note that the two nodes on the left could be merged into one without loss of information, because they form a chain.
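For readers new to de Bruijn graphs, here is a minimal sketch of the textbook construction, in which each k-mer becomes an arc between its (k-1)-mer prefix and suffix. It is deliberately simpler than the Velvet implementation described above: it has no twin nodes for reverse complements and does not merge chains. The function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, arcs are observed k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The arc goes from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

For the single read `"ACGTA"` and k = 3, the k-mers ACG, CGT and GTA give the chain AC → CG → GT → TA, which is the kind of unbranched path that can be merged into one node without loss of information.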
  • Genome Res. 2009 Sep;19(9):1586-92. Epub 2009 Aug 5. Sensitive and accurate detection of copy number variants using read depth of coverage. Yoon S, Xuan Z, Makarov V, Ye K, Sebat J. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.
  • cut -f 1,2 pileup.filtered.txt | awk '{printf("%s\t%d\t%d\n",$1,int($2)-1,int($2));}' > $@
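The one-liner above turns each 1-based pileup position into a 0-based, half-open interval of length one (chrom, pos-1, pos). The same conversion can be sketched in Python, assuming tab-delimited pileup lines with the chromosome in column 1 and the position in column 2; the function name is hypothetical.

```python
def pileup_line_to_bed(line):
    """Convert one pileup line to a three-column, BED-style interval."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos = cols[0], int(cols[1])
    # 1-based position -> 0-based start, end-exclusive end
    return "%s\t%d\t%d" % (chrom, pos - 1, pos)
```

For a line starting with `chr22`, `31232437`, this returns the interval `chr22 31232436 31232437`.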

Presentation Transcript

  • Next Generation Sequencing Nantes, December 10th 2010 Pierre Lindenbaum PhD [email_address] http://plindenbaum.blogspot.com Twitter: @yokofakun Institut du Thorax - INSERM UMR915
  • http://en.wikipedia.org/wiki/File:The_Thinker,_Rodin.jpg About me
  • This presentation will be posted on http://www.slideshare.net/lindenb
  • Thank you Biostar ( Istvan Albert,Jeremy Leipzig... ) http://biostar.stackexchange.com/questions/3355
  • “Next” Generation ?
  • http://en.wikipedia.org/wiki/File:ST_TOS_Cast.jpg
  • http://commons.wikimedia.org/wiki/File:Frederick_Sanger2.jpg 1977
  • http://en.wikipedia.org/wiki/File:Sequencing.jpg
  • http://en.wikipedia.org/wiki/Star_Trek:_The_Motion_Picture
  • http://www.flickr.com/photos/widdowquinn/4119516803/
  • http://commons.wikimedia.org/wiki/File:Sanger_sequencing_read_display.gif
  • http://www.nature.com/
  • http://en.wikipedia.org/wiki/Star_Trek_Next_Generation
  • 3 Main Technologies Solid
  • http://www.dkfz.de/gpcf/850.html
  • Credit: Illumina
  • http://www.dkfz.de/gpcf/850.html
  • http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn
  • http://www.dkfz.de/gpcf/849.html
  • http://www.flickr.com/photos/doe_jgi/4093644608
  • The development and impact of 454 sequencing Jonathan M Rothberg & John H Leamon Nature Biotechnology 26, 1117 - 1124 (2008) Published online: 9 October 2008 doi:10.1038/nbt1485
  • Genome Biol. 2009; 10(3): R32. Published online 2009 March 27. doi: 10.1186/gb-2009-10-3-r32. Evaluation of next generation sequencing platforms for population targeted sequencing studies
  • Published online 20 November 2008 | Nature | doi:10.1038/news.2008.1245 Human genomes in minutes? Not yet, but biotechnology company is on track for 2013.
  • Sequencing technologies — the next generation Michael L. Metzker Nature Reviews Genetics 11, 31-46 (January 2010) doi:10.1038/nrg2626
  • Storage
  • http://blogs.forbes.com/sciencebiz/2010/06/03/your-genome-is-coming/
  • Genome Biol. 2010;11(5):207. Epub 2010 May 5. The case for cloud computing in genome informatics.
  • http://www.flickr.com/photos/esquimo_2ooo/5241744434/
  • http://www.flickr.com/photos/jpf/152611490/
  • http://commons.wikimedia.org/wiki/File:Torchlight_zip.png
  • http://www.flickr.com/photos/coreburn/487357814/
  • http://www.cloudera.com/what-is-hadoop/hadoop-overview/
  • Solexa/Illumina Read Format. The syntax of Solexa/Illumina read format is almost identical to the FASTQ format, but the qualities are scaled differently. Given a character $sq, the following Perl code gives the Phred quality $Q: $Q = 10 * log(1 + 10 ** ((ord($sq) - 64) / 10.0)) / log(10); http://maq.sourceforge.net/fastq.shtml
  • Mapping the short reads onto a reference genome
  • “Running these accurate alignment algorithms as a full search of all possible places where the sequence may map is computationally infeasible.” Sense from sequence reads: methods for alignment and assembly Paul Flicek & Ewan Birney Nature Methods 6, S6 - S12 (2009) Published online: 15 October 2009 Corrected online: 6 May 2010 doi:10.1038/nmeth.1376
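The Perl formula above translates directly to Python. A minimal sketch, assuming the ASCII offset of 64 used in the Perl snippet; the function name is hypothetical.

```python
import math


def solexa_char_to_phred(quality_char, offset=64):
    """Convert one Solexa/Illumina quality character to a Phred quality.

    Implements Q = 10 * log10(1 + 10 ** (Qsolexa / 10)), where
    Qsolexa = ord(char) - offset.
    """
    solexa_quality = ord(quality_char) - offset
    return 10.0 * math.log10(1.0 + 10.0 ** (solexa_quality / 10.0))
```

The two scales nearly coincide for high qualities and diverge for low ones: a Solexa quality of 40 (character `h`) maps to a Phred quality very close to 40, while a Solexa quality of 0 (character `@`) maps to a Phred quality of about 3.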
  • HashTable Sense from sequence reads: methods for alignment and assembly Paul Flicek & Ewan Birney Nature Methods 6, S6 - S12 (2009) doi:10.1038/nmeth.1376
  • Hash the reads: MAQ, Illumina's ELAND. Hash the reference: SOAP1, BFAST, MOSAIK
  • Burrows-Wheeler Sense from sequence reads: methods for alignment and assembly Paul Flicek & Ewan Birney Nature Methods 6, S6 - S12 (2009) doi:10.1038/nmeth.1376
  • SOAP2 Bowtie BWA
  • http://www.broadinstitute.org/gsa/wiki/index.php/File:ExampleDiagram.png
  • De Bruijn graphs. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008. 18: 821-829, doi: 10.1101/gr.074492.107
  • Sense from sequence reads: methods for alignment and assembly Paul Flicek & Ewan Birney Nature Methods 6, S6 - S12 (2009) doi:10.1038/nmeth.1376
  • CNV detection Genome Res. 2009 Sep;19(9):1586-92. Epub 2009 Aug 5. Sensitive and accurate detection of copy number variants using read depth of coverage.
  • RNA-SEQ http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png gene regulation protein information
  • Exome Sequencing http://en.wikipedia.org/wiki/File:Exome_Sequencing_Workflow_1a.png
  • SAM A generic nucleotide alignment format Bioinformatics. 2009 Aug 15;25(16):2078-9. Epub 2009 Jun 8. The Sequence Alignment/Map format and SAMtools.
  • human-readable, scriptable
  • Field 1: Query name
    Field 2: Flag
    Field 3: Reference sequence name
    Field 4: 1-based leftmost coordinate of the clipped sequence
    Field 5: Mapping quality
    Field 6: CIGAR string
    Field 7: Mate reference sequence name
    Field 8: 1-based leftmost mate position of the clipped sequence
    Field 9: Insert size (5’ to 5’)
    Field 10: Query sequence
    Field 11: Sequence qualities
  • 1 name: SRR018111.1786
    2 flag: 83 (read paired/mapped/reverse strand/first in pair)
    3 refseq: chr22
    4 position: 31232437
    5 qual: 17
    6 cigar: 76M
    7 mate refseq: =
    8 mate position: 31232403
    9 insert size: -110
    10 GGCCCTTAAAATCACAAACTATGCTCAACTCACTCTCTACAGCTCTCATAATTTCCAAAATCTATTTTCTT
    11 41===@B=AA??B?B@A?BAAAABBBA@B@C<B>B@BBACBBBBBBCBBCABABBCCCBBBBCBABBBCBB
    12 XT:A:U
    13 NM:i:4
    14 SM:i:17
    15 AM:i:17
    16 X0:i:1
    17 X1:i:0
    18 XM:i:4
    19 XO:i:0
    20 XG:i:0
    21 MD:Z:6A34T0T8C24
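Because SAM is plain tab-delimited text, a record like the one above can be split with a few lines of code. A minimal sketch that reads the eleven mandatory fields and the optional `TAG:TYPE:VALUE` columns; the class and function names are hypothetical, and real pipelines would normally use an existing library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str        # field 1: query name
    flag: int         # field 2: bitwise flag
    rname: str        # field 3: reference sequence name
    pos: int          # field 4: 1-based leftmost coordinate
    mapq: int         # field 5: mapping quality
    cigar: str        # field 6: CIGAR string
    mate_rname: str   # field 7: mate reference name ("=" means same as rname)
    mate_pos: int     # field 8: 1-based leftmost mate position
    insert_size: int  # field 9: insert size
    seq: str          # field 10: query sequence
    qual: str         # field 11: sequence qualities
    tags: dict        # optional fields, keyed by two-letter tag


def parse_sam_line(line):
    """Parse one alignment line (not a header line) of a SAM file."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM record needs at least 11 fields")
    tags = {}
    for item in cols[11:]:
        name, type_code, value = item.split(":", 2)
        tags[name] = int(value) if type_code == "i" else value
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], int(cols[7]), int(cols[8]),
                     cols[9], cols[10], tags)
```

Tags of type `i` are converted to integers, so `NM:i:4` becomes `tags["NM"] == 4`, while string tags such as `MD:Z:6A34T0T8C24` are kept as text.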
  • Text vs. binary format
  • SAMFileReader inputSam = new SAMFileReader(inputSamOrBamFile); SAMFileWriter outputSam = new SAMFileWriterFactory().makeSAMOrBAMWriter(inputSam.getFileHeader(), true, outputSamOrBamFile); for ( SAMRecord samRecord : inputSam) { samRecord.setReadName(samRecord.getReadName().toUpperCase()); outputSam.addAlignment(samRecord); } outputSam.close(); inputSam.close();
  • compact, indexed alignments
  • Is flexible enough to store all the alignment information generated by various alignment programs
    Is simple enough to be easily generated by alignment programs or converted from existing alignment formats
    Is compact in file size
    Allows most operations on the alignment to work on a stream without loading the whole alignment into memory
    Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
  • CIGAR: Compact Idiosyncratic Gapped Alignment Report format. 'M' shows a match, 'I' shows an insertion, 'D' shows a deletion, 'H' hard clipping, 'S' soft clipping. http://www.flickr.com/photos/alexbrn/3032428454/
  • SAM Flags
    0x0001 the read is paired in sequencing, no matter whether it is mapped in a pair
    0x0002 the read is mapped in a proper pair
    0x0004 the query sequence itself is unmapped
    0x0008 the mate is unmapped
    0x0010 strand of the query (0 for forward; 1 for reverse strand)
    0x0020 strand of the mate
    0x0040 the read is the first read in a pair
    0x0080 the read is the second read in a pair
    0x0100 the alignment is not primary (a read having split hits may have multiple primary alignment records)
    0x0200 the read fails platform/vendor quality checks
    0x0400 the read is either a PCR duplicate or an optical duplicate
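A CIGAR string is a run of length/operation pairs, which makes it easy to take apart. A minimal sketch; besides the M, I, D, H and S operations named on the slide, the pattern also accepts N, P, = and X from the SAM specification. The function names are hypothetical.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":  # "*" means the CIGAR is unavailable
        return []
    ops = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]
    # Re-assemble the string to make sure nothing was silently skipped.
    if "".join("%d%s" % pair for pair in ops) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")
```

Insertions and clipped bases consume the read but not the reference, so `5S30M2I10M1D20M` covers 61 reference bases even though the read itself is 67 bases long.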
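The flag is a bit field, so each meaning can be tested with a bitwise AND. A minimal sketch using the eleven bits listed above; the short labels and the function name are hypothetical.

```python
# (bit, short label) pairs, in the order given on the slide
SAM_FLAGS = [
    (0x0001, "paired"),
    (0x0002, "proper pair"),
    (0x0004, "unmapped"),
    (0x0008, "mate unmapped"),
    (0x0010, "reverse strand"),
    (0x0020, "mate reverse strand"),
    (0x0040, "first in pair"),
    (0x0080, "second in pair"),
    (0x0100, "not primary"),
    (0x0200, "fails quality checks"),
    (0x0400, "duplicate"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag."""
    return [label for bit, label in SAM_FLAGS if flag & bit]
```

The flag 83 from the example record is 64 + 16 + 2 + 1, so it decodes to paired, proper pair, reverse strand and first in pair.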
  • SAMTOOLS http://commons.wikimedia.org/wiki/File:Swiss_Army_Knife_Wenger_Opened_20050627.jpg
  • http://samtools.sourceforge.net/
  • http://gorgonzola.cshl.edu/pfb/2010/LectureNotes/ngs2/ngs2.pdf
  • Pileup
    Chrom Position Ref Coverage Read bases Qualities
    seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
    seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
    seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
    seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
    seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
    seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
    seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
    seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
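The read-bases column packs several things into one string: `.` and `,` are matches to the reference on the forward and reverse strand, letters are mismatches, `^` followed by one character marks the start of a read (the character encodes its mapping quality), `$` marks the end of a read, and `+` or `-` followed by a length and bases marks an indel. A minimal sketch of a counter for that column; the function name and the keys of the result are hypothetical.

```python
from collections import Counter


def summarize_read_bases(read_bases):
    """Count reference matches, mismatching bases and indels in a pileup column."""
    counts = Counter()
    i = 0
    n = len(read_bases)
    while i < n:
        ch = read_bases[i]
        if ch == "^":        # read start; skip the mapping-quality character too
            i += 2
        elif ch == "$":      # read end
            i += 1
        elif ch in "+-":     # indel: sign, decimal length, then that many bases
            j = i + 1
            while j < n and read_bases[j].isdigit():
                j += 1
            length = int(read_bases[i + 1:j])
            counts["indel"] += 1
            i = j + length
        elif ch in ".,":     # match on the forward or reverse strand
            counts["ref"] += 1
            i += 1
        elif ch == "*":      # placeholder for a deleted base
            counts["deleted"] += 1
            i += 1
        else:                # mismatch; fold strands together
            counts[ch.upper()] += 1
            i += 1
    return counts
```

For position 279 above, the column `A..T,,.,.,...,,,.,.....` gives 21 reference matches plus one A and one T, which adds up to the reported coverage of 23.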
  • Genome (re)sequencing (why ?) http://www.nature.com/news/2008/080122/full/451378b.html
  • Map to known sequence
  • Exome Sequencing: 30,508,378 reads * 55 bp = 1,677,960,790 bp
  • http://vcftools.sourceforge.net/specs.html VCF format
  • GATK
  • http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
  • Visualizing the alignments
  • Samtools: TVIEW
  • http://www.broadinstitute.org/software/igv/
  • http://www.flickr.com/photos/ohm17/162622755/
  • Download FASTA sequence for chr22 (hg18)
  • curl --proxy ${PROXY} "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr22.fa.gz" | gunzip -c > chr22.fa
  • What's the length of chr22 ?
  • Index chr22 with samtools
  • ${sam.bin} faidx chr22.fa
  • chr22 49691432 7 50 51
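The `.fai` line above answers the question: its columns are the sequence name, its length in bases (49,691,432 for chr22), the byte offset of the first base in the file, the number of bases per line and the number of bytes per line. These five numbers are enough to jump straight to any base. A minimal sketch; the class and function names are hypothetical.

```python
from typing import NamedTuple


class FaiEntry(NamedTuple):
    name: str        # sequence name
    length: int      # total number of bases
    offset: int      # byte offset of the first base in the FASTA file
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the newline


def parse_fai_line(line):
    """Parse one line of a samtools faidx index."""
    name, length, offset, line_bases, line_width = line.split()[:5]
    return FaiEntry(name, int(length), int(offset), int(line_bases), int(line_width))


def byte_offset(entry, position):
    """Byte offset in the FASTA file of a 1-based position."""
    if not 1 <= position <= entry.length:
        raise ValueError("position outside the sequence")
    zero_based = position - 1
    full_lines = zero_based // entry.line_bases
    return entry.offset + full_lines * entry.line_width + zero_based % entry.line_bases
```

With the chr22 entry, base 1 sits at byte 7 (right after the 7-byte header line `>chr22` plus newline), base 50 at byte 56, and base 51 at byte 58 because one newline has been skipped.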
  • Get some FastQ files (simulation via samtools)
  • ${sam.dir}/misc/wgsim chr22.fa reads_1.fastq reads_2.fastq > _rand.txt
  • Index chr22 for BWA
  • ${bwa.bin} index -p chr22db -a bwtsw chr22.fa
  • 5, 4, 3, 2, 1... Align!
  • ${bwa.bin} aln chr22db reads_1.fastq > aln1.sai
    ${bwa.bin} aln chr22db reads_2.fastq > aln2.sai
  • Generate alignments in the SAM format given paired-end reads
  • ${bwa.bin} sampe chr22db aln1.sai aln2.sai reads_1.fastq reads_2.fastq > aln.sam
  • Convert SAM to BAM
  • ${sam.bin} view -b -T chr22.fa aln.sam > aln.bam
  • Sort the alignments by position
  • ${sam.bin} sort aln.bam sorted1
  • Remove the PCR duplicates
  • ${sam.bin} rmdup sorted1.bam sorted2.bam
  • Index the alignment
  • ${sam.bin} index sorted2.bam
  • What's the coverage/depth ?
  • java -jar ${gatk.jar} -T DepthOfCoverage -o file.depth -R chr22.fa -I sorted2.bam
  • GATK: recalibration
  • http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration
  • GATK: local realignment
  • http://www.broadinstitute.org/gsa/wiki/index.php/File:IndelRealignmentAlgorithm.png
  • java -jar ${gatk.jar} -T RealignerTargetCreator -R chr22.fa -o outputs.intervals -I sorted2.bam
    java -jar ${gatk.jar} -T IndelRealigner -I sorted2.bam -targetIntervals outputs.intervals -o $@ -R chr22.fa
    ....
    http://www.flickr.com/photos/didier57/2423562782/
  • Generate a pileup
  • ${sam.bin} pileup -v -c -f chr22.fa realigned.bam > pileup.txt
  • Filter the pileup
  • ${sam.dir}/misc/samtools.pl varFilter -d 5 pileup.txt > pileup.filtered.txt
  • Create a VCF
  • ${sam.dir}/misc/sam2vcf.pl -r chr22.fa < pileup.filtered.txt > pileup.vcf
  • View the alignment with tview
  • http://sift.jcvi.org/www/SIFT_chr_coords_submit.html
  • $1 Coordinates : 4,99981527,1,G/A
    $2 Codons : -
    $3 Transcript ID :
    $4 Protein ID :
    $5 Substitution : NA
    $6 Region : NON-GENIC
    $7 dbSNP ID : NA
    $8 SNP Type : NA
    $9 Prediction : Not scored
    $10 Score : NA
    $11 Median Info : NA
    $12 # Seqs at position : NA
    $13 Gene ID : !N/A
    $14 Gene Name : !N/A
    $15 Gene Desc : !N/A
    $16 Protein Family ID : !N/A
    $17 Protein Family Desc : !N/A
    $18 Transcript Status : !N/A
    $19 Protein Family Size : !N/A
    $20 OMIM Disease : !N/A
    $21 Average Allele Freqs : !N/A
    $22 CEU Allele Freqs : !N/A
    $23 User Comment : !N/A
  • http://genetics.bwh.harvard.edu/pph2/bgi.shtml
  • $1 #o_snp_id : chr19:1779391.TC.uc010dsr.1
    $2 snp_id : chr19:1779391.TC.uc010dsr.1
    $3 acc : Q05DB0
    $4 pos : 87
    $5 aa1 : N
    $6 aa2 : D
    $7 prediction : benign
    $8 pph2_prob : 0.001
    $9 pph2_FPR : 0.86
    $10 pph2_TPR : 0.994
    $11 Comments : !N/A
  • Give Galaxy a try
  • http://main.g2.bx.psu.edu/ Galaxy: A platform for interactive large-scale genome analysis: Genome Res. 2005. 15: 1451-1455
  • Use UCSC Table Browser to find the SNPs
  • Use UCSC mysql server to find the SNPs, the genes,...
  • Create a UCSC Custom Track
  • http://ged.msu.edu/angus/tutorials/ucsc-visualization.html
  • Wig example
    browser position chr19:59304200-59310700
    browser hide all
    track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
    variableStep chrom=chr19 span=150
    59304701 10.0
    59304901 12.5
    59305401 15.0
    59305601 17.5
    59305901 20.0
    59306081 17.5
    59306301 15.0
    59306691 12.5
    59307871 10.0
  • Create a ROR database from the VCF file
  • mkdir -p RAILS
    rails RAILS/rails4pileup
    awk -F '\t' 'BEGIN {printf("create table vcfs(id integer primary key,chrom varchar(50), position int, ref varchar(2), alt varchar(50),depth int);\n");} {printf("insert into vcfs(chrom,position,ref,alt,depth) values(\"%s\",%s,\"%s\",\"%s\",%s);\n",$$1,$$2,$$3,$$4,$$5);}' pileup.filtered.txt | sqlite3 RAILS/rails4pileup/db/vcf.sqlite3
    ruby RAILS/rails4pileup/script/generate scaffold vcf chrom:string position:int ref:string alt:string depth:int
    cat RAILS/rails4pileup/config/database.yml | sed 's/\(test\|development\|production\)\.sqlite3/vcf.sqlite3/' > /tmp/tmp.yml
    mv /tmp/tmp.yml RAILS/rails4pileup/config/database.yml
    echo "http://localhost:3000/vcfs"
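The loading step of the recipe above (awk generating SQL for sqlite3) can also be sketched with Python's standard `sqlite3` module. This is an alternative technique, not the author's method: it creates the same `vcfs` table and, like the slide, simply takes the first five tab-separated columns of each line as chrom, position, ref, alt and depth. The function name is hypothetical.

```python
import sqlite3


def load_variants(conn, pileup_lines):
    """Insert the first five columns of each pileup line into the vcfs table."""
    conn.execute(
        "create table if not exists vcfs("
        "id integer primary key, chrom varchar(50), position int, "
        "ref varchar(2), alt varchar(50), depth int)"
    )
    rows = []
    for line in pileup_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip blank or truncated lines
        rows.append((cols[0], int(cols[1]), cols[2], cols[3], int(cols[4])))
    # Parameter binding avoids the quoting problems of hand-built SQL strings.
    conn.executemany(
        "insert into vcfs(chrom, position, ref, alt, depth) values(?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

A typical call would be `load_variants(sqlite3.connect("vcf.sqlite3"), open("pileup.filtered.txt"))`, where the database file name is an assumption chosen to match the recipe.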
  • The end.