20110114 Next Generation Sequencing Course

Next Generation Sequencing course
2011-01-14, Nantes. (By the way, I remember where I found the idea of using Star Trek: it came from a presentation by the GATK team.)

Published in: Technology, Business

  1. 1. Next Generation Sequencing Nantes, December 10th 2010 Pierre Lindenbaum PhD [email_address] http://plindenbaum.blogspot.com Twitter: @yokofakun Institut du Thorax - INSERM UMR915
  2. 2. http://en.wikipedia.org/wiki/File:The_Thinker,_Rodin.jpg About me
  3. 3. This presentation will be posted on http://www.slideshare.net/lindenb
  4. 4. Thank you Biostar (Istvan Albert, Jeremy Leipzig...) http://biostar.stackexchange.com/questions/3355
  5. 5. “Next” Generation?
  6. 6. http://en.wikipedia.org/wiki/File:ST_TOS_Cast.jpg
  7. 7. http://commons.wikimedia.org/wiki/File:Frederick_Sanger2.jpg 1977
  8. 8. http://en.wikipedia.org/wiki/File:Sequencing.jpg
  9. 9. http://en.wikipedia.org/wiki/Star_Trek:_The_Motion_Picture
  10. 10. http://www.flickr.com/photos/widdowquinn/4119516803/
  11. 11. http://commons.wikimedia.org/wiki/File:Sanger_sequencing_read_display.gif
  12. 13. http://www.nature.com/
  13. 14. http://en.wikipedia.org/wiki/Star_Trek_Next_Generation
  14. 15. 3 Main Technologies SOLiD
  15. 17. http://www.dkfz.de/gpcf/850.html
  16. 18. Credit: Illumina
  17. 20. http://www.dkfz.de/gpcf/850.html
  18. 21. http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn
  19. 23. http://www.dkfz.de/gpcf/849.html
  20. 27. http://www.flickr.com/photos/doe_jgi/4093644608
  21. 29. The development and impact of 454 sequencing Jonathan M Rothberg & John H Leamon Nature Biotechnology 26, 1117 - 1124 (2008) Published online: 9 October 2008 doi:10.1038/nbt1485
  22. 33. Genome Biol. 2009; 10(3): R32. Published online 2009 March 27. doi: 10.1186/gb-2009-10-3-r32. Evaluation of next generation sequencing platforms for population targeted sequencing studies
  23. 35. Published online 20 November 2008 | Nature | doi:10.1038/news.2008.1245 Human genomes in minutes? Not yet, but biotechnology company is on track for 2013.
  24. 37. Sequencing technologies — the next generation Michael L. Metzker Nature Reviews Genetics 11, 31-46 (January 2010) doi:10.1038/nrg2626
  25. 38. Storage
  26. 40. http://blogs.forbes.com/sciencebiz/2010/06/03/your-genome-is-coming/
  27. 41. Genome Biol. 2010;11(5):207. Epub 2010 May 5. The case for cloud computing in genome informatics.
  28. 42. http://www.flickr.com/photos/esquimo_2ooo/5241744434/
  29. 43. http://www.flickr.com/photos/jpf/152611490/
  30. 44. http://commons.wikimedia.org/wiki/File:Torchlight_zip.png
  31. 45. http://www.flickr.com/photos/coreburn/487357814/
  32. 47. http://www.cloudera.com/what-is-hadoop/hadoop-overview/
  33. 50. FASTQ
  34. 51. @IL31_4368:1:1:996:8507/2
      TCCCTTACCCCCAAGCTCCATACCCTCCTAATGCCCACACCTCTTACCTTAGGA
      +
      FFCEFFFEEFFFFFFFEFFEFFFEFCFC<EEFEFFFCEFF<;EEFF=FEE?FCE
      @IL31_4368:1:1:996:21421/2
      CAAAAACTTTCACTTTACCTGCCGGGTTTCCCAGTTTACATTCCACTGTTTGAC
      +
      >DBDDB,B9BAA4AAB7BB?7BBB=91;+*@;5<87+*=/*@@?9=73=.7)7*
      @IL31_4368:1:1:997:10572/2
      GATCTTCTGTGACTGGAAGAAAATGTGTTACATATTACATTTCTGTCCCCATTG
      +
      E?=EECE<EEEE98EEEEAEEBD??BE@AEAB><EEABCEEDEC<<EBDA=DEE
      @IL31_4368:1:1:997:15684/2
      CAGCCTCAGATTCAGCATTCTCAAATTCAGCTGCGGCTGAAACAGCAGCAGGAC
      +
      EEEEDEEE9EAEEDEEEEEEEEEECEEAAEEDEE<CD=D=*BCAC?;CB,<D@,
      @IL31_4368:1:1:997:15249/2
      AATGTTCTGAAACCTCTGAGAAAGCAAATATTTATTTTAATGAAAAATCCTTAT
      +
      EDEEC;EEE;EEE?EECE;7AEEEEEE07EECEA;D6D>+EE4E7EEE4;E=EA
      @IL31_4368:1:1:997:6273/2
      ACATTTACCAAGACCAAAGGAAACTTACCTTGCAAGAATTAGACAGTTCATTTG
      +
      EEAAFFFEEFEFCFAFFAFCCFFEFEF>EFFFFB?ABA@ECEE=<F@DE@DDF;
      @IL31_4368:1:1:997:1657/2
      CCCACCTCTCTCAATGTTTTCCATATGGCAGGGACTCAGCACAGGTGGATTAAT
      (...)
  35. 52. Solexa/Illumina Read Format. The syntax of the Solexa/Illumina read format is almost identical to the FASTQ format, but the qualities are scaled differently. Given a character $sq, the following Perl code gives the Phred quality $Q: $Q = 10 * log(1 + 10 ** ((ord($sq) - 64) / 10.0)) / log(10); http://maq.sourceforge.net/fastq.shtml
  36. 54. Mapping the short reads on a reference genome
  37. 55. “Running these accurate alignment algorithms as a full search of all possible places where the sequence may map is computationally infeasible.” Sense from sequence reads: methods for alignment and assembly Paul Flicek & Ewan Birney Nature Methods 6, S6 - S12 (2009) Published online: 15 October 2009 Corrected online: 6 May 2010 doi:10.1038/nmeth.1376
  38. 56. HashTable Sense from sequence reads: methods for alignment and assembly Paul Flicek & Ewan Birney Nature Methods 6, S6 - S12 (2009) doi:10.1038/nmeth.1376
  39. 57. Hash reads: MAQ, Illumina's ELAND. Hash reference: SOAP1, BFAST, MOSAIK.
  40. 58. Burrows-Wheeler Sense from sequence reads: methods for alignment and assembly Paul Flicek & Ewan Birney Nature Methods 6, S6 - S12 (2009) doi:10.1038/nmeth.1376
  41. 59. SOAP2 Bowtie BWA
  42. 60. http://www.broadinstitute.org/gsa/wiki/index.php/File:ExampleDiagram.png
  43. 61. DE NOVO SEQUENCING
  44. 62. De Bruijn graphs. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs doi: 10.1101/gr.074492.107 Genome Res. 2008. 18: 821-829
  45. 63. Sense from sequence reads: methods for alignment and assembly Paul Flicek & Ewan Birney Nature Methods 6, S6 - S12 (2009) doi:10.1038/nmeth.1376
  46. 64. CNV detection Genome Res. 2009 Sep;19(9):1586-92. Epub 2009 Aug 5. Sensitive and accurate detection of copy number variants using read depth of coverage.
  47. 65. RNA-SEQ http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png gene regulation protein information
  48. 66. Exome Sequencing http://en.wikipedia.org/wiki/File:Exome_Sequencing_Workflow_1a.png
  49. 67. SAM A generic nucleotide alignment format Bioinformatics. 2009 Aug 15;25(16):2078-9. Epub 2009 Jun 8. The Sequence Alignment/Map format and SAMtools.
  50. 68. human-readable, scriptable
  51. 69. Field 1: Query name
      Field 2: Flag
      Field 3: Reference sequence name
      Field 4: 1-based leftmost coordinate of the clipped sequence
      Field 5: Mapping quality
      Field 6: CIGAR string
      Field 7: Mate reference sequence name
      Field 8: 1-based leftmost mate position of the clipped sequence
      Field 9: Insert size (5’ to 5’)
      Field 10: Query sequence
      Field 11: Sequence qualities
  52. 70. 1 name: SRR018111.1786
      2 flag: 83 (read paired / mapped in a proper pair / reverse strand / first in pair)
      3 refseq: chr22
      4 position: 31232437
      5 qual: 17
      6 cigar: 76M
      7 =
      8 mate pos: 31232403
      9 insert size: -110
      10 GGCCCTTAAAATCACAAACTATGCTCAACTCACTCTCTACAGCTCTCATAATTTCCAAAATCTATTTTCTT
      11 41===@B=AA??B?B@A?BAAAABBBA@B@C<B>B@BBACBBBBBBCBBCABABBCCCBBBBCBABBBCBB
      12 XT:A:U
      13 NM:i:4
      14 SM:i:17
      15 AM:i:17
      16 X0:i:1
      17 X1:i:0
      18 XM:i:4
      19 XO:i:0
      20 XG:i:0
      21 MD:Z:6A34T0T8C24
  53. 71. Text vs. binary format
  54. 72. SAMFileReader inputSam = new SAMFileReader(inputSamOrBamFile);
      SAMFileWriter outputSam = new SAMFileWriterFactory().makeSAMOrBAMWriter(inputSam.getFileHeader(), true, outputSamOrBamFile);
      for (SAMRecord samRecord : inputSam) {
          samRecord.setReadName(samRecord.getReadName().toUpperCase());
          outputSam.addAlignment(samRecord);
      }
      outputSam.close();
      inputSam.close();
  55. 73. compact, indexed alignments
  56. 74. Is flexible enough to store all the alignment information generated by various alignment programs
      Is simple enough to be easily generated by alignment programs or converted from existing alignment formats
      Is compact in file size
      Allows most operations on the alignment to work on a stream without loading the whole alignment into memory
      Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
  57. 75. CIGAR: Compact Idiosyncratic Gapped Alignment Report format
      'M' shows a match
      'I' shows an insertion
      'D' shows a deletion
      'H' hard clipping
      'S' soft clipping
      http://www.flickr.com/photos/alexbrn/3032428454/
  58. 76. SAM Flags
      0x0001 the read is paired in sequencing, no matter whether it is mapped in a pair
      0x0002 the read is mapped in a proper pair
      0x0004 the query sequence itself is unmapped
      0x0008 the mate is unmapped
      0x0010 strand of the query (0 for forward; 1 for reverse strand)
      0x0020 strand of the mate
      0x0040 the read is the first read in a pair
      0x0080 the read is the second read in a pair
      0x0100 the alignment is not primary (a read having split hits may have multiple primary alignment records)
      0x0200 the read fails platform/vendor quality checks
      0x0400 the read is either a PCR duplicate or an optical duplicate
  59. 77. SAMTOOLS http://commons.wikimedia.org/wiki/File:Swiss_Army_Knife_Wenger_Opened_20050627.jpg
  60. 78. http://samtools.sourceforge.net/
  61. 79. http://gorgonzola.cshl.edu/pfb/2010/LectureNotes/ngs2/ngs2.pdf
  62. 80. Pileup
      Chrom  Position  Ref  Coverage  Read bases  Qualities
      seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
      seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
      seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
      seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
      seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
      seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
      seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
      seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
  63. 81. Genome (re)sequencing (why ?) http://www.nature.com/news/2008/080122/full/451378b.html
  64. 82. Map to known sequence
  65. 84. Exome Sequencing: 30,508,378 reads * 55 bp = 1,677,960,790 bp
  66. 85. http://vcftools.sourceforge.net/specs.html VCF format
  67. 86. GATK
  68. 87. http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
  69. 89. Visualizing the alignments
  70. 90. Samtools: TVIEW
  71. 91. http://www.broadinstitute.org/software/igv/
  72. 92. http://www.flickr.com/photos/ohm17/162622755/
  73. 96. Download FASTA sequence for chr22 (hg18)
  74. 97. curl --proxy ${PROXY} "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr22.fa.gz" | gunzip -c > chr22.fa
  75. 98. What's the length of chr22 ?
  76. 99. Index chr22 with samtools
  77. 100. ${sam.bin} faidx chr22.fa
  78. 101. chr22 49691432 7 50 51
  79. 102. Get some FastQ files (simulation via samtools)
  80. 103. ${sam.dir}/misc/wgsim chr22.fa reads_1.fastq reads_2.fastq > _rand.txt
  81. 104. Index chr22 for BWA
  82. 105. ${bwa.bin} index -p chr22db -a bwtsw chr22.fa
  83. 106. 5, 4, 3, 2, 1... Align!
  84. 107. ${bwa.bin} aln chr22db reads_1.fastq > aln1.sai
      ${bwa.bin} aln chr22db reads_2.fastq > aln2.sai
  85. 108. Generate alignments in the SAM format given paired-end reads
  86. 109. ${bwa.bin} sampe chr22db aln1.sai aln2.sai reads_1.fastq reads_2.fastq > aln.sam
  87. 110. Convert SAM to BAM
  88. 111. ${sam.bin} view -b -T chr22.fa aln.sam > aln.bam
  89. 112. Sort the alignments by position
  90. 113. ${sam.bin} sort aln.bam sorted1
  91. 114. Remove the PCR duplicates
  92. 115. ${sam.bin} rmdup sorted1.bam sorted2.bam
  93. 116. Index the alignment
  94. 117. ${sam.bin} index sorted2.bam
  95. 118. What's the coverage/depth ?
  96. 119. java -jar ${gatk.jar} -T DepthOfCoverage -o file.depth -R chr22.fa -I sorted2.bam
  97. 120. GATK: recalibration
  98. 121. http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration
  99. 122. GATK: local realignment
  100. 123. http://www.broadinstitute.org/gsa/wiki/index.php/File:IndelRealignmentAlgorithm.png
  101. 124. java -jar ${gatk.jar} -T RealignerTargetCreator -R chr22.fa -o outputs.intervals -I sorted2.bam
       java -jar ${gatk.jar} -T IndelRealigner -I sorted2.bam -targetIntervals outputs.intervals -o $@ -R chr22.fa ....
       http://www.flickr.com/photos/didier57/2423562782/
  102. 125. Generate a pileup
  103. 126. ${sam.bin} pileup -v -c -f chr22.fa realigned.bam > pileup.txt
  104. 127. Filter the pileup
  105. 128. ${sam.dir}/misc/samtools.pl varFilter -d 5 pileup.txt > pileup.filtered.txt
  106. 129. Create a VCF
  107. 130. ${sam.dir}/misc/sam2vcf.pl -r chr22.fa < pileup.filtered.txt > pileup.vcf
  108. 131. View the alignment with tview
  109. 132. http://sift.jcvi.org/www/SIFT_chr_coords_submit.html
  110. 133. $1 Coordinates : 4,99981527,1,G/A
       $2 Codons : -
       $3 Transcript ID :
       $4 Protein ID :
       $5 Substitution : NA
       $6 Region : NON-GENIC
       $7 dbSNP ID : NA
       $8 SNP Type : NA
       $9 Prediction : Not scored
       $10 Score : NA
       $11 Median Info : NA
       $12 # Seqs at position : NA
       $13 Gene ID : !N/A
       $14 Gene Name : !N/A
       $15 Gene Desc : !N/A
       $16 Protein Family ID : !N/A
       $17 Protein Family Desc : !N/A
       $18 Transcript Status : !N/A
       $19 Protein Family Size : !N/A
       $20 OMIM Disease : !N/A
       $21 Average Allele Freqs : !N/A
       $22 CEU Allele Freqs : !N/A
       $23 User Comment : !N/A
  111. 134. http://genetics.bwh.harvard.edu/pph2/bgi.shtml
  112. 135. $1 #o_snp_id : chr19:1779391.TC.uc010dsr.1
       $2 snp_id : chr19:1779391.TC.uc010dsr.1
       $3 acc : Q05DB0
       $4 pos : 87
       $5 aa1 : N
       $6 aa2 : D
       $7 prediction : benign
       $8 pph2_prob : 0.001
       $9 pph2_FPR : 0.86
       $10 pph2_TPR : 0.994
       $11 Comments : !N/A
  113. 136. Give Galaxy a try
  114. 137. http://main.g2.bx.psu.edu/ Galaxy: A platform for interactive large-scale genome analysis: Genome Res. 2005. 15: 1451-1455
  115. 138. Use UCSC Table Browser to find the SNPs
  116. 139. Use UCSC mysql server to find the SNPs, the genes,...
  117. 140. Create a UCSC Custom Track
  118. 141. http://ged.msu.edu/angus/tutorials/ucsc-visualization.html
  119. 142. Wig example
       browser position chr19:59304200-59310700
       browser hide all
       track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
       variableStep chrom=chr19 span=150
       59304701 10.0
       59304901 12.5
       59305401 15.0
       59305601 17.5
       59305901 20.0
       59306081 17.5
       59306301 15.0
       59306691 12.5
       59307871 10.0
  120. 143. Create a ROR database from the VCF file
  121. 144. mkdir -p RAILS
       rails RAILS/rails4pileup
       awk -F ' ' 'BEGIN {printf(" create table vcfs(id integer primary key,chrom varchar(50), position int, ref varchar(2), alt varchar(50),depth int);\n");} {printf("insert into vcfs(chrom,position,ref,alt,depth) values(\"%s\",%s,\"%s\",\"%s\",%s);\n",$$1,$$2,$$3,$$4,$$5);}' pileup.filtered.txt | sqlite3 RAILS/rails4pileup/db/vcf.sqlite3
       ruby RAILS/rails4pileup/script/generate scaffold vcf chrom:string position:int ref:string alt:string depth:int
       cat RAILS/rails4pileup/config/database.yml | sed 's/\(test\|development\|production\).sqlite3/vcf.sqlite3/' > /tmp/tmp.yml
       mv /tmp/tmp.yml RAILS/rails4pileup/config/database.yml
       echo "http://localhost:3000/vcfs"
  122. 145. The end.
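
Addendum to the transcript: slides 52, 75 and 76 describe three small encodings (Solexa quality characters, CIGAR strings and SAM flag bits) that are easier to follow with a worked example. The Python sketch below is not from the original deck; the function names `solexa_to_phred`, `parse_cigar` and `decode_flag` and the short flag labels are made up for illustration. The flag bits and the quality formula are the ones given on the slides; the CIGAR letters beyond M, I, D, H and S are taken from the SAM specification rather than from the slides.

```python
import math
import re

# Flag bits as listed on the "SAM Flags" slide; the short labels are illustrative.
SAM_FLAGS = [
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "reverse_strand"),
    (0x0020, "mate_reverse_strand"),
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
    (0x0100, "not_primary"),
    (0x0200, "fails_quality_checks"),
    (0x0400, "duplicate"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs.

    Text that does not look like <number><operation> is silently skipped;
    a stricter parser would reject it.
    """
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def solexa_to_phred(char):
    """Convert one Solexa/Illumina quality character to a Phred quality.

    Same formula as the Perl one-liner on slide 52:
    Q = 10 * log10(1 + 10 ** ((ord(char) - 64) / 10))
    """
    return 10.0 * math.log10(1.0 + 10.0 ** ((ord(char) - 64) / 10.0))


# Values taken from the example SAM record on slide 70.
print(decode_flag(83))    # ['paired', 'proper_pair', 'reverse_strand', 'first_in_pair']
print(parse_cigar("76M"))  # [(76, 'M')]
print(round(solexa_to_phred("h"), 2))
```

Flag 83 is 0x0040 + 0x0010 + 0x0002 + 0x0001, which matches the annotation next to the flag on slide 70.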
