Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, NY
ss2489@cornell.edu // @SahaSurya
BTI PGRP Summer Internship Program 2014
Published in: Education, Technology
Sequencing

  1. Surya Saha, Sol Genomics Network (SGN), Boyce Thompson Institute, Ithaca, NY. ss2489@cornell.edu // @SahaSurya. BTI PGRP Summer Internship Program 2014. http://www.acgt.me/blog/2014/3/7/next-generation-sequencing-must-die
  2. Why Sequencing?
       • Targeted interrogation of genome
       • Economical
       • Technological developments
       • High-throughput assays
       • But requires subsequent validation
     7/8/2014 BTI PGRP Summer Internship Program 2014
  3. Sequencing over the Ages (slide credit: Aureliano Bombarely)
       • 1953: DNA structure discovery
       • 1977: Sanger DNA sequencing by chain-terminating inhibitors
       • 1984: Epstein-Barr virus (170 Kb)
       • 1987: ABI 370 sequencer
       • 1995: Haemophilus influenzae (1.83 Mb)
       • 2001: Homo sapiens (3.0 Gb)
       • 2005-2007: 454, Solexa, SOLiD
       • 2011: Ion Torrent, PacBio
       • Also on the timeline, through 2012-2013: Illumina, Illumina HiSeq X, 454, Pinus taeda (24 Gb)
  4. First generation sequencing
  5. Sanger method. Frederick Sanger (13 Aug 1918 – 19 Nov 2013) won the Nobel Prize for Chemistry in 1958 and 1980. He published the dideoxy chain termination method, or “Sanger method”, in 1977. http://dailym.ai/1f1XeTB
  6. Sanger method. http://bit.ly/1g6Cudq http://bit.ly/1lcQO4J
  7. First generation sequencing
       • Very high quality sequences (99.999%)
       • Very low throughput
     Capillary sequencing (ABI 3730xl):
       Run time   | Read length | Reads / run | Total nucleotides sequenced | Cost / Mb
       20 min-3 h | 400-900 bp  | 96 or 384   | 1.9-84 Kb                   | $2400
     http://bit.ly/1clLps3 http://1.usa.gov/1cLqIRd
  8. Name the specific technology used to generate the data:
       – Illumina HiSeq/MiSeq/NextSeq
       – Pacific Biosciences RS1/RSII
       – Ion Torrent Proton/PGM
       – SOLiD
     http://www.acgt.me/blog/2014/3/10/next-generation-sequencing-must-diepart-2
  9. 454 Pyrosequencing (GS FLX Titanium): one purified DNA fragment, to one bead, to one read. http://bit.ly/1ehwxWN http://bit.ly/1ehAcEh
  10. Illumina (source: Illumina). One column per instrument:
        Output          | 15 Gb      | 120 Gb      | 1000 Gb                          | 1800 Gb
        Number of reads | 25 million | 400 million | 4 billion                        | 6 billion
        Read length     | 2x300 bp   | 2x150 bp    | 2x125 bp (2x250 update mid-2014) | 2x150 bp
        Cost            | $99K       | $250K       | $740K                            | $10M
  11. Illumina. http://1.usa.gov/1fP9ybl
  12. Illumina: Moleculo. http://bit.ly/1aEPOBn
  13. Pacific Biosciences SMRT sequencing: Single Molecule Real Time sequencing. http://bit.ly/1naxgTe
  14. Pacific Biosciences SMRT sequencing: error correction methods
        • Hierarchical genome-assembly process (HGAP)
        • PBJelly (English et al., PLOS One, 2012)
  15. Pacific Biosciences SMRT sequencing: read lengths. 7/8/2014 Centre for Agricultural Bioinformatics, Pusa
  16. Oxford Nanopore. https://www.nanoporetech.com/
        • No data yet??
        • Error model: http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion
  17. Others
        • Ion Torrent Proton/PGM
        • Helicos
        • Nabsys
        • SOLiD
        • ……
  18. Comparison
  19. Next generation sequencing
        Platform            | Run time | Read length | Quality                     | Total nucleotides sequenced | Cost / Mb
        454 Pyrosequencing  | 24 h     | 700 bp      | Q20-Q30                     | 0.7 Gb                      | $10
        Illumina MiSeq      | 27 h     | 2x250 bp    | >Q30                        | 15 Gb                       | $0.15
        Illumina HiSeq 2500 | 11 days  | 2x125 bp    | >Q30                        | 1000 Gb                     | $0.05
        Ion Torrent         | 2 h      | 400 bp      | >Q20                        | 50 Mb-1 Gb                  | $1
        Pacific Biosciences | 2 h      | 8.5-20 kb   | >Q30 consensus, >Q10 single | 400-850 Mb / SMRT cell      | $0.33-$1
      http://bit.ly/1clLps3 http://1.usa.gov/1cLqIRd
  20. Next Generation Genomics: World Map of High-throughput Sequencers. http://omicsmaps.com/
  21. (Image-only slide.)
  22. Real cost of Sequencing!! (Sboner et al., Genome Biology, 2011)
  23. Library Types (slide credit: Aureliano Bombarely)
        • Single end
        • Paired end (PE, 150-800 bp, Fwd:/1, Rev:/2)
        • Mate pair (MP, 2 Kb to 20 Kb)
      Diagram labels: read orientations F (forward) and R (reverse); platforms 454/Roche and Illumina.
  24. Implications of Choice of Library (slide credit: Aureliano Bombarely)
        • Reads are assembled into a consensus sequence (contig).
        • Paired-read information links contigs into a scaffold (or supercontig), with gaps shown as NNNNN.
        • Genetic information (markers) orders scaffolds into a pseudomolecule (or ultracontig).
  25. Multiplexing Libraries (slide credit: Aureliano Bombarely). Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector. Example tags from the slide: AGTCGT and TGAGCA. Fragments carrying either tag are pooled and sequenced together.
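
The tag idea on the multiplexing slide can be sketched in a few lines of Python. This is a minimal illustration, not any particular demultiplexing tool: the sample names are hypothetical, the tags are the two shown on the slide, each read is assumed to start with its tag, and only exact tag matches are accepted.

```python
from collections import defaultdict

# Tags are the ones shown on the slide; the sample names are hypothetical.
BARCODES = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}


def demultiplex(reads, barcodes=BARCODES):
    """Group reads by their leading tag and strip the tag from each read.

    Assumes every tag has the same length and sits at the start of the read.
    Reads whose prefix matches no tag exactly go to "unassigned".
    """
    tag_len = len(next(iter(barcodes)))
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        bins[barcodes.get(tag, "unassigned")].append(insert)
    return dict(bins)


reads = ["AGTCGTACGTACGT", "TGAGCATTTTGGGG", "AGTCGTGGGGCCCC", "CCCCCCAAAATTTT"]
demuxed = demultiplex(reads)
```

Real tools usually tolerate one or two mismatches in the tag; exact matching is used here only to keep the sketch short.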
  26. File Formats: Fasta files (slide credit: Aureliano Bombarely). "It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes." (Wikipedia)
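
As a rough illustration of the FASTA layout, the sketch below reads records from an iterable of lines. It assumes the usual convention, which the slide does not spell out, that each record starts with a ">" header line and that the first word of the header is the record ID. The example records are made up.

```python
def parse_fasta(lines):
    """Return {record_id: sequence} from an iterable of FASTA lines.

    Assumes headers start with ">" and the ID is the first word of the header.
    Sequence lines belonging to one record are concatenated.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {rec_id: "".join(parts) for rec_id, parts in records.items()}


# Made-up example: one nucleotide record and one peptide record.
example = """>seq1 example nucleotide record
ACGTAC
GTTT
>pep1
MKV
"""
records = parse_fasta(example.splitlines())
```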
  27. File Formats: Fastq files (slide credit: Aureliano Bombarely). "FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores." (Wikipedia)
        • Single ID line with the at symbol (“@”) in the first column.
        • The sequence can span multiple lines after the ID line.
        • Single line with the plus symbol (“+”) in the first column, marking the start of the quality section.
        • The “+” line may repeat the ID.
        • Quality values can span multiple lines after the “+” line, but their total length must equal the sequence length.
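
The FASTQ rules listed on the slide can be turned into a small reader. The sketch below is one possible reading of those rules, not a reference parser: it collects sequence lines until the "+" line, then reads quality lines until their total length matches the sequence length. That length check matters because "@" is itself a valid quality character, so a quality line may start with it. The example reads are made up.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from an iterable of FASTQ lines.

    Sequence and quality may each span several lines; the quality block ends
    once it is as long as the sequence.
    """
    it = (ln.rstrip("\n") for ln in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        seq_parts = []
        for ln in it:
            if ln.startswith("+"):
                break
            seq_parts.append(ln)
        seq = "".join(seq_parts)
        qual = ""
        while len(qual) < len(seq):
            ln = next(it, None)
            if ln is None:
                raise ValueError(f"truncated quality block for {header!r}")
            qual += ln
        if len(qual) != len(seq):
            raise ValueError(f"quality length differs from sequence length for {header!r}")
        yield header[1:], seq, qual


# Made-up example; the second quality line of read1 starts with "@".
example = [
    "@read1 desc", "ACGT", "ACGT", "+read1", "IIII", "@@@@",
    "@read2", "GG", "+", "!#",
]
fastq_records = list(parse_fastq(example))
```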
  28. Quality control: Encoding. Fastq files:
        !"#$%&'()*+,-./0123456789        Offset by 33 (Phred+33)
        KLMNOPQRSTUVWXYZ[\]^_`abcdefgh   Offset by 64 (Phred+64)
  29. Quality control: Encoding
        !"#$%&'()*+,-./0123456789        Offset by 33 (Phred+33)
        KLMNOPQRSTUVWXYZ[\]^_`abcdefgh   Offset by 64 (Phred+64)
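
The two offsets can be made concrete with a short decoder: each quality character maps to a score by subtracting the offset from its ASCII code. The `guess_offset` helper is only a heuristic added for illustration and is not from the slides; it assumes that any character below "@" (ASCII 64) can only come from Phred+33, and otherwise guesses Phred+64, which can be wrong for uniformly high-quality Phred+33 reads.

```python
def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string to a list of integer scores."""
    scores = [ord(ch) - offset for ch in qual]
    if any(score < 0 for score in scores):
        raise ValueError(f"character below offset {offset} in {qual!r}")
    return scores


def guess_offset(qual):
    """Heuristic guess of the encoding offset (an assumption, not a standard).

    Characters with ASCII code below 64 cannot occur in Phred+64 data.
    """
    if not qual:
        raise ValueError("empty quality string")
    return 33 if min(ord(ch) for ch in qual) < 64 else 64


phred33_scores = decode_quality("!+5I", offset=33)
phred64_scores = decode_quality("KTh", offset=64)
```

With these inputs, "!" decodes to 0 and "I" to 40 under Phred+33, while "K" decodes to 11 and "h" to 40 under Phred+64.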
  30. Quality control: Encoding. http://bit.ly/N28yUd
      Phred score of a base is: Qphred = -10 × log10(e), where e is the estimated probability of the base being wrong.
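
A quick way to get a feel for the Phred formula is to compute it in both directions. The function names below are illustrative only. For instance, an error probability of 0.001 corresponds to Q30, and the 99.999% accuracy quoted for Sanger reads (e = 1e-5) corresponds to Q50.

```python
import math


def phred_from_error(e):
    """Phred score for an estimated error probability e (0 < e <= 1)."""
    if not 0 < e <= 1:
        raise ValueError("e must be in (0, 1]")
    return -10 * math.log10(e)


def error_from_phred(q):
    """Inverse of the Phred formula: error probability for score q."""
    return 10 ** (-q / 10)


q_for_one_in_thousand = phred_from_error(0.001)
q_for_sanger_accuracy = phred_from_error(1e-5)
e_for_q20 = error_from_phred(20)
```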
  31. Pre-processing: Tools
        Trimming
          • FastQC
          • FASTX toolkit
          • Trimmomatic
          • Scythe
        Joining paired-end reads
          • fastq-join
          • FLASH
          • PANDAseq
  32. Pre-processing: Error correction
  33. Thank you!!
