1. Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, NY
ss2489@cornell.edu // @SahaSurya
BTI PGRP Internship Program 2015
http://www.acgt.me/blog/2015/3/7/next-generation-sequencing-must-die
6. Sequencing: Then and Now
[Figure: timeline of sequencing milestones]
1953: DNA structure discovery
1977: Sanger DNA sequencing by chain-terminating inhibitors
1984: Epstein-Barr virus (170 Kb)
1987: ABI 370 sequencer
1995: Haemophilus influenzae (1.83 Mb)
2001: Homo sapiens (3.0 Gb)
2005 to 2007: 454, Solexa, SOLiD
2011: Ion Torrent, PacBio
2014: Pinus taeda (24 Gb), Nanopore MinION
Other timeline labels: 2012, 2013, Illumina, Illumina HiSeq X, 454
Slide design credit: Aureliano Bombarely
6/11/2015 BTI PGRP Summer Internship Program 2015
7. First generation sequencing
Sanger. Annu Rev Biochem. 1988;57:1-28.
Thanks to Nick Loman for the mention
10. Sanger method (1977)
Frederick Sanger
13 Aug 1918 – 19 Nov 2013
Won the Nobel Prize for Chemistry in 1958 and
1980. Published the dideoxy chain termination
method or “Sanger method” in 1977
http://dailym.ai/1f1XeTB
12. First generation sequencing
• Very high quality sequences (99.999% or Q50)
• Very low throughput
Capillary sequencing (ABI 3730xl):
Run time: 20 min to 3 h
Read length: 400-900 bp
Reads per run: 96 or 384
Total nucleotides sequenced: 1.9-84 Kb
Cost per Mb: $2400
http://www.hindawi.com/journals/bmri/2012/251364/tab1/
15. Mention the specific technology
used to generate the data
– Illumina HiSeq/MiSeq/NextSeq
– Pacific Biosciences RS/RS II
– Ion Torrent Proton/PGM
– SOLiD
– Oxford Nanopore MinION
http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-die-part-2
16. 454 Pyrosequencing
One purified DNA
fragment, to one bead, to
one read.
http://www.genengnews.com/
GS FLX
Titanium
https://mariamuir.com/wp-content/uploads/2013/04/rip.gif
17. Illumina
Instrument: output; reads per flow cell; read length; cost
MiSeq: 0.3-15 Gb; 25 million; 2x300 bp; $99K
NextSeq 500: 20-120 Gb; 130-400 million; 2x150 bp; $250K
HiSeq 2500/3000/4000: 10-1500 Gb; 300 million to 2.5 billion; 2x250 to 2x125 bp; $740K
HiSeq X: 900-1800 Gb; 3 billion; 2x150 bp; $10M (10 units)
Source: Illumina
18. Illumina
Same instrument comparison as on the previous slide, with one added annotation:
$1000 human genome??
22. Pacific Biosciences SMRT sequencing
Single Molecule Real
Time sequencing
http://smrt.med.cornell.edu/images/pacbio_library_prep-1.gif
23. Pacific Biosciences SMRT sequencing
Error correction methods
Hierarchical genome-assembly process (HGAP)
PBJelly (English et al., PLoS ONE, 2012)
25. Pacific Biosciences SMRT sequencing
Read Lengths
http://www.igs.umaryland.edu/labs/grc/
Mean Read Length: 8391 bp
Maximum Subread Length: 24585 bp
41. Real cost of Sequencing!!
Sboner et al., Genome Biology, 2011
42. Sequencing Data and Concepts
43. Library Types
Single end
Paired end (PE, 150-800 bp; Fwd: /1, Rev: /2)
Mate pair (MP, 2 Kb to 20 Kb)
[Figure: forward (F) and reverse (R) read orientations for each library type, with 454/Roche and Illumina examples]
Slide credit: Aureliano Bombarely
44. Implications of Choice of Library
[Figure: assembly hierarchy]
Reads are assembled into a consensus sequence (contig).
Paired read information links contigs into a scaffold (or supercontig), with gaps shown as Ns (NNNNN).
Genetic information (markers) or optical maps order scaffolds into a pseudomolecule (or ultracontig).
Slide credit: Aureliano Bombarely
45. Multiplexing Libraries
Use of different tags (4-6 nucleotides) to identify
different samples in the same lane/sector.
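A minimal Python sketch of how tagged reads could be sorted back into samples after sequencing. It assumes the tag sits at the very start of each read and must match exactly; the function name `demultiplex`, the sample names, and the example reads are hypothetical, and real pipelines usually tolerate sequencing errors in the tag.

```python
from collections import defaultdict


def demultiplex(reads, tags):
    """Group reads by the sample tag found at the start of each read.

    reads: iterable of read sequences (strings)
    tags:  dict mapping tag sequence -> sample name
    Returns a dict of sample name -> list of reads with the tag removed.
    Reads that match no tag are collected under "unassigned", untrimmed.
    """
    bins = defaultdict(list)
    for read in reads:
        for tag, sample in tags.items():
            if read.startswith(tag):
                bins[sample].append(read[len(tag):])  # strip the tag
                break
        else:
            bins["unassigned"].append(read)
    return dict(bins)


# Hypothetical sample names; the two tags are the ones shown on the slide.
tags = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}
reads = ["AGTCGTACGTTGCA", "TGAGCAGGGTTTAA", "AGTCGTCCCCAAAA", "NNNNNNACGTACGT"]
demultiplexed = demultiplex(reads, tags)
```

Exact prefix matching keeps the idea visible; allowing one mismatch in the tag would be a natural extension.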
[Figure: reads from two samples, tagged AGTCGT and TGAGCA, pooled for sequencing]
Slide credit: Aureliano Bombarely
46. FASTA files:
It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia
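As a rough illustration, a FASTA file can be read with a few lines of Python. This sketch assumes the usual convention that each record starts with a ">" header line followed by one or more sequence lines; the function name `parse_fasta` and the example records are made up for illustration.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records: one nucleotide sequence, one peptide sequence.
example = """>seq1 hypothetical example record
ACGTACGT
TTGA
>pep1
MKV
""".splitlines()

records = list(parse_fasta(example))
```

For real work, a tested library such as Biopython or BioPerl is a safer choice than a hand-written parser.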
File Formats
Slide credit: Aureliano Bombarely
47. Fastq files:
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
-Wikipedia
• Single ID line with an at symbol (“@”) in the first column.
• The sequence may span multiple lines after the ID line.
• Single line with a plus symbol (“+”) in the first column, marking the start of the quality values.
• The “+” line may optionally repeat the ID.
• Quality values may span multiple lines after the “+” line, but their total length must equal the sequence length.
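The rules above can be sketched as a small Python reader. This is an illustrative sketch rather than a production parser: the name `parse_fastq` and the example reads are hypothetical. Because a quality line may itself begin with "@", the sketch reads quality lines until their combined length matches the sequence length instead of looking for the next "@".

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ lines.

    Handles sequences and qualities wrapped over multiple lines.
    """
    it = iter(lines)
    for line in it:
        line = line.rstrip("\n")
        if not line:
            continue
        if not line.startswith("@"):
            raise ValueError(f"expected '@' ID line, got: {line!r}")
        read_id = line[1:]

        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in it:
            line = line.rstrip("\n")
            if line.startswith("+"):
                break
            seq_parts.append(line)
        seq = "".join(seq_parts)

        # Quality lines run until we have as many values as bases.
        qual = ""
        while len(qual) < len(seq):
            try:
                qual += next(it).rstrip("\n")
            except StopIteration:
                raise ValueError(f"truncated quality for {read_id}") from None
        if len(qual) != len(seq):
            raise ValueError(f"quality/sequence length mismatch for {read_id}")
        yield read_id, seq, qual


# Hypothetical reads; the first one wraps sequence and quality over two lines.
example = ["@read1", "ACGT", "TTGA", "+read1", "IIII", "@@II",
           "@read2", "GG", "+", "!I"]
fastq_records = list(parse_fastq(example))
```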
Slide credit: Aureliano Bombarely
File Formats
49. Quality control: Encoding
!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)
KLMNOPQRSTUVWXYZ[]^_`abcdefgh Offset by 64 (Phred+64)
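A short Python sketch of what the offset means in practice: each quality character is converted to its ASCII code and the offset (33 or 64) is subtracted to recover the Phred score. The function name `decode_qualities` and the example strings are hypothetical.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    scores = [ord(ch) - offset for ch in quality_string]
    if any(score < 0 for score in scores):
        # A negative score means the string was not encoded with this offset.
        raise ValueError("character below the chosen offset; wrong encoding?")
    return scores


phred33_scores = decode_qualities("!+5?I", offset=33)
phred64_scores = decode_qualities("@JTh", offset=64)
```

Decoding a Phred+33 string with offset 64 typically raises the error above, which is a cheap sanity check when the encoding of a file is unknown.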
50. Quality control: Encoding
http://en.wikipedia.org/wiki/Phred_quality_score
The Phred score of a base is:
Qphred = -10 log10(e)
where e is the estimated probability of the base being wrong
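The formula can be checked numerically with a small Python sketch (the function names are hypothetical). It converts in both directions, so an error probability of 1 in 1000 gives Q30, and Q50 corresponds to the 99.999% accuracy quoted earlier for Sanger reads.

```python
import math


def phred_from_error(e):
    """Phred score for an estimated error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


def error_from_phred(q):
    """Estimated error probability for a Phred score q."""
    return 10 ** (-q / 10)


q30 = phred_from_error(0.001)           # 1 error in 1000 bases
accuracy_q50 = 1 - error_from_phred(50)  # fraction of bases expected correct
```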
53. Sequencing done! Now What??
• One HiSeq run can produce up to 1500 GB (1.5 TB) of data
• How much is 250 GB of data?
– 250,000,000,000 characters
– 3000 characters per sheet
– 100 sheets / cm
– Stack of ~8300 m
Mount Everest - 8848m
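The stack-of-paper estimate can be reproduced in a few lines of Python. The sketch assumes one character per byte and uses the per-sheet and per-centimetre figures from the slide; the variable names are made up.

```python
data_bytes = 250 * 10**9   # 250 GB, assuming one character per byte
chars_per_sheet = 3000
sheets_per_cm = 100
everest_m = 8848

sheets = data_bytes / chars_per_sheet      # about 83 million sheets
stack_height_cm = sheets / sheets_per_cm
stack_height_m = stack_height_cm / 100     # a little over 8 km

print(f"{sheets:,.0f} sheets, stack of {stack_height_m:,.0f} m")
```

The result is a little over 8 km, just short of the height of Mount Everest.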
54. Increase in Sequencing Data
L. Stein, Genome Biology, 2010
Slide credit: Lukas Mueller
57. What is bioinformatics?
Bioinformatics /baɪ.oʊˌɪnfərˈmætɪks/ is the application of computer science and information technology to the field of biology and medicine.
Slide credit: Lukas Mueller
58. Bioinformatics deals with
Algorithms, databases and information systems, web
technologies, artificial intelligence and soft computing,
information and computation theory, software
engineering, data mining, image processing, modeling
and simulation, signal processing, discrete mathematics,
control and system theory, circuit theory, and statistics.
Generation of new knowledge in biology and medicine,
and improving & discovering new models of computation
(e.g. DNA computing, neural computing, evolutionary
computing, immuno-computing, swarm-computing,
cellular-computing).
Slide credit: Lukas Mueller
59. Bioinformatics can...
Identify similar sequences
Provide a putative function for a sequence
Assemble sequences (genomes, transcriptomes)
Annotate genomes
Identify differentially expressed genes
Build networks of genes or metabolites
Determine phylogenetic relationships
Mine literature for biological information
Uncover differences between two genomes
Calculate how a protein folds
Slide credit: Lukas Mueller
60. What can bioinformatics do for me?
Majority of projects involve large datasets
Speed up your research
Enable you to ask new questions
Basic knowledge of bioinformatics needed
Extract information
Transform information
Run analyses
Build hypotheses, etc.
Slide credit: Lukas Mueller
62. Linux
Unix-like, free and open source operating system
Very stable, easy to use
Created by Linus Torvalds in 1991 as a student
Adopted for most bioinformatics work
Also: installed on cell phones, laptops, desktops, clusters, supercomputers
Can run on your computer!
Virtualized or native
http://www.linux-netbook.com/linux/distributions/
Slide credit: Lukas Mueller
64. Further Reading
Plant Bioinformatics Course
• Virtual machine setup instructions
• Slides for Linux, Sequencing, RNAseq, NGS Read
Mapping and R graphics
• http://btiplantbioinfocourse.wordpress.com
Slide credit: Lukas Mueller
65. Scripting
Scripts: Small programs written by the end-user that control the execution of other programs or perform a simple algorithm
Extremely flexible
Written in Shell, Perl, Python
You can write them yourself!!!
Slide credit: Lukas Mueller
66. Perl
Developed since the 1980s by Larry Wall
Useful for bioinformatics and web development
Support for objects
Excellent integration of regular expressions (text handling language)
Vast open source code library (http://cpan.org/)
BioPerl (http://bioperl.org/)
Easy to learn
http://www.perl.org/
Slide credit: Lukas Mueller
67. Python
Created by Guido van Rossum in 1989
Very elegant language
Biopython libraries
The “new” popular language
Many frameworks (Django for web, etc.)
Slide credit: Lukas Mueller
68. R
Language designed for statistics
Support for matrix calculations, graphics
Expression analysis, Next-Gen sequence analysis,
Graphics, genome annotation statistics, phylogeny
Interactive
Slide credit: Lukas Mueller
69. Databases
Need to store and query data
Biological data is highly structured
Relational database systems
Non-relational systems
Slide credit: Lukas Mueller