Sequencing and Bioinformatics PGRP Summer 2015

Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, NY
ss2489@cornell.edu // @SahaSurya
BTI PGRP Internship Program 2015
http://www.acgt.me/blog/2015/3/7/next-generation-sequencing-must-die

Hello Experiment!
• Experimental design for survey
  – Sample size
  – Locations
  – Phenotypes
• Experimental design to identify genetic differences
  – PCR-based
    • Simple Sequence Repeats
    • Other markers
  – Sequencing-based
    • Genes of interest
    • Single Nucleotide Polymorphisms
    • Gene expression
    • Genotyping by Sequencing
Early Blight infected tomato plants
http://www.longislandhort.cornell.edu/vegpath/photos/early_blight.htm
6/11/2015 BTI PGRP Summer Internship Program 2015

Why Sequencing?
• Targeted interrogation of genome
• Economical
• Technological developments
• High-throughput assays
• But requires subsequent validation

Sequencing: Then and Now
Slide design credit: Aureliano Bombarely
[Timeline figure, 1953 to 2014]
• 1953: DNA structure discovery
• 1977: Sanger DNA sequencing by chain-terminating inhibitors
• 1984: Epstein-Barr virus (170 Kb)
• 1987: ABI 370 sequencer
• 1995: Haemophilus influenzae (1.83 Mb)
• 2001: Homo sapiens (3.0 Gb)
• 2005 to 2014: 454, Solexa, SOLiD, Illumina, Ion Torrent, PacBio, Illumina HiSeq X, Nanopore MinION
• 2014: Pinus taeda (24 Gb)

First generation sequencing
Sanger. Annu Rev Biochem. 1988;57:1-28.
Thanks to Nick Loman for the mention

Maxam-Gilbert method (1973)
http://en.wikipedia.org/wiki/File:Maxam-Gilbert_sequencing_en.svg
https://www.nationaldiagnostics.com/electrophoresis/article/maxam-gilbert-sequencing

Sanger method (1977)
Frederick Sanger
13 Aug 1918 – 19 Nov 2013
Won the Nobel Prize for Chemistry in 1958 and
1980. Published the dideoxy chain termination
method or “Sanger method” in 1977
http://dailym.ai/1f1XeTB

Sanger method (1977)
http://en.wikipedia.org/wiki/File:Sanger-sequencing.svg
http://en.wikipedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg

First generation sequencing
• Very high quality sequences (99.999% or Q50)
• Very low throughput
Platform                          Run Time  Read Length  Reads / Run  Total nucleotides sequenced  Cost / MB
Capillary Sequencing (ABI3730xl)  20m-3h    400-900 bp   96 or 384    1.9-84 Kb                    $2400
http://www.hindawi.com/journals/bmri/2012/251364/tab1/

Next generation sequencing

https://twitter.com/kbradnam/status/443153578429923328
• Second generation
• Third generation
• Fourth generation
• Next-next-generation
• Next-next-next generation
http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2

Mention the specific technology used to generate the data
– Illumina HiSeq/MiSeq/NextSeq
– Pacific Biosciences RS/RS II
– Ion Torrent Proton/PGM
– SOLiD
– Oxford Nanopore MinION
http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2

454 Pyrosequencing
One purified DNA fragment, to one bead, to one read.
GS FLX Titanium
http://www.genengnews.com/
https://mariamuir.com/wp-content/uploads/2013/04/rip.gif

Illumina
                           MiSeq       NextSeq 500      HiSeq 2500/3000/4000       HiSeq X
Output                     0.3-15 Gb   20-120 Gb        10-1500 Gb                 900-1800 Gb
Number of reads/flow cell  25 million  130-400 million  300 million - 2.5 billion  3 billion
Read length                2x300 bp    2x150 bp         2x250 - 2x125 bp           2x150 bp
Cost                       $99K        $250K            $740K                      $10M (10 units)
$1000 human genome??
Source: Illumina

Illumina
Mardis 2008. Annu. Rev. Genomics Hum. Genet. 2008. 9:387–402

Illumina: TruSeq Long Read
Voskoboynik et al., eLife 2013;2:e00569

Pacific Biosciences SMRT sequencing
Single Molecule Real Time (SMRT) sequencing
http://smrt.med.cornell.edu/images/pacbio_library_prep-1.gif

Error correction methods
Hierarchical genome-assembly process (HGAP)
PBJelly (English et al., PLOS ONE, 2012)
PBcR pipeline

Read Lengths
http://www.igs.umaryland.edu/labs/grc/
Mean Read Length: 8391 bp
Maximum Subread Length: 24585 bp

Genome Assembly with Long Reads

Oxford Nanopore
https://www.nanoporetech.com/
http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion
http://halegrafx.com/vector-art/free-vector-despicable-me-minions/

Oxford Nanopore
https://theconversation.com/how-a-small-backpack-for-fast-genomic-sequencing-is-helping-combat-ebola-41863

Sequencing Trends
https://www.google.com/trends/

[Chart: Number of Publications, 2008-2014, for Illumina, Pacific Biosciences, Roche 454, Ion Torrent]
[Chart: Increase in Number of Publications, 2009-2014, for Illumina, Pacific Biosciences, Roche 454, Ion Torrent]
[Chart: % Increase in Number of Publications, 2009-2014, for Pacific Biosciences, Roche 454, Ion Torrent]

Hi-C Crosslinking

Others
• Ion Torrent Proton/PGM
• SOLiD
• Helicos
• Supporting technologies
– BioNano
– Nabsys
– OpGen
– 10X Genomics
– Fluidigm

Comparison

Next generation sequencing
Platform             Run Time   Read Length   Quality                      Total nucleotides sequenced  Cost/MB
454 Pyrosequencing   24h        700 bp        Q20-Q30                      1 GB                         $10
Illumina MiSeq       27h        2x300bp       >Q30                         15 GB                        $0.15
Illumina HiSeq 2500  1-10 days  2x250bp       >Q30                         3000 GB                      $0.05
Ion Torrent          2h         400bp         >Q20                         50MB-1GB                     $1
Pacific Biosciences  30m-4h     10kb - >40kb  >Q50 consensus, >Q10 single  500-1000MB/SMRT cell         $0.13-$0.60
http://www.hindawi.com/journals/bmri/2012/251364/
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431227

http://omicsmaps.com/
Next Generation Genomics:
World Map of High-throughput Sequencers

https://flxlexblog.wordpress.com/2014/06/11/developments-in-next-generation-sequencing-june-2014-edition/

Real cost of Sequencing!!
Sboner et al., Genome Biology, 2011

Sequencing Data and Concepts

Library Types
Single end
Paired end (PE, 150-800 bp, Fwd: /1, Rev: /2)
Mate pair (MP, 2 Kb to 20 Kb)
[Figure: orientation of forward (F) and reverse (R) reads for each library type; platforms labeled 454/Roche and Illumina]
Slide credit: Aureliano Bombarely

Implications of Choice of Library
[Figure: levels of assembly]
• Reads are assembled into a consensus sequence (contig)
• Pair read information links contigs into a scaffold (or supercontig), with gaps shown as NNNNN
• Genetic information (markers) or optical maps order scaffolds into a pseudomolecule (or ultracontig)

Multiplexing Libraries
Use of different tags (4-6 nucleotides) to identify
different samples in the same lane/sector.
[Figure: reads carrying the tags AGTCGT and TGAGCA are pooled for sequencing]
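After sequencing, the pooled reads have to be separated again by tag. A minimal Python sketch of that step, assuming the tag sits at the start of each read and must match exactly; the sample names, the function name demultiplex and the example reads are hypothetical, and only the two tags come from the slide:

```python
from collections import defaultdict

# Tags from the slide; the sample names are made up for illustration
BARCODES = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}


def demultiplex(reads, barcodes=BARCODES):
    """Group reads by their leading tag and strip the tag from each read."""
    tag_len = len(next(iter(barcodes)))  # assumes all tags have equal length
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        # Reads with an unknown tag go into a separate bin
        bins[barcodes.get(tag, "unassigned")].append(insert)
    return dict(bins)


reads = ["AGTCGTACGTTT", "TGAGCAGGGCCC", "AGTCGTTTTAAA", "NNNNNNACGTAC"]
for sample, inserts in demultiplex(reads).items():
    print(sample, inserts)
```

Real demultiplexing tools usually also tolerate a mismatch in the tag; this sketch does not.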

File Formats
Fasta files:
It is a text-based format for representing either nucleotide sequences or peptide
sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia
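A minimal Python sketch of reading FASTA text, assuming the usual convention that each record starts with a ">" header line followed by one or more sequence lines. The function name parse_fasta and the example records are made up for illustration:

```python
def parse_fasta(text):
    """Return a dict mapping each record ID to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first word after ">" as the ID
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence may be wrapped over several lines; join them
            records[current_id] += line
    return records


example = """>seq1 hypothetical nucleotide record
ACGTACGT
TTGA
>seq2 hypothetical peptide record
MKVL
"""

for seq_id, seq in parse_fasta(example).items():
    print(seq_id, len(seq), seq)
```

For real work, a tested library such as BioPerl or Biopython is a safer choice than a hand-written parser.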

File Formats
Fastq files:
FASTQ format is a text-based format for storing both a biological sequence (usually
nucleotide sequence) and its corresponding quality scores.
-Wikipedia
• Single-line ID with an at symbol (“@”) in the first column.
• Sequences can span multiple lines after the ID line.
• Single line with a plus symbol (“+”) in the first column, marking the start of the quality values.
• The “+” line may repeat the ID.
• Quality values can span multiple lines after the “+” line, but their total length is identical to the sequence length.
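A minimal Python sketch of reading FASTQ text. It assumes the common four-line layout (ID, sequence, "+", quality), so it does not handle sequences or qualities wrapped over several lines. The function name parse_fastq and the example reads are made up for illustration:

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality) tuples from four-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


example_fastq = """@read1/1
ACGTAC
+
IIIIII
@read2/1
TTGACA
+read2/1
!!IIII
"""

for read_id, seq, qual in parse_fastq(example_fastq):
    print(read_id, seq, qual)
```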

Quality control: Encoding
Fastq files:
!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)
KLMNOPQRSTUVWXYZ[]^_`abcdefgh Offset by 64 (Phred+64)
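Each quality character stands for a number: its ASCII code minus the offset. A minimal Python sketch (the function name decode_quality is made up for illustration):

```python
def decode_quality(quality_string, offset=33):
    """Convert FASTQ quality characters to integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


print(decode_quality("!5I"))             # Phred+33 -> [0, 20, 40]
print(decode_quality("KTh", offset=64))  # Phred+64 -> [11, 20, 40]
```

Decoding with the wrong offset shifts every score by 31, so it helps to check which encoding a file uses before filtering on quality.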

http://en.wikipedia.org/wiki/Phred_quality_score
Phred score of a base is:
Qphred = -10 × log10(e)
where e is the estimated probability of a base
being wrong
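The formula can be checked in both directions with a few lines of Python. This is a minimal sketch and the function names are made up for illustration:

```python
import math


def phred_from_error(e):
    """Phred score for an estimated error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


def error_from_phred(q):
    """Estimated error probability for a Phred score q."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 50):
    e = error_from_phred(q)
    print(f"Q{q}: error probability {e:.5f}, accuracy {100 * (1 - e):.3f}%")
```

Q50 corresponds to one error in 100,000 bases, which is the 99.999% accuracy quoted for capillary sequencing.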

Pre-processing: Tools
Trimming
• FastQC
• FASTX toolkit
• Trimmomatic
• Scythe
Joining paired-end reads
• fastq-join
• FLASH
• PANDAseq
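To show the idea behind quality trimming, here is a minimal Python sketch that cuts low-quality bases from the 3' end of a read. It is not how any of the tools listed above implement trimming; the function name, the threshold of 20 and the example read are assumptions for illustration:

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end until the last base has quality >= min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: the last two bases have Phred scores 2 and 0
seq, qual = trim_3prime("ACGTACGT", "IIIII5#!")
print(seq, qual)
```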

Sequencing done! Now What??
• 1 HiSeq run can produce up to 1500GB or 1.5TB of data
• How much is 250GB of data?
– 250,000,000,000 characters
– 3000 characters per sheet
– 100 sheets / cm
– Stack of ~8000m
Mount Everest - 8848m
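The arithmetic behind the paper stack, as a minimal Python sketch using only the numbers given above (the function name is made up for illustration):

```python
def stack_height_m(n_chars, chars_per_sheet=3000, sheets_per_cm=100):
    """Height in metres of a paper stack holding n_chars characters."""
    sheets = n_chars / chars_per_sheet
    height_cm = sheets / sheets_per_cm
    return height_cm / 100  # 100 cm per metre


height = stack_height_m(250_000_000_000)
print(f"Stack of about {height:,.0f} m, compared with 8848 m for Mount Everest")
```

The same function applied to a full 1.5TB run gives a stack six times as tall.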

Increase in Sequencing Data
L. Stein, Genome Biology, 2010
Slide credit: Lukas Mueller

Big Data

High Performance Computing
Powerful servers with large amounts of memory, compute cores, and disk

What is bioinformatics?
• Bioinformatics /baɪ.oʊˌɪnfərˈmætɪks/ is the application of computer science and
information technology to the field of biology and medicine.

Bioinformatics deals with
• Algorithms, databases and information systems, web technologies, artificial
intelligence and soft computing, information and computation theory, software
engineering, data mining, image processing, modeling and simulation, signal
processing, discrete mathematics, control and system theory, circuit theory, and statistics.
• Generation of new knowledge in biology and medicine, and improving and discovering
new models of computation (e.g. DNA computing, neural computing, evolutionary
computing, immuno-computing, swarm-computing, cellular-computing).

Bioinformatics can...
• Identify similar sequences
• Provide a putative function for a sequence
• Assemble sequences (genomes, transcriptomes)
• Annotate genomes
• Identify differentially expressed genes
• Build networks of genes or metabolites
• Determine phylogenetic relationships
• Mine literature for biological information
• Uncover differences between two genomes
• Calculate how a protein folds

What can bioinformatics do for me?
• Majority of projects involve large datasets
• Speed up your research
• Enable you to ask new questions
• Basic knowledge of bioinformatics needed
• Extract information
• Transform information
• Run analyses
• Build hypotheses, etc.

Linux
• UNIX-like, free and open source operating system
• Very stable, easy to use
• Created by Linus Torvalds in 1991 as a student
• Adopted for most bioinformatics work
• Also: installed on cell phones, laptops, desktops, clusters, supercomputers
• Can run on your computer!
• Virtualized or native
http://www.linux-netbook.com/linux/distributions/

Further Reading
Plant Bioinformatics Course
• Virtual machine setup instructions
• Slides for Linux, Sequencing, RNAseq, NGS Read Mapping and R graphics
• http://btiplantbioinfocourse.wordpress.com

Scripting
• Scripts: Small programs written by the end-user that control the execution of other
programs or perform a simple algorithm
• Extremely flexible
• Written in Shell, Perl, Python
• You can write them yourself!!!
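A minimal Python sketch of both kinds of script: one function runs another program and collects its output, the other performs a simple algorithm. The function names are made up for illustration, and the Python interpreter itself stands in for an external tool so that the example runs anywhere:

```python
import subprocess
import sys


def run_program(args):
    """Run another program and return what it printed to standard output."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def gc_content(seq):
    """Fraction of G and C bases in a sequence (a simple algorithm)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Stand-in for an external tool: ask the Python interpreter for its version
print(run_program([sys.executable, "--version"]))
print(gc_content("ACGTGGCC"))
```

In a real pipeline the argument list would name an installed tool and its options instead of the interpreter.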

Perl
• Developed by Larry Wall since the 1980s
• Useful for bioinformatics and web development
• Support for objects
• Excellent integration of regular expressions (text handling language)
• Vast open source code library (http://cpan.org/)
• BioPerl (http://bioperl.org/)
• Easy to learn
• http://www.perl.org/

Python
• Created by Guido van Rossum in 1989
• Very elegant language
• Biopython libraries
• The “new” popular language
• Many frameworks (Django for web etc.)

R
• Language designed for statistics
• Support for matrix calculations, graphics
• Expression analysis, Next-Gen sequence analysis, graphics, genome annotation
statistics, phylogeny
• Interactive

Databases
• Need to store and query data
• Biological data is highly structured
• Relational database systems
• Non-relational systems
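A minimal Python sketch of storing and querying data in a relational database, using the sqlite3 module from the standard library. The table layout, the function name and the sample records are hypothetical, loosely following the survey design (locations and phenotypes) from the start of the talk:

```python
import sqlite3


def infected_counts(samples):
    """Store samples in an in-memory table and count infected plants per location."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sample (id INTEGER PRIMARY KEY, location TEXT, phenotype TEXT)"
    )
    conn.executemany(
        "INSERT INTO sample (location, phenotype) VALUES (?, ?)", samples
    )
    rows = conn.execute(
        "SELECT location, COUNT(*) FROM sample "
        "WHERE phenotype = ? GROUP BY location ORDER BY location",
        ("infected",),
    ).fetchall()
    conn.close()
    return rows


# Made-up survey records: (location, phenotype)
samples = [
    ("field_1", "infected"),
    ("field_1", "healthy"),
    ("field_2", "infected"),
    ("field_2", "infected"),
]
print(infected_counts(samples))
```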

Thank you!!
