1. Next-Generation Sequence Analysis
for Biomedical Applications
BIOC 4010/5010
Lecture 1
Dr. Dan Gaston
Postdoctoral Fellow Department of Pathology
Dr. Karen Bedard Lab
3. Overview: Lecture 1
• Why Next-Gen Sequencing Matters
• What is Next-Gen Sequencing
• Bioinformatics Workflows
• Types of Next-Gen Experiments
• Working with the Human Genome
• Slides available on slideshare:
– http://www.slideshare.net/DanGaston
12. Major Areas in Human Disease
Genomics
• Complex Disease
– Genome Wide Association Studies (GWAS)
• Mendelian Disease
– Whole Genome/Exome Sequencing
– Transcriptomics
– Genetic Linkage – Sanger Sequencing
• Cancer
– Tumour Genomics
– Transcriptomics
13. Traditional Diagnosis of Genetic
Disease
• Genetic Counselors/Physicians order
individual testing of genes based on patient
phenotype
• For rare diseases or unusual phenotypes may
run tens to hundreds of tests
• …..EXPENSIVE (Easily thousands of dollars)
14. Next Generation Diagnosis of Genetic
Disease
• NGS-Based Targeted Sequencing Panels
• Clinical Exome
• Clinical Genome
20. Human Genomics: More Power!
• $5,000 - $10,000 to sequence whole genome
– Dropping towards $1000 for sequencing only
• ~$1000 to sequence only protein-coding
portion (exome, later)
21. Clinical Genomics
• Rapid diagnosis of genetic disease in NICU cases
• Quicker and cheaper than sequential genetic
testing (traditional method)
24. Personalized Medicine: Oncology
[Figure: workflow. Tumour Sample DNA + Non-Tumour Sample DNA -> Sequence -> Tumour-Specific Mutations -> Tumour Classification (using Databases and Annotations) -> Drugs]
34. FastQ Quality Scores
Quality Score (Q)   Probability of incorrect base call   Base call accuracy
10                  1 in 10                              90%
20                  1 in 100                             99%
30                  1 in 1,000                           99.9%
40                  1 in 10,000                          99.99%
50                  1 in 100,000                         99.999%
Q = -10 log10(P), where P is the probability of an incorrect base call
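The conversion between Q and P can be sketched in Python as below. The function names are illustrative, and the ASCII offset of 33 (the common Phred+33 FastQ encoding) is an assumption not stated on the slide.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with quality score q is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Quality score for an error probability p: Q = -10 log10(P)."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual: str, offset: int = 33) -> list[int]:
    """Decode a FastQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - offset for char in qual]


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.3f}%")
```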
36. General Genomics Workflow
• Raw Data Analysis: quality control of raw data
• Whole Genome Mapping: alignment to reference genome
• Variant Calling: detection of genetic variation (SNPs, Indels, SV)
• Annotation: linking variants to biological information
42. Find the Location of Each Read in the
Genome
• Problems:
– Short sequence
– Millions of short sequences
– Big genome
– Mismatches
• Polymorphisms
• Sequencing errors
– Insertions and deletions
– May be processing many (100s of) individuals
44. Short Read Mapping: Brute Force
Method (Stupid)
Simple conceptually: compare each query k-mer to all k-mers of the genome
Scales poorly: run time grows with both the size of the genome and the number of reads
Genome = AGCATGCTGCAGTCATGCTTAGGCTA
Read = GCT
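A minimal sketch of the brute-force approach, using the genome and read from this slide. The function name is illustrative; positions are reported 0-based.

```python
def brute_force_find(genome: str, read: str) -> list[int]:
    """Compare the read against every k-mer of the genome, left to right."""
    k = len(read)
    hits = []
    for start in range(len(genome) - k + 1):
        if genome[start:start + k] == read:
            hits.append(start)
    return hits


if __name__ == "__main__":
    genome = "AGCATGCTGCAGTCATGCTTAGGCTA"
    print(brute_force_find(genome, "GCT"))  # [5, 16, 22]
```

Every read repeats the full scan of the genome, which is why this does not scale to millions of reads against a large genome.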
45. Solution
Index the Reference Genome
Indexing the reference is like constructing a phone
book: you can move quickly to the relevant portion of the
genome and ignore the rest.
46. Suffix Array
List every suffix of the genome (each substring that runs to the end) and sort
the suffixes alphabetically
Allows query to be searched against an alphabetical
reference, skipping 96% of the genome
Ex: banana$
Suffixes    Sorted
banana$     $
anana$      a$
nana$       ana$
ana$        anana$
na$         banana$
a$          na$
$           nana$
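A minimal sketch of suffix array construction for the banana$ example. This naive version sorts the suffixes directly, which is fine for a toy string but far too slow and memory-hungry for a real genome; the function name is illustrative.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda start: text[start:])


if __name__ == "__main__":
    text = "banana$"
    for start in build_suffix_array(text):
        print(start, text[start:])
    # "$" sorts before the letters, so the order of start positions is
    # 6, 5, 3, 1, 0, 4, 2
```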
47. Short Read Alignment: Binary Search
• Searching the index efficiently is still a
problem…
Index #   Sequence   Pos
1 ACAGATTACC… 6
2 ACC… 13
3 AGATTACC… 8
4 ATTACAGATTACC… 3
5 ATTACC… 10
6 C… 15
7 CAGATTACC… 7
8 CC… 14
9 GATTACAGATTACC… 2
10 GATTACC… 9
11 TACAGATTACC… 5
12 TACC… 12
13 TGATTACAGATTACC… 1
14 TTACAGATTACC… 4
15 TTACC… 11
Search for GATTACA…
52. Binary Search
• Initialize search range to entire list
– mid = (hi+lo)/2; middle = suffix[mid]
– if query matches middle: done
– else if query < middle: pick low range
– else if query > middle: pick hi range
• Repeat until done or empty range
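The steps above can be sketched as a binary search over a suffix array. The genome string here is a toy stand-in for the truncated example in the preceding table, and the function names are illustrative. Two searches are used so that every occurrence of the read is reported, not just the first one found. Positions are 0-based, so the hit at 1 corresponds to Pos 2 in the table.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda start: text[start:])


def find_read(text: str, suffix_array: list[int], read: str) -> list[int]:
    """Return the 0-based positions where read occurs in text."""
    k = len(read)

    def prefix(rank: int) -> str:
        start = suffix_array[rank]
        return text[start:start + k]

    # First rank whose k-length prefix is >= read.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < read:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose k-length prefix is > read.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= read:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    genome = "TGATTACAGATTACC"  # toy stand-in, assumed for illustration
    sa = build_suffix_array(genome)
    print(find_read(genome, sa, "GATTACA"))  # [1]
```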
53. Applied to Human Genome
• In practice simple methods of indexing the
genome can create very large data structures
– Suffix Array: > 12 GB
• Solution: Apply complex procedures that allow
you to index and compress the data:
– Burrows-Wheeler Transform
– FM-Index
55. Burrows-Wheeler Transform
• Similar in many ways to creation of Suffix
Array
Input: BANANA$
Circular permutations (rotations):
BANANA$
ANANA$B
NANA$BA
ANA$BAN
NA$BANA
A$BANAN
$BANANA
56. Burrows-Wheeler Transform
• Similar in many ways to creation of Suffix
Array
Input: BANANA$
Rotations    Lexicographical sort
BANANA$      $BANANA
ANANA$B      A$BANAN
NANA$BA      ANA$BAN
ANA$BAN      ANANA$B
NA$BANA      BANANA$
A$BANAN      NA$BANA
$BANANA      NANA$BA
57. Burrows-Wheeler Transform
• Similar in many ways to creation of Suffix
Array
Input: BANANA$
Burrows-Wheeler Matrix (sorted rotations):
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA
59. Burrows-Wheeler Transform
• Similar in many ways to creation of Suffix
Array
Input: BANANA$
Burrows-Wheeler Matrix:
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA
T(string) = ANNB$AA (last column of the matrix)
Transformed string: compressible and reversible
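A minimal sketch of the transform and its inverse for the BANANA$ example. Both functions are naive, build the full matrix in memory, and are only meant to illustrate the idea on a toy string; the names are illustrative, and "$" is assumed to appear exactly once as the end marker.

```python
def bwt(text: str) -> str:
    """Sort all rotations of text and return the last column."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str) -> str:
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the row that ends with the "$" marker.
    return next(row for row in table if row.endswith("$"))


if __name__ == "__main__":
    transformed = bwt("BANANA$")
    print(transformed)               # ANNB$AA
    print(inverse_bwt(transformed))  # BANANA$
```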
60. Burrows-Wheeler Transform
• Similar in many ways to creation of Suffix
Array
Input: BANANA$
T(string) = ANNB$AA
Sorted rotation   Suffix array (0-based start position)
$BANANA           6
A$BANAN           5
ANA$BAN           3
ANANA$B           1
BANANA$           0
NA$BANA           4
NANA$BA           2
61. Burrows-Wheeler Transform
• Similar in many ways to creation of Suffix
Array
Input: BANANA$
FM-Index = T(string) + suffix array + character count tables
T(string) = ANNB$AA
Suffix array = 6, 5, 3, 1, 0, 4, 2
Sorted rotation   Suffix array
$BANANA           6
A$BANAN           5
ANA$BAN           3
ANANA$B           1
BANANA$           0
NA$BANA           4
NANA$BA           2
Character Count Tables
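A minimal sketch of how the three pieces combine to find a read, using the standard backward-search procedure. The uncompressed tables and the function names are simplifications for illustration; real aligners sample and compress these structures.

```python
from collections import Counter


def build_fm_index(text: str):
    """Return (suffix array, C table, occurrence table) for text ending in '$'."""
    suffix_array = sorted(range(len(text)), key=lambda start: text[start:])
    transformed = "".join(text[start - 1] for start in suffix_array)

    # c_table[ch] = number of characters in text strictly smaller than ch.
    counts = Counter(text)
    c_table, running = {}, 0
    for ch in sorted(counts):
        c_table[ch] = running
        running += counts[ch]

    # occ[ch][i] = number of times ch appears in transformed[:i].
    occ = {ch: [0] * (len(transformed) + 1) for ch in counts}
    for i, seen in enumerate(transformed):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == seen else 0)
    return suffix_array, c_table, occ


def backward_search(read: str, suffix_array, c_table, occ) -> list[int]:
    """Match the read from its last character to its first."""
    lo, hi = 0, len(suffix_array)
    for ch in reversed(read):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


if __name__ == "__main__":
    sa, c_table, occ = build_fm_index("BANANA$")
    print(backward_search("ANA", sa, c_table, occ))  # [1, 3]
```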
62. Short Read Aligners
• BLAT: BLAST-Like Alignment Tool
• MAQ: First to take into account quality scores
• Bowtie: One of the first to use BWT, ungapped
alignment only
• BWA: One of the first to use BWT. First gapped
BWT, incredibly fast and memory efficient
• Bowtie2: Allows indels
• SOAP, SOAP2: Also use BWT
• … and many more
65. Transcriptomics: RNA-Seq
• Sequence the actively transcribed genes in a
cell line or tissue
– Only about 20% of genes are transcribed in
particular cell types
• Two types:
– Poly-A selection
– Total RNA + ribodepletion
• Many experimental questions can be
addressed
70. RNA-Seq
• Important to take into account biological
variability. A sample of cells is a mixed population
– Replicates!
• Not suited for discovering polymorphisms due to
higher error rates introduced by reverse
transcription step (RNA -> cDNA)
• High false positive rates for fusion gene and
novel exon discovery when expression levels are low
80. Genetic Variation
• dbSNP (NCBI) build 142
– Catalogs Single Nucleotide Variants (SNV)
– 365 Million Submitted
– 113 Million Validated
– 54 Million in Genes
– 36 Million With Frequency in Populations
• 50-80% of mutations involved in inherited
disease caused by SNVs
– May be an overestimate due to lack of knowledge
81. SNP vs SNV
• Technically a polymorphism is a variation that
doesn’t cause disease and is common in a
population
• What is common?
– Greater than 5% in a population is a typical
definition
– Definitions of rare range from < 0.1% to < 1.0%
83. Frequency of Polymorphisms:
Common vs Rare
• Mendelian disorders are caused by rare
variation, < 1% frequency in the relevant
population
• Leverage large projects aimed at assessing
genetic diversity in populations around the
world
85. Exome Sequencing Project
• Multi-Institutional
• Total possible patient pool of > 250,000
individuals, well phenotyped
– Includes healthy individuals and diseased
• Currently 6700 exomes sequenced
– 4420 European descent
– 2312 African American
• 1.2 million coding variations
– Most extremely rare/unique
– Many population specific
87. Other Resources and Projects
• Exome Aggregation Consortium: 60,000
Exomes
• Personal Genome Project (Ongoing)
• 100,000 Genomes Project (UK, Ongoing)
• BGI (Announced, China): 1 Million Genomes
• Precision Medicine Initiative (US, Announced):
1 Million Genomes
91. Population Matters
• Most variations in protein-coding genes occurred
fairly recently (last 20,000 years)
– Adaptation to agriculture and diet changes, pathogen
exposure and urban living
• Monogenic diseases have different prevalence in
different populations
– Cystic fibrosis in European population
– Hereditary Hemochromatosis in Northern Europeans
– Tay-Sachs in Ashkenazi Jews
– Sickle-Cell Anemia in Sub-Saharan African populations
95. Genotype Calling: Determining the
Type of Needle, The Absurdly Simple
Way (Stupid)
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
CGGTGAACGTTATCGACGATCCGATCGAACTGTCAGC
GGTGAACGTTATCGACGTTCCGATCGAACTGTCAGCG
TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
GTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG
TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
TCGACGATCCGATCGAACTGTCAGCGGCAAGCTGAT
TTCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
reference genome
Read depth at base: 10 T: 4 A: 6
Genotype: Heterozygous A/T
96. Genotype Calling: The Absurdly Simple
Way (Stupid)
• Doesn’t account for sequencing error
• Doesn’t account for sequencing bias
• Doesn’t account for bias in short-read mapping
process
• Doesn’t account for mapping error
• Doesn’t consider any external source of
information regarding populations or known
genetic variations
97. Genotype Calling: The Absurdly Simple
Way (Slightly less Stupid)
• Algorithm:
– Count all aligned bases that pass quality threshold
(e.g. >Q20)
– If fraction of reads with alternative base > lower bound (20%)
and < upper bound (80%) call heterozygous
– Else if > upper bound call homozygous alternative
– Else call homozygous reference
• …But what about using base qualities for more than
just filtering reads?
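The threshold algorithm above can be sketched as follows. The function name, the return labels, and the choice to treat a fraction exactly equal to the upper bound as heterozygous are assumptions for illustration.

```python
from typing import Optional, Sequence


def call_genotype(
    bases: str,
    quals: Sequence[int],
    ref: str,
    min_qual: int = 20,
    lower: float = 0.2,
    upper: float = 0.8,
) -> Optional[str]:
    """Call a genotype at one site from the aligned bases and their qualities."""
    # Keep only bases that pass the quality threshold (e.g. > Q20).
    kept = [base for base, qual in zip(bases, quals) if qual > min_qual]
    if not kept:
        return None  # no usable coverage at this site

    alt_fraction = sum(base != ref for base in kept) / len(kept)
    if alt_fraction > upper:
        return "homozygous alternative"
    if alt_fraction > lower:
        return "heterozygous"
    return "homozygous reference"


if __name__ == "__main__":
    # Counts from the earlier example: depth 10, A: 6, T: 4, reference A.
    print(call_genotype("AAAAAATTTT", [30] * 10, ref="A"))  # heterozygous
```

Note that the qualities are only used as a pass/fail filter here, which is the limitation the last bullet points at.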
104. What’s Missing
• No estimate of the confidence (stats) of
variant and genotype calls
• Doesn’t account robustly for known sources of
error
• Doesn’t make use of any sources of external
information
• Doesn’t include base qualities
107. Improved Genotype Calling: Prior
Probability
• Known Polymorphic Site?
– Allele Frequencies
• Global rate of polymorphisms
• Other samples
• Substitution Type
108. Substitution Type
• Transition:
– Purine to Purine (A to G)
– Pyrimidine to Pyrimidine (C to T)
• Transversion:
– Purine to Pyrimidine, or Pyrimidine to Purine
• Transition/Transversion ratio
– Transitions 2x as common (Genome Wide)
– ~3x when looking only at exons
– Random Error: 0.5
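A minimal sketch of classifying substitutions and computing the ratio. Names are illustrative. Enumerating all 12 possible substitutions shows where the random-error value of 0.5 comes from: 4 of them are transitions and 8 are transversions.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or transversion."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def ti_tv_ratio(substitutions) -> float:
    """Ratio of transitions to transversions in a list of (ref, alt) pairs."""
    substitutions = list(substitutions)
    transitions = sum(
        substitution_type(ref, alt) == "transition" for ref, alt in substitutions
    )
    transversions = len(substitutions) - transitions
    return transitions / transversions  # undefined if there are no transversions


if __name__ == "__main__":
    every_substitution = list(permutations("ACGT", 2))  # 12 (ref, alt) pairs
    print(ti_tv_ratio(every_substitution))  # 0.5
```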
110. Prior Probability Example
Genotype prior probabilities (row allele / column allele, upper triangle only):
     A            C            G            T
A    3.33x10^-4   1.11x10^-7   6.67x10^-4   1.11x10^-7
C                 8.33x10^-5   1.67x10^-4   2.78x10^-8
G                              0.9985       1.67x10^-4
T                                           8.33x10^-5
Assume:
Heterozygous SNP Rate of 0.001
Homozygous SNP Rate of 0.0005
Reference: G
Transition/Transversion Ratio: 2
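One way to reproduce the numbers in the table from the stated assumptions is sketched below. The rule used for genotypes with two different non-reference alleles (the product of two heterozygous terms) is inferred from the table values rather than stated on the slide, and all names are illustrative.

```python
from itertools import combinations_with_replacement

TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}


def allele_weight(ref: str, alt: str, ti_tv: float = 2.0) -> float:
    """Share of substitutions away from ref that produce alt.

    Each base has one transition partner and two transversion partners.
    """
    if TRANSITION_PARTNER[ref] == alt:
        return ti_tv / (ti_tv + 1)
    return 1 / (2 * (ti_tv + 1))


def genotype_priors(ref, het_rate=0.001, hom_rate=0.0005, ti_tv=2.0):
    """Prior probability of each unordered genotype at a site with known ref."""
    priors = {}
    for a, b in combinations_with_replacement("ACGT", 2):
        if a == ref and b == ref:
            prior = 1 - het_rate - hom_rate
        elif a == b:
            prior = hom_rate * allele_weight(ref, a, ti_tv)
        elif ref in (a, b):
            alt = b if a == ref else a
            prior = het_rate * allele_weight(ref, alt, ti_tv)
        else:
            # Two different non-reference alleles (assumed rule, see lead-in).
            prior = (het_rate * allele_weight(ref, a, ti_tv)
                     * het_rate * allele_weight(ref, b, ti_tv))
        priors[a + b] = prior
    return priors


if __name__ == "__main__":
    for genotype, prior in genotype_priors("G").items():
        print(genotype, f"{prior:.3g}")
```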
111. Improved Genotype Calling: Error
Rates
Miscalled bases: share (%) called as each other base
Actual Base   Predicted A   Predicted C   Predicted G   Predicted T
A             -             57.7          17.1          25.2
C             34.9          -             11.3          53.9
G             31.9          5.1           -             63.0
T             45.9          22.1          32.0          -
If a base was miscalled, what is it most likely to be called
as instead?
116. Decisions and Trade-Offs
• Option 1: Use stringent program options for
calling variants and hard filtering early to produce
only highly-confident call set.
– Pro: Few false positives
– Con: Will miss real variants
• Option 2: Use less stringent (but reasonable)
options and filtering. Produce high-confidence
call set. Progressive filtering at later stage
– Pro: Won’t miss real variants
– Con: Many more false positives
118. How Good Are My Calls?
• How many called SNPs?
– Human average of 1 heterozygous SNP / 1000
bases
• Fraction of variants already in dbSNP
– ~90%
• Transition/Transversion ratio
– Transitions 2x as common
• 3x when looking only at exons
121. Discovering Genetic Variants Causing
Mendelian Disease
4 million genetic variants
-> 2 million associated with protein-coding genes
-> 10,000 possibly of disease-causing type
-> 1500 at <1% frequency in population
-> Single causal genetic variant
122. If a problem cannot be
solved, enlarge it.
--Dwight D. Eisenhower
Supreme Commander Allied Forces:
Second World War
34th President USA
132. Example: SIFT Algorithm
Input Query Sequence -> PSI-BLAST -> Homologs -> Alignment -> Multiple Sequence Alignment -> PSSM -> Normalize by most frequent AA -> Score
133. Prediction Take-Away
The more conserved a site is the more likely
any substitution is to be deleterious
However: Current methods have pretty poor
performance, not suitable for clinical-level
diagnosis
135. Classifying Genetic Variants
[Figure: tree classifying 4 million variants]
• Intergenic
• Intronic
– Unknown
– Splice Site: Potential Disease Causing
• Exonic
– Amino Acid Changing: Known Genetic Disease Variant, Stop Loss / Stop Gain, Missense Mutation, Known Polymorphism in Population
– Silent Mutation
– Splice Site: Potential Disease Causing
138. Annotating Genes and Variants
• Is variant in a known protein-coding gene?
– What does the gene do?
– What molecular pathways?
– What protein-protein interactions?
– What tissues is it expressed in?
– When in development?
4 million genetic variants
-> 2 million associated with protein-coding genes
-> 10,000 possibly of disease-causing type
-> 1500 at <1% frequency in population
145. IGNITE Project: Local Controls
• IGNITE: Tasked with studying rare monogenic
diseases identified in Atlantic Canada
• Atlantic Canada harbours several non-represented
population groups and sub-groups…
– Acadians
– Native American
– Non-Acadian/European Descent
146. Population Frequency
• Mendelian disorders are rare
• If variation is in database, is it associated with
disease?
• Causal variation also needs to be rare
– Cutoff somewhere between < 0.1% and < 1%
– Should appear rarely or not at all in local controls
– Track with disease in family members under study
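The three criteria above can be sketched as a simple filter. The Variant record, its field names, and the default thresholds are hypothetical and only illustrate the logic; they do not describe any particular annotation tool.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    population_freq: float         # allele frequency in reference databases
    local_control_count: int       # times seen in local control samples
    segregates_with_disease: bool  # tracks with disease in the family


def candidate_variants(variants, max_freq=0.01, max_local_controls=0):
    """Keep variants that are rare, absent from controls, and segregate."""
    return [
        variant
        for variant in variants
        if variant.population_freq < max_freq
        and variant.local_control_count <= max_local_controls
        and variant.segregates_with_disease
    ]


if __name__ == "__main__":
    variants = [
        Variant("v1", 0.0005, 0, True),
        Variant("v2", 0.05, 0, True),    # too common
        Variant("v3", 0.0001, 3, True),  # seen in local controls
        Variant("v4", 0.0, 0, False),    # does not track with disease
    ]
    print([variant.name for variant in candidate_variants(variants)])  # ['v1']
```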
150. Brain Calcification
• 84 genes in chromosome 5 region
• No likely homozygous or compound heterozygous
variants within region shared between two
patients
• 29 genes with at least one targeted region with
little or no sequencing coverage
• Many only lacked coverage in 5’ and 3’ UTRs
• Collaborators performed statistical tests for
possible copy-number variations of targeted
regions using exome sequencing data