Next-Generation Sequence Analysis
for Biomedical Applications
BIOC 4010/5010
Lecture 1
Dr. Dan Gaston
Postdoctoral Fellow Department of Pathology
Dr. Karen Bedard Lab
LECTURE 1
Introduction to Next-Gen Sequencing
Overview: Lecture 1
• Why Next-Gen Sequencing Matters
• What is Next-Gen Sequencing
• Bioinformatics Workflows
• Types of Next-Gen Experiments
• Working with the Human Genome
• Slides available on slideshare:
– http://www.slideshare.net/DanGaston
Personalized Medicine
Major Areas in Human Disease
Genomics
• Complex Disease
– Genome Wide Association Studies (GWAS)
• Mendelian Disease
– Whole Genome/Exome Sequencing
– Transcriptomics
– Genetic Linkage – Sanger Sequencing
• Cancer
– Tumour Genomics
– Transcriptomics
Traditional Diagnosis of Genetic
Disease
• Genetic Counselors/Physicians order
individual testing of genes based on patient
phenotype
• For rare diseases or unusual phenotypes, physicians may
order tens to hundreds of tests
• …EXPENSIVE (easily thousands of dollars)
Next Generation Diagnosis of Genetic
Disease
• NGS-Based Targeted Sequencing Panels
• Clinical Exome
• Clinical Genome
Genetic Disease Research: The Slow and
Traditional Way in the Dark Ages (circa 2009)
Genetic Disease Research: Charcot-Marie-Tooth
Chromosome 9:
120,962,282-133,033,431
Charcot-Marie-Tooth
• Linked Genomic Region ~13Mb in size
• Contains 143 Genes
• Prioritize and select genes for individual
Sanger sequencing
• …Slow
• …Laborious
• …Can be expensive
Human Genomics: More Power!
• $5,000 - $10,000 to sequence whole genome
– Dropping towards $1000 for sequencing only
• ~$1000 to sequence only protein-coding
portion (exome, later)
Clinical Genomics
• Rapid diagnosis of genetic disease in NICU cases
• Quicker and cheaper than sequential genetic
testing (traditional method)
Personalized Medicine: Oncology
[Figure: workflow from tumour sample DNA and non-tumour sample DNA, through sequencing and comparison against databases and annotations, to tumour-specific mutations, tumour classification, and drugs]
Personalized Medicine: Oncology
Welch JS, et al. JAMA, 2011;305, 1577
Personalized Medicine: Monitoring For
Cancer Chemotherapy Resistance
Composition of Human Genome
Size: 3.2 Gb
Genomic Content
Chromosome  Base pairs   Variations  Confirmed proteins  Putative proteins  Pseudogenes  miRNA  rRNA  Misc ncRNA
1           249,250,621  4,401,091   2,012               31                 1,130        134    66    106
2           243,199,373  4,607,702   1,203               50                 948          115    40    93
3           198,022,430  3,894,345   1,040               25                 719          99     29    77
4           191,154,276  3,673,892   718                 39                 698          92     24    71
5           180,915,260  3,436,667   849                 24                 676          83     25    68
6           171,115,067  3,360,890   1,002               39                 731          81     26    67
7           159,138,663  3,045,992   866                 34                 803          90     24    70
8           146,364,022  2,890,692   659                 39                 568          80     28    42
9           141,213,431  2,581,827   785                 15                 714          69     19    55
10          135,534,747  2,609,802   745                 18                 500          64     32    56
11          135,006,516  2,607,254   1,258               48                 775          63     24    53
12          133,851,895  2,482,194   1,003               47                 582          72     27    69
13          115,169,878  1,814,242   318                 8                  323          42     16    36
14          107,349,540  1,712,799   601                 50                 472          92     10    46
15          102,531,392  1,577,346   562                 43                 473          78     13    39
16          90,354,753   1,747,136   805                 65                 429          52     32    34
17          81,195,210   1,491,841   1,158               44                 300          61     15    46
18          78,077,248   1,448,602   268                 20                 59           32     13    25
19          59,128,983   1,171,356   1,399               26                 181          110    13    15
20          63,025,520   1,206,753   533                 13                 213          57     15    34
21          48,129,895   787,784     225                 8                  150          16     5     8
22          51,304,566   745,778     431                 21                 308          31     5     23
X           155,270,560  2,174,952   815                 23                 780          128    22    52
Y           59,373,566   286,812     45                  8                  327          15     7     2
mtDNA       16,569       929         13                  0                  0            0      2     22
Exome and Genome Sequencing
Short Reads
Millions of “short reads”, 75-150 bp each
Usually “paired”
FastQ Format
Read ID (line begins with @)
Sequence
Separator line (+)
Quality line
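As a rough illustration of the format, a FASTQ record spans four lines: the read ID (prefixed with @), the sequence, a separator line starting with +, and the quality line. The sketch below reads records from any iterable of lines. The names `FastqRecord` and `parse_fastq` and the sample record are made up for this example and do not come from the lecture.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        # A truncated file yields empty strings here and fails the checks below.
        sequence = next(it, "").rstrip("\n")
        separator = next(it, "").rstrip("\n")
        quality = next(it, "").rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up example record (hypothetical, not lecture data).
example = ["@read1\n", "GATTACA\n", "+\n", "IIIIIII\n"]
records = list(parse_fastq(example))
print(records)
```

Each quality character corresponds to the base at the same position in the sequence, which is why the two lengths are checked against each other.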
FastQ Quality Scores
Quality Score (Q)   Probability of incorrect base call   Base call accuracy
10                  1 in 10                              90%
20                  1 in 100                             99%
30                  1 in 1,000                           99.9%
40                  1 in 10,000                          99.99%
50                  1 in 100,000                         99.999%
Q = -10 log10(P), where P is the probability of an incorrect base call
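The relationship Q = -10 log10 P can be checked with a few lines of code. This is a minimal sketch: the function names are illustrative, and `decode_quality` assumes the common Phred+33 ASCII encoding of the quality line, which the slides do not specify.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P) to get P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality_line: str, offset: int = 33) -> List[int]:
    """Decode a quality string, assuming Phred+33 encoding (an assumption)."""
    return [ord(ch) - offset for ch in quality_line]


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.3f}%")
```

Under the Phred+33 assumption, the character `I` decodes to Q40 and `5` decodes to Q20.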
Quality Scores of Sequencing Reads
General Genomics Workflow
1) Raw Data Analysis: quality control of raw data
2) Whole Genome Mapping: alignment to reference genome
3) Variant Calling: detection of genetic variation (SNPs, indels, SV)
4) Annotation: linking variants to biological information
Find the Location of Each Read in the
Genome
• Problems:
– Short sequence
– Millions of short sequences
– Big genome
– Mismatches
• Polymorphisms
• Sequencing errors
– Insertions and deletions
– May be processing many (100’s) of individuals
Short Read Mapping
…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…
[Figure: overlapping short reads tiled beneath the reference sequence; some reads differ from the reference at individual bases]
1) Report location of genome where read matches best
2) Minimize mismatches
3) Prefer mismatches at lower-quality bases over
mismatches at higher-quality bases
Short Read Mapping: Brute Force
Method (Stupid)
Simple conceptually: compare each query k-mer to all k-mers
of the genome
Scales poorly: the work grows with both the size of the
genome and the number of reads
Genome = AGCATGCTGCAGTCATGCTTAGGCTA
Read = GCT
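The brute-force comparison can be sketched in a few lines, using the genome and read from the slide. The function name `brute_force_match` is illustrative, and the sketch only finds exact matches, with no mismatches or indels.

```python
from typing import List


def brute_force_match(genome: str, read: str) -> List[int]:
    """Return every 0-based offset where the read matches the genome exactly."""
    hits = []
    k = len(read)
    # Slide a window of length k across the genome, one base at a time.
    for start in range(len(genome) - k + 1):
        if genome[start:start + k] == read:
            hits.append(start)
    return hits


genome = "AGCATGCTGCAGTCATGCTTAGGCTA"
read = "GCT"
hits = brute_force_match(genome, read)
print(hits)  # 0-based offsets of exact matches
```

Every read is compared against every window of the genome, so the work is roughly (genome length) x (read length) per read. That is why this approach does not scale to millions of reads and a 3.2 Gb genome.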
Solution
Index the Reference Genome
Indexing the reference is like constructing a phone
book: it lets you move quickly to the relevant portion of the
genome and ignore the rest.
Suffix Array
Split the genome into all of its suffixes (substrings that run
to the end of the sequence) and sort them alphabetically
Allows a query to be searched against an alphabetically sorted
reference, skipping 96% of the genome
Ex: banana$
Suffix     Sorted
banana$    $
anana$     a$
nana$      ana$
ana$       anana$
na$        banana$
a$         na$
$          nana$
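A naive way to build the suffix array for the banana$ example is sketched below. The function name `build_suffix_array` is illustrative. Sorting whole suffixes like this is fine for a toy string but far too slow and memory-hungry for a genome.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Sort suffix start positions by the suffix that begins there (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana$"
suffix_array = build_suffix_array(text)
sorted_suffixes = [text[i:] for i in suffix_array]

for position, suffix in zip(suffix_array, sorted_suffixes):
    print(position, suffix)
```

With plain character comparison, `$` sorts before the letters, so `na$` comes before `nana$` in the sorted list.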
Short Read Alignment: Binary Search
• Searching the index efficiently is still a
problem…
Index #  Sequence  Pos
1 ACAGATTACC… 6
2 ACC… 13
3 AGATTACC… 8
4 ATTACAGATTACC… 3
5 ATTACC… 10
6 C… 15
7 CAGATTACC… 7
8 CC… 14
9 GATTACAGATTACC… 2
10 GATTACC… 9
11 TACAGATTACC… 5
12 TACC… 12
13 TGATTACAGATTACC… 1
14 TTACAGATTACC… 4
15 TTACC… 11
Search for GATTACA…
Binary Search
• Initialize search range to entire list
– mid = (hi+lo)/2; middle = suffix[mid]
– if query matches middle: done
– else if query < middle: pick low range
– else if query > middle: pick hi range
• Repeat until done or empty range
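The steps above can be sketched as follows, using the 15-base example text from the binary search slide. The function names are illustrative, and the suffix array is built naively for the sake of a self-contained example.

```python
from typing import List, Optional


def build_suffix_array(text: str) -> List[int]:
    """Naive suffix array: sort start positions by their suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_read(text: str, suffix_array: List[int], query: str) -> Optional[int]:
    """Binary-search the sorted suffixes for one that starts with the query.

    Returns a 0-based start position in text, or None if the query is absent.
    """
    lo, hi = 0, len(suffix_array) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        # Compare only a query-length prefix of the middle suffix.
        middle = text[start:start + len(query)]
        if query == middle:
            return start
        if query < middle:
            hi = mid - 1  # pick the low range
        else:
            lo = mid + 1  # pick the high range
    return None


text = "TGATTACAGATTACC"
sa = build_suffix_array(text)
position = find_read(text, sa, "GATTACA")
print(position)  # 0-based; the slide's table uses 1-based positions
```

Each step halves the search range, so a lookup takes on the order of log2(n) comparisons for n suffixes.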
Applied to Human Genome
• In practice simple methods of indexing the
genome can create very large data structures
– Suffix Array: > 12 GB
• Solution: Apply complex procedures that allow
you to index and compress the data:
– Burrows-Wheeler Transform
– FM-Index
Burrows-Wheeler Transform
• Similar in many ways to creation of Suffix
Array
• Example string: BANANA$
1) Circular permutations of the string:
BANANA$
ANANA$B
NANA$BA
ANA$BAN
NA$BANA
A$BANAN
$BANANA
2) Lexicographical sort gives the Burrows-Wheeler Matrix, shown with the suffix array (start position of each row in the original string):
Row        Suffix Array
$BANANA    6
A$BANAN    5
ANA$BAN    3
ANANA$B    1
BANANA$    0
NA$BANA    4
NANA$BA    2
3) Transformed string (last column of the matrix): T(string) = ANNB$AA
– Compressible and reversible
FM-Index
• Transformed string (ANNB$AA) + suffix array (6, 5, 3, 1, 0, 4, 2) + character count tables
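The pieces of the BANANA$ example can be put together in a small sketch: build the transform from sorted rotations, derive simple character count tables, and use them to count exact matches by backward search. All function names are illustrative. This is a toy construction under simplifying assumptions (full tables, no compression or sampling), not how production aligners such as BWA or Bowtie build their indexes.

```python
from typing import Dict, List, Tuple


def bwt_via_rotations(text: str) -> Tuple[str, List[int]]:
    """Return (transformed string, suffix array) from sorted rotations.

    text is assumed to end with a unique terminator such as '$'.
    """
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The last column holds the character preceding each rotation's start.
    last_column = "".join(text[(i - 1) % n] for i in order)
    return last_column, order


def build_count_tables(bwt: str) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """first[c]: number of characters smaller than c; occ[c][i]: count of c in bwt[:i]."""
    alphabet = sorted(set(bwt))
    first: Dict[str, int] = {}
    total = 0
    for c in alphabet:
        first[c] = total
        total += bwt.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def count_occurrences(query: str, bwt: str) -> int:
    """Backward search: count exact matches of query in the original text."""
    first, occ = build_count_tables(bwt)
    lo, hi = 0, len(bwt)  # half-open range of matrix rows
    for c in reversed(query):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


bwt, sa = bwt_via_rotations("BANANA$")
print(bwt, sa)
print(count_occurrences("ANA", bwt))
```

The search walks the query from its last character to its first, narrowing a range of matrix rows at each step, so the cost depends on the query length rather than the genome length.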
Short Read Aligners
• BLAT: BLAST-Like Alignment Tool
• MAQ: First to take into account quality scores
• Bowtie: One of the first to use BWT, ungapped
alignment only
• BWA: One of the first to use BWT. First gapped
BWT, incredibly fast and memory efficient
• Bowtie2: Allows indels
• SOAP, SOAP2: Also use BWT
• … and many more
Next-Gen Sequencing Experiments
• Whole Genome Sequencing
• Targeted Exome Sequencing
• RNA-Seq
• ChIP-Seq
• CLIP-Seq
Exome Sequencing
Transcriptomics: RNA-Seq
• Sequence the actively transcribed genes in a
cell line or tissue
– Only about 20% of genes are transcribed in
particular cell types
• Two types:
– Poly-A selection
– Total RNA + ribodepletion
• Many experimental questions can be
addressed
RNA-Seq: Gene Expression
Condition 1
Condition 2
RNA-Seq: Differential Splicing
Exon 1 Exon 2 Exon 3
RNA-Seq: Novel/Non-Canonical Exon Discovery
Exon 1 Exon 2 Exon 3 Exon X
RNA-Seq: Gene Fusion Events
Exon 1 Exon 2 Exon 3
Gene 2 Exon 4
RNA-Seq
• Important to take into account biological
variability. A sample of cells is a mixed population
– Replicates!
• Not suited for discovering polymorphisms due to
higher error rates introduced by the reverse
transcription step (RNA -> cDNA)
• High false-positive rates for fusion-gene and
novel-exon discovery when expression levels are low
ChIP-Seq
LECTURE 2
Identifying and Annotating Genomic Variation for Disease Gene Discovery
Overview/Objectives
• Genetic Variation
– Types
• Identifying Genetic Variation
– Methods
• Annotation of Genes and Variants
– Methods
– Sources
• Gene/Variant Prioritization
– Methods
Mapping Alone is Insufficient
Need Information on Variation
Why Identify Variants?
Types of Genetic Variation
Genetic Variation
• dbSNP (NCBI) build 142
– Catalogs Single Nucleotide Variants (SNV)
– 365 Million Submitted
– 113 Million Validated
– 54 Million in Genes
– 36 Million With Frequency in Populations
• 50-80% of mutations involved in inherited
disease are caused by SNVs
– May be an overestimate due to lack of knowledge
SNP vs SNV
• Technically a polymorphism is a variation that
doesn’t cause disease and is common in a
population
• What is common?
– Greater than 5% in a population is a typical
definition
– Definitions of rare range from < 0.1% to < 1.0%
FREQUENCY OF GENETIC VARIANTS
Studies and Populations
Frequency of Polymorphisms:
Common vs Rare
• Mendelian disorders are caused by rare
variation, < 1% frequency in the relevant
population
• Leverage large projects aimed at assessing
genetic diversity in populations around the
world
1000 Genomes Project
Exome Sequencing Project
• Multi-Institutional
• Total possible patient pool of > 250,000
individuals, well phenotyped
– Includes healthy individuals and diseased
• Currently 6700 exomes sequenced
– 4420 European descent
– 2312 African American
• 1.2 million coding variations
– Most extremely rare/unique
– Many population specific
Other Resources and Projects
• Exome Aggregation Consortium: 60,000
Exomes
• Personal Genome Project (Ongoing)
• 100,000 Genomes Project (UK, Ongoing)
• BGI (Announced, China): 1 Million Genomes
• Precision Medicine Initiative (US, Announced):
1 Million Genomes
Human Populations
Population Matters
• Most variations in protein-coding genes occurred
fairly recently (last 20,000 years)
– Adaptation to agriculture and diet changes, pathogen
exposure and urban living
• Monogenic diseases have different prevalence in
different populations
– Cystic fibrosis in European population
– Hereditary Hemochromatosis in Northern Europeans
– Tay-Sachs in Ashkenazi Jews
– Sickle-Cell Anemia in Sub-Saharan African populations
DISCOVERING GENETIC VARIATION
Finding the Needles
Finding All Needles
SNPs
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
CGGTGAACGTTATCGACGATCCGATCGAACTGTCAGC
GGTGAACGTTATCGACGTTCCGATCGAACTGTCAGCG
TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
GTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG
TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
TCGACGATCCGATCGAACTGTCAGCGGCAAGCTGAT
ATCCGATCGAACTGTCAGCGGCAAGCTGATCG CGAT
TCCGATCGAACTGTCAGCGGCAAGCTGATCG CGATC
TCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA
GATCGAACTGTCAGCGGCAAGCTGATCG CGATCGA
AACTGTCAGCGGCAAGCTGATCG CGATCGATGCTA
TGTCAGCGGCAAGCTGATCGATCGATCGATGCTAG
INDELs
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
TCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG
reference genome
All regions with mismatches are potential variants
Genotype Calling: Determining the
Type of Needle, The Absurdly Simple
Way (Stupid)
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
CGGTGAACGTTATCGACGATCCGATCGAACTGTCAGC
GGTGAACGTTATCGACGTTCCGATCGAACTGTCAGCG
TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
TGAACGTTATCGACGTTCCGATCGAACTGTCAGCGGC
GTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG
TTATCGACGATCCGATCGAACTGTCAGCGGCAAGCT
TCGACGATCCGATCGAACTGTCAGCGGCAAGCTGAT
TTCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGA
ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA
reference genome
Read depth at base: 10 T: 4 A: 6
Genotype: Heterozygous A/T
Genotype Calling: The Absurdly Simple
Way (Stupid)
• Doesn’t account for sequencing error
• Doesn’t account for sequencing bias
• Doesn’t account for bias in short-read mapping
process
• Doesn’t account for mapping error
• Doesn’t consider any external source of
information regarding populations or known
genetic variations
Genotype Calling: The Absurdly Simple
Way (Slightly less Stupid)
• Algorithm:
– Count all aligned bases that pass a quality threshold
(e.g. > Q20)
– If the fraction of reads with the alternative base is > lower bound (20%)
and < upper bound (80%), call heterozygous
– Else if > upper bound, call homozygous alternative
– Else call homozygous reference
• …But what about using base qualities for more than
just filtering reads?
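The threshold rule above can be sketched as a small function. The name `call_genotype`, the `(base, quality)` pileup representation, and the Q30 qualities in the example are assumptions made for illustration. The sketch also lumps every non-reference base into a single alternative allele, which is a simplification.

```python
from typing import List, Tuple


def call_genotype(
    pileup: List[Tuple[str, int]],
    ref_base: str,
    min_quality: int = 20,
    lower: float = 0.2,
    upper: float = 0.8,
) -> str:
    """Threshold genotype call at one site.

    pileup holds one (base, Phred quality) pair per read covering the site.
    """
    # Keep only bases that pass the quality threshold (> Q20 by default).
    kept = [base for base, quality in pileup if quality > min_quality]
    if not kept:
        return "no call"
    alt_fraction = sum(1 for base in kept if base != ref_base) / len(kept)
    if lower < alt_fraction < upper:
        return "heterozygous"
    # Choice made here: a fraction exactly equal to `upper` counts as homozygous alternative.
    if alt_fraction >= upper:
        return "homozygous alternative"
    return "homozygous reference"


# Read depth 10 with T: 4 and A: 6 against reference base A; qualities are made up.
pileup = [("A", 30)] * 6 + [("T", 30)] * 4
print(call_genotype(pileup, ref_base="A"))
```

Note that the qualities are used only to discard bases. A Q21 base and a Q40 base count equally once they pass the threshold.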
Improving Genotype Calling: Local
Realignment
Duplicate Reads
Remove Duplicate Reads
What’s Missing
• No estimate of the confidence (stats) of
variant and genotype calls
• Doesn’t account robustly for known sources of
error
• Doesn’t make use of any sources of external
information
• Doesn’t include base qualities
Improving Genotype Calling: Bayes
Theorem
Prior Probability
Error Model
All Possible Genotypes
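One way to read the three labels is as the parts of Bayes' theorem for a genotype G given the observed reads D: P(G | D) = P(D | G) P(G) / sum over all possible genotypes G' of P(D | G') P(G'). The sketch below is a hedged illustration, not the method of any particular caller. The error model is an assumption: reads are independent, each base is miscalled with the probability implied by its Phred quality, and a miscall is equally likely to be any of the other three bases. The flat priors in the example are also hypothetical.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple

BASES = "ACGT"
Genotype = Tuple[str, str]


def base_likelihood(observed: str, allele: str, error: float) -> float:
    """P(observed base | true allele) under a uniform miscall model (assumed)."""
    return 1.0 - error if observed == allele else error / 3.0


def genotype_posteriors(
    pileup: List[Tuple[str, int]],
    priors: Dict[Genotype, float],
) -> Dict[Genotype, float]:
    """Posterior over the ten diploid genotypes given (base, Phred quality) pairs."""
    unnormalised = {}
    for genotype in combinations_with_replacement(BASES, 2):
        likelihood = 1.0
        for base, quality in pileup:
            error = 10 ** (-quality / 10)
            # Each read comes from one of the two alleles with equal probability.
            likelihood *= (
                0.5 * base_likelihood(base, genotype[0], error)
                + 0.5 * base_likelihood(base, genotype[1], error)
            )
        unnormalised[genotype] = likelihood * priors.get(genotype, 0.0)
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("all genotypes have zero probability")
    return {g: p / total for g, p in unnormalised.items()}


# Hypothetical flat prior and a pileup of 6 A and 4 T at Q30.
flat_priors = {g: 0.1 for g in combinations_with_replacement(BASES, 2)}
pileup = [("A", 30)] * 6 + [("T", 30)] * 4
posteriors = genotype_posteriors(pileup, flat_priors)
best = max(posteriors, key=posteriors.get)
print(best, posteriors[best])
```

For deep coverage, the product of many small likelihoods can underflow, so a practical implementation would usually work with log-probabilities.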
Improved Genotype Calling: Prior
Probability
• Known Polymorphic Site?
– Allele Frequencies
• Global rate of polymorphisms
• Other samples
• Substitution Type
Substitution Type
• Transition:
– Purine to Purine (A to G)
– Pyrimidine to Pyrimidine (C to T)
• Transversion:
– Purine to Pyrimidine, or Pyrimidine to Purine (e.g. A to C)
• Transition/Transversion ratio
– Transitions 2x as common (genome wide)
– 3x when looking only at exons
– Random error: 0.5
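The ratio of 0.5 for random error follows from counting: of the 12 possible single-base substitutions, 4 are transitions and 8 are transversions. A minimal sketch with illustrative function names:

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine-to-purine or pyrimidine-to-pyrimidine substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(substitutions: Iterable[Tuple[str, str]]) -> float:
    """Transition/transversion ratio for (ref, alt) single-base substitutions."""
    transitions = transversions = 0
    for ref, alt in substitutions:
        if ref == alt:
            continue  # not a substitution
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions


# All 12 possible substitutions, each equally likely, as purely random errors would be.
all_substitutions = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
random_ratio = ts_tv_ratio(all_substitutions)
print(random_ratio)
```

A call set whose ratio drifts toward 0.5 is therefore a hint that it contains many random errors rather than real variants.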
Prior Probability Example
Genotype priors (row and column give the two alleles):
     A           C           G           T
A    3.33x10^-4  1.11x10^-7  6.67x10^-4  1.11x10^-7
C                8.33x10^-5  1.67x10^-4  2.78x10^-8
G                            0.9985      1.67x10^-4
T                                        8.33x10^-5
Assume:
Heterozygous SNP Rate of 0.001
Homozygous SNP Rate of 0.0005
Reference: G
Transition/Transversion Ratio: 2
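The numbers in this example are consistent with the following construction, which is an inference from the values rather than something the slides state. The transition allele gets four times the weight of each transversion allele, giving a ratio of 4 / (1 + 1) = 2. The homozygous and heterozygous rates are split among the alternative alleles by those weights. A genotype with two different non-reference alleles takes the product of the two corresponding heterozygous priors. The function name `genotype_priors` is illustrative.

```python
from typing import Dict, Tuple

TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}


def genotype_priors(
    ref: str,
    het_rate: float = 0.001,
    hom_rate: float = 0.0005,
    ts_tv: float = 2.0,
) -> Dict[Tuple[str, str], float]:
    """Prior for each unordered genotype (alleles in alphabetical order)."""
    # One transition and two transversions per reference base, so a weight of
    # 2 * ts_tv on the transition yields an overall ratio of ts_tv.
    weights = {
        alt: (2.0 * ts_tv if alt == TRANSITION_PARTNER[ref] else 1.0)
        for alt in "ACGT"
        if alt != ref
    }
    total = sum(weights.values())
    share = {alt: w / total for alt, w in weights.items()}

    priors: Dict[Tuple[str, str], float] = {(ref, ref): 1.0 - het_rate - hom_rate}
    for alt, fraction in share.items():
        priors[(alt, alt)] = hom_rate * fraction
        priors[tuple(sorted((ref, alt)))] = het_rate * fraction

    # Assumed rule: two different non-reference alleles multiply their heterozygous priors.
    alts = sorted(share)
    for i, a in enumerate(alts):
        for b in alts[i + 1:]:
            priors[(a, b)] = (
                priors[tuple(sorted((ref, a)))] * priors[tuple(sorted((ref, b)))]
            )
    return priors


priors = genotype_priors("G")
for genotype in sorted(priors):
    print(genotype, f"{priors[genotype]:.3g}")
```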
Improved Genotype Calling: Error
Rates
Actual base (rows) vs. predicted base (columns), % of miscalls
     A     C     G     T
A    -     57.7  17.1  25.2
C    34.9  -     11.3  53.9
G    31.9  5.1   -     63.0
T    45.9  22.1  32.0  -
If a base was miscalled, what is it most likely to be called
as instead?
Variant Calling
• SNP Calls infested with False Positives
– Machine artifacts
– Mis-mapped reads
– Mis-aligned indels
• 5 – 20% false positive rate
Decisions and Trade-Offs
• Option 1: Use stringent program options for
calling variants and hard filtering early to produce
only highly-confident call set.
– Pro: Few false positives
– Con: Will miss real variants
• Option 2: Use less stringent (but reasonable)
options and filtering. Produce high-confidence
call set. Progressive filtering at later stage
– Con: False positives
– Pro: Won’t miss real variants
How Good Are My Calls?
• How many called SNPs?
– Human average of 1 heterozygous SNP / 1000
bases
• Fraction of variants already in dbSNP
– ~90%
• Transition/Transversion ratio
– Transitions 2x as common
• 3x when looking only at exons
ANNOTATING VARIANTS
Methods and Practices
Identifying Genetic Variation Causing
Genetic Disease
Discovering Genetic Variants Causing
Mendelian Disease
4 million genetic variants
→ 2 million associated with protein-coding genes
→ 10,000 possibly of disease-causing type
→ 1,500 with < 1% frequency in population
→ Single causal genetic variant
If a problem cannot be
solved, enlarge it.
--Dwight D. Eisenhower
Supreme Commander Allied Forces:
Second World War
34th President USA
Variant Annotation Pipeline Example
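Transcript Effects: Impact
[Figure: reference and patient transcripts, each with Exon 1, Intron 1, and Exon 2; the start codon, splice sites, and TAA stop codon are marked on the mRNA coding for protein]
Effects marked on the patient transcript:
• Splice Site Loss
• Missense/Frameshift
• Stop Gain
• TAA (Stop) → TAC (Tyr)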
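For single-codon substitutions, the effect categories can be worked out from the standard genetic code. The sketch below is illustrative only: the function name `classify_codon_change` is made up, and it does not handle frameshifts, splice sites, or anything else that needs the transcript context.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon substitution by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "stop gain"
    if ref_aa == "*":
        return "stop loss"
    return "missense"


print(classify_codon_change("TAA", "TAC"))  # stop codon replaced by Tyr
print(classify_codon_change("TAC", "TAA"))  # Tyr replaced by a stop codon
```

Under this classification, the TAA to TAC change removes the stop codon, so translation would continue past the normal end of the protein.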
Predicting Pathogenicity
Example: SIFT Algorithm
Input query sequence
→ PSI-BLAST
→ Homologs
→ Alignment
→ Multiple sequence alignment
→ PSSM
→ Normalize by most frequent AA
→ Score
Prediction Take-Away
The more conserved a site is the more likely
any substitution is to be deleterious
However: Current methods have pretty poor
performance, not suitable for clinical-level
diagnosis
Variant Annotation Pipeline Example
Classifying Genetic Variants
[Figure: classification tree for variants; node labels in reading order]
4 million variants
Intronic
Unknown
Splice Site
Potential Disease Causing
Exonic
Amino Acid Changing
Known Genetic Disease Variant
Stop Loss / Stop Gain
Missense Mutation
Known Polymorphism in Population
Silent Mutation
Splice Site
Potential Disease Causing
Intergenic
Visualization
GENE LEVEL ANNOTATION
Annotating Genes and Variants
• Is variant in a known protein-coding gene?
– What does the gene do?
– What molecular pathways?
– What protein-protein interactions?
– What tissues is it expressed in?
– When in development?
4 million genetic variants
→ 2 million associated with protein-coding genes
→ 10,000 possibly of disease-causing type
→ 1,500 with < 1% frequency in population
Gene Level Annotations
GENETIC REGIONS OF INTEREST
Identifying Genetic Regions of Interest
Number of Genes in Genomic Regions
of Interest
IGNITE Project: Local Controls
• IGNITE: Tasked with studying rare monogenic
diseases identified in Atlantic Canada
• Atlantic Canada harbours several non-represented
population groups and sub-groups…
– Acadians
– Native American
– Non-Acadian/European Descent
Population Frequency
• Mendelian disorders are rare
• If the variation is in a database, is it associated with
disease?
• Causal variation also needs to be rare
– Frequency cutoff somewhere between < 0.1% and < 1%
– Should appear rarely or not at all in local controls
– Should track with disease in the family members under study
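The criteria above amount to a simple filter over annotated variants. The sketch below is hypothetical throughout: the `Variant` fields, the default thresholds, and the example variants are invented for illustration and do not describe the IGNITE pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    name: str
    population_frequency: Optional[float]  # None if absent from reference databases
    local_control_count: int               # number of local controls carrying it
    in_all_affected: bool                  # present in every affected family member


def passes_rarity_filter(
    variant: Variant,
    max_frequency: float = 0.01,
    max_local_controls: int = 0,
) -> bool:
    """Keep variants that are rare, absent from local controls, and track with disease."""
    frequency = variant.population_frequency
    if frequency is not None and frequency >= max_frequency:
        return False
    if variant.local_control_count > max_local_controls:
        return False
    return variant.in_all_affected


# Made-up variants for illustration only.
candidates: List[Variant] = [
    Variant("var1", 0.15, 12, True),    # common in the population
    Variant("var2", 0.0004, 0, True),   # rare and tracks with disease
    Variant("var3", None, 3, True),     # novel, but seen in local controls
    Variant("var4", None, 0, False),    # novel, but does not track with disease
]
kept = [v.name for v in candidates if passes_rarity_filter(v)]
print(kept)
```

Treating a variant that is missing from the databases as passing the frequency check is a design choice here. It reflects that a causal variant in a rare disorder may never have been observed before.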
CASE STUDIES
IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa
IGNITE Data Pipeline and Integration
Mapped
Region(s)
Known Genes
Gene
Definitions
Pathway and
Interactions
Annotated
Genomic Variants
Filter
Sort
Prioritize
Gene
Annotations
Brain Calcification
• 84 genes in chromosome 5 region
• No likely homozygous or compound heterozygous
variants within region shared between two
patients
• 29 genes with at least one targeted region with
little or no sequencing coverage
• Many only lacked coverage in 5’ and 3’ UTRs
• Collaborators performed statistical tests for
possible copy-number variations of targeted
regions using exome sequencing data
Brain Calcification
Charcot-Marie-Tooth: Genetic Mapping
Chromosome 9:
120,962,282 -133,033,431
Cutis Laxa: Genetic Mapping
Chromosome 17:
79,596,811-81,041,077
Charcot-Marie-Tooth
• 143 genes in region
• 13 known causative genes
– MPZ
– PMP22
– GDAP1
– KIF1B
– MFN2
– SOX
– EGR2
– DNM2
– RAB7
– LITAF (SIMPLE)
– GARS
– YARS
– LMNA
Cutis Laxa
• 52 genes in region
• 5 known causative genes
– ATP6V0A2
– ELN
– FBLN5
– EFEMP2
– SCYL1BP1
– ALDH18A1
Gene Level Annotations
Pathway and Interaction Data
Charcot-Marie-Tooth
• 37 pathways
– Clathrin-derived vesicle budding
– Lysosome vesicle biogenesis
– Endocytosis
– Golgi-associated vesicle biogenesis
– Membrane trafficking
– Trans-Golgi network vesicle budding
• Primarily LMNA or DNM2
Cutis Laxa
• 10 pathways
– Phagosome
– Collecting duct acid secretion
– Lysosome
– Protein digestion and absorption
– Metabolic pathways
– Oxidative phosphorylation
– Arginine and proline metabolism
• Primarily ATP6V0A2
Simple Prioritization
Pathways and Protein-Protein
Interactions of Known Genes
Pathways and Protein-Protein
Interactions of Variant Genes
Results: Charcot-Marie-Tooth
• 8 Genes Prioritized
Gene      Interactions  Pathway
LRSAM1    Multiple      Endocytosis
DNM1      DNM2          -
FNBP1     DNM2          -
TOR1A     LMNA          -
STXBP1    Multiple      Five
SH3GLB2   -             Endocytosis
PIP5KL1   -             Endocytosis
FAM125B   -             Endocytosis
• For more information
– Guernsey et al (2010) PLoS Genetics. 6(8): e1001081
Results: Cutis Laxa
• 10 genes prioritized
Gene      Interactions  Pathway
HEXDC     Multiple      Phagosome
HG5       -             Phagosome
HG5       Multiple      Lysosome, Protein digestion
SIRT7     Multiple      Metabolic Pathways
FASN      -             Metabolic Pathways
DCXR      -             Metabolic Pathways
PYCR1     -             Metabolic Pathways, Arginine/Proline
PCYT2     -             Metabolic Pathways
ARHGDIA   -             Oxidative Phosphorylation
• For more information
– Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9

More Related Content

What's hot

140127 platinum genomes pedigree analyses
140127 platinum genomes pedigree analyses140127 platinum genomes pedigree analyses
140127 platinum genomes pedigree analysesGenomeInABottle
 
Bioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisBioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisDespoina Kalfakakou
 
CRISPR Gene Editing Congress, 25-27 February 2015 in Boston, MA
CRISPR Gene Editing Congress, 25-27 February 2015 in Boston, MACRISPR Gene Editing Congress, 25-27 February 2015 in Boston, MA
CRISPR Gene Editing Congress, 25-27 February 2015 in Boston, MADiane McKenna
 
A Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableA Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableDATAVERSITY
 
Genome Editing Comes of Age
Genome Editing Comes of AgeGenome Editing Comes of Age
Genome Editing Comes of AgeCandy Smellie
 
The key considerations of crispr genome editing
The key considerations of crispr genome editingThe key considerations of crispr genome editing
The key considerations of crispr genome editingChris Thorne
 
Rewriting the Genome Using CRISPR and Synthetic Biology
Rewriting the Genome Using CRISPR and Synthetic Biology Rewriting the Genome Using CRISPR and Synthetic Biology
Rewriting the Genome Using CRISPR and Synthetic Biology Integrated DNA Technologies
 
Achieve improved variant detection in single cell sequencing infographic
Achieve improved variant detection in single cell sequencing infographicAchieve improved variant detection in single cell sequencing infographic
Achieve improved variant detection in single cell sequencing infographicQIAGEN
 
Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015Kim D. Pruitt
 
Monitoring the quality of data in the clinical use of pathogen genomes
Monitoring the quality of data in the clinical use of pathogen genomesMonitoring the quality of data in the clinical use of pathogen genomes
Monitoring the quality of data in the clinical use of pathogen genomesHealth Informatics New Zealand
 
Kim Pruitt biocuration2015
Kim Pruitt biocuration2015Kim Pruitt biocuration2015
Kim Pruitt biocuration2015Kim D. Pruitt
 
Aug2013 illumina platinum genomes
Aug2013 illumina platinum genomesAug2013 illumina platinum genomes
Aug2013 illumina platinum genomesGenomeInABottle
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposalGenomeInABottle
 
An Introduction to Crispr Genome Editing
An Introduction to Crispr Genome EditingAn Introduction to Crispr Genome Editing
An Introduction to Crispr Genome EditingChris Thorne
 
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...Candy Smellie
 

What's hot (20)

171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
2015 Bioc4010 lecture1and2

  • 1. Next-Generation Sequence Analysis for Biomedical Applications BIOC 4010/5010 Lecture 1 Dr. Dan Gaston Postdoctoral Fellow Department of Pathology Dr. Karen Bedard Lab
  • 2. LECTURE 1 Introduction to Next-Gen Sequencing
  • 3. Overview: Lecture 1 • Why Next-Gen Sequencing Matters • What is Next-Gen Sequencing • Bioinformatics Workflows • Types of Next-Gen Experiments • Working with the Human Genome • Slides available on slideshare: – http://www.slideshare.net/DanGaston
  • 12. Major Areas in Human Disease Genomics • Complex Disease – Genome Wide Association Studies (GWAS) • Mendelian Disease – Whole Genome/Exome Sequencing – Transcriptomics – Genetic Linkage – Sanger Sequencing • Cancer – Tumour Genomics – Transcriptomics
  • 13. Traditional Diagnosis of Genetic Disease • Genetic Counselors/Physicians order individual testing of genes based on patient phenotype • For rare diseases or unusual phenotypes may run tens to hundreds of tests • …..EXPENSIVE (Easily thousands of dollars)
  • 14. Next Generation Diagnosis of Genetic Disease • NGS-Based Targeted Sequencing Panels • Clinical Exome • Clinical Genome
  • 15. Genetic Disease Research: The Slow and Traditional Way in the Dark Ages (circa 2009)
  • 16. Genetic Disease Research: Cutis Laxa Chromosome 9: 120,962,282 -133,033,431
  • 17. Cutis Laxa • Linked Genomic Region ~13Mb in size • Contains 143 Genes • Prioritize and select genes for individual sanger sequencing • …Slow • …Laborious • …Can be expensive
  • 20. Human Genomics: More Power! • $5,000 - $10,000 to sequence whole genome – Dropping towards $1000 for sequencing only • ~$1000 to sequence only protein-coding portion (exome, later)
  • 21. Clinical Genomics • Rapid diagnosis of genetic disease in NICU cases • Quicker and cheaper than sequential genetic testing (traditional method)
  • 24. Personalized Medicine: Oncology Tumour Sample DNA Non-Tumour Sample DNA Databases and Annotations Sequence Tumour Specific Mutations Tumour Classification Drugs
  • 27. Personalized Medicine: Oncology Welch JS, et al. JAMA, 2011;305, 1577
  • 28. Personalized Medicine: Monitoring For Cancer Chemotherapy Resistance
  • 29. Composition of Human Genome Size: 3.2 Gb
  • 30. Genomic Content
      Chromosome  Base pairs   Variations  Confirmed proteins  Putative proteins  Pseudogenes  miRNA  rRNA  Misc ncRNA
      1           249,250,621  4,401,091   2,012               31                 1,130        134    66    106
      2           243,199,373  4,607,702   1,203               50                 948          115    40    93
      3           198,022,430  3,894,345   1,040               25                 719          99     29    77
      4           191,154,276  3,673,892   718                 39                 698          92     24    71
      5           180,915,260  3,436,667   849                 24                 676          83     25    68
      6           171,115,067  3,360,890   1,002               39                 731          81     26    67
      7           159,138,663  3,045,992   866                 34                 803          90     24    70
      8           146,364,022  2,890,692   659                 39                 568          80     28    42
      9           141,213,431  2,581,827   785                 15                 714          69     19    55
      10          135,534,747  2,609,802   745                 18                 500          64     32    56
      11          135,006,516  2,607,254   1,258               48                 775          63     24    53
      12          133,851,895  2,482,194   1,003               47                 582          72     27    69
      13          115,169,878  1,814,242   318                 8                  323          42     16    36
      14          107,349,540  1,712,799   601                 50                 472          92     10    46
      15          102,531,392  1,577,346   562                 43                 473          78     13    39
      16          90,354,753   1,747,136   805                 65                 429          52     32    34
      17          81,195,210   1,491,841   1,158               44                 300          61     15    46
      18          78,077,248   1,448,602   268                 20                 59           32     13    25
      19          59,128,983   1,171,356   1,399               26                 181          110    13    15
      20          63,025,520   1,206,753   533                 13                 213          57     15    34
      21          48,129,895   787,784     225                 8                  150          16     5     8
      22          51,304,566   745,778     431                 21                 308          31     5     23
      X           155,270,560  2,174,952   815                 23                 780          128    22    52
      Y           59,373,566   286,812     45                  8                  327          15     7     2
      mtDNA       16,569       929         13                  0                  0            0      2     22
  • 31. Exome and Genome Sequencing
  • 32. Short Reads • Millions of “short reads” • 75-150 bp each • Usually “paired”
  • 34. FastQ Quality Scores • Q = -10 log10 P, where P is the probability of an incorrect base call
      Quality Score (Q)   Probability of incorrect base call   Base call accuracy
      10                  1 in 10                              90%
      20                  1 in 100                             99%
      30                  1 in 1,000                           99.9%
      40                  1 in 10,000                          99.99%
      50                  1 in 100,000                         99.999%
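The relationship Q = -10 log10 P can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are illustrative, and the quality-string decoding assumes the Phred+33 ASCII offset used by Sanger and recent Illumina FASTQ files, which the slide does not state.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to the error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score: Q = -10 log10 P."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The default offset of 33 (Phred+33) is an assumption; older
    Illumina files used an offset of 64.
    """
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_prob(q)
        print(f"Q{q}: P(error) = {p:g}, accuracy = {100 * (1 - p):.3f}%")
    print(decode_quality_string("II5!"))
```

Under the Phred+33 assumption, the character `I` decodes to Q40 and `!` to Q0.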
  • 35. Quality Scores of Sequencing Reads
  • 36. General Genomics Workflow • Raw Data Analysis: quality control of raw data • Whole Genome Mapping: alignment to reference genome • Variant Calling: detection of genetic variation (SNPs, Indels, SV) • Annotation: linking variants to biological information
  • 42. Find the Location of Each Read in the Genome • Problems: – Short sequence – Millions of short sequences – Big genome – Mismatches • Polymorphisms • Sequencing errors – Insertions and deletions – May be processing many (hundreds of) individuals
  • 43. Short Read Mapping • Reference: …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC… • [Figure: short reads tiled beneath the reference at their best-matching positions] 1) Report the location in the genome where the read matches best 2) Minimize mismatches 3) Prefer mismatches at lower-quality bases over mismatches at higher-quality bases
  • 44. Short Read Mapping: Brute Force Method (Stupid) • Simple conceptually: compare each query k-mer to all k-mers of the genome • Scales with the size of the genome and the reads (not particularly well) • Genome = AGCATGCTGCAGTCATGCTTAGGCTA • Read = GCT
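The brute-force approach on this slide can be sketched in a few lines of Python, using the slide's own genome and read. The function name is illustrative, and exact matching only is assumed (no mismatches or indels).

```python
def brute_force_search(genome, read):
    """Return every 0-based position where `read` matches `genome` exactly.

    Each k-mer of the genome is compared to the read, so the work grows
    with (genome length) x (read length) for every read searched.
    """
    k = len(read)
    return [i for i in range(len(genome) - k + 1) if genome[i:i + k] == read]


if __name__ == "__main__":
    genome = "AGCATGCTGCAGTCATGCTTAGGCTA"
    read = "GCT"
    print(brute_force_search(genome, read))  # 0-based match positions
```

For the slide's example the read GCT occurs three times, at 0-based positions 5, 16 and 22.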
  • 45. Solution: Index the Reference Genome • Indexing the reference is like constructing a phone book: you can move quickly to the relevant portion of the genome and ignore the rest.
  • 46. Suffix Array • Split the genome into all of its suffixes and sort them alphabetically • Allows a query to be searched against an alphabetical reference, skipping 96% of the genome • Ex: banana$
      Suffixes (by position)   Sorted suffixes
      banana$                  $
      anana$                   a$
      nana$                    ana$
      ana$                     anana$
      na$                      banana$
      a$                       na$
      $                        nana$
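A suffix array for the banana$ example can be built with a short Python sketch. This is the naive construction (sort all suffixes directly), which is fine for illustration but far too slow and memory-hungry for a real genome. The function name is illustrative, and `$` is assumed to sort before every letter, as it does in ASCII.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, sorted
    by the suffixes themselves (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    text = "banana$"
    suffix_array = build_suffix_array(text)
    for pos in suffix_array:
        print(pos, text[pos:])
```

The resulting order of start positions is 6, 5, 3, 1, 0, 4, 2, which corresponds to the sorted suffixes $, a$, ana$, anana$, banana$, na$, nana$.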
  • 47. Short Read Alignment: Binary Search • Searching the index efficiently is still a problem… • Search for GATTACA…
      Index #   Sequence            Pos
      1         ACAGATTACC…         6
      2         ACC…                13
      3         AGATTACC…           8
      4         ATTACAGATTACC…      3
      5         ATTACC…             10
      6         C…                  15
      7         CAGATTACC…          7
      8         CC…                 14
      9         GATTACAGATTACC…     2
      10        GATTACC…            9
      11        TACAGATTACC…        5
      12        TACC…               12
      13        TGATTACAGATTACC…    1
      14        TTACAGATTACC…       4
      15        TTACC…              11
  • 52. Binary Search • Initialize search range to entire list – mid = (hi+lo)/2; middle = suffix[mid] – if query matches middle: done – else if query < middle: pick low range – else if query > middle: pick hi range • Repeat until done or empty range
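The binary search steps above can be sketched in Python against the sorted-suffix example (genome TGATTACAGATTACC, query GATTACA). This is a minimal sketch with illustrative function names: it appends a `$` terminator, builds the suffix array naively, and returns only one matching position rather than all of them.

```python
def build_suffix_array(text):
    """Naive suffix array: suffix start positions sorted by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def binary_search(text, suffix_array, query):
    """Return the 0-based position of one occurrence of `query`, or None."""
    lo, hi = 0, len(suffix_array) - 1          # search range: entire list
    while lo <= hi:                            # stop on an empty range
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        middle = text[start:start + len(query)]  # prefix of the middle suffix
        if query == middle:
            return start                       # query matches: done
        elif query < middle:
            hi = mid - 1                       # pick the low range
        else:
            lo = mid + 1                       # pick the high range
    return None


if __name__ == "__main__":
    genome = "TGATTACAGATTACC$"
    sa = build_suffix_array(genome)
    hit = binary_search(genome, sa, "GATTACA")
    print(hit)  # 0-based; add 1 to compare with the slide's 1-based Pos column
```

The search returns 0-based position 1, which is Pos 2 in the slide's 1-based table. Each step halves the range, so about log2(n) comparisons are needed instead of n.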
  • 53. Applied to Human Genome • In practice simple methods of indexing the genome can create very large data structures – Suffix Array: > 12 GB • Solution: Apply complex procedures that allow you to index and compress the data: – Burrows-Wheeler Transform – FM-Index
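The "> 12 GB" figure for a suffix array can be reproduced with simple arithmetic. This sketch assumes one 4-byte (32-bit) integer per genome position and the 3.2 Gb genome size quoted earlier in the lecture; the entry size is an assumption, not something the slide states.

```python
GENOME_LENGTH = 3_200_000_000   # 3.2 Gb human genome, one suffix per base
BYTES_PER_ENTRY = 4             # assumed: 32-bit integer per suffix position

suffix_array_bytes = GENOME_LENGTH * BYTES_PER_ENTRY
suffix_array_gb = suffix_array_bytes / 1e9

if __name__ == "__main__":
    print(f"Suffix array size: {suffix_array_gb:.1f} GB")
```

This counts only the array of positions; the reference sequence itself has to be stored as well.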
  • 54. Burrows-Wheeler Transform • Similar in many ways to creation of Suffix Array • Input string: BANANA$
  • 55. Burrows-Wheeler Transform • Circular Permutation of BANANA$: BANANA$, ANANA$B, NANA$BA, ANA$BAN, NA$BANA, A$BANAN, $BANANA
  • 56. Burrows-Wheeler Transform • Lexicographical Sort of the rotations: $BANANA, A$BANAN, ANA$BAN, ANANA$B, BANANA$, NA$BANA, NANA$BA
  • 57. Burrows-Wheeler Transform • The sorted rotations form the Burrows-Wheeler Matrix: $BANANA, A$BANAN, ANA$BAN, ANANA$B, BANANA$, NA$BANA, NANA$BA
  • 59. Burrows-Wheeler Transform • T(string) = ANNB$AA (the last column of the matrix) • Transformed String: Compressible and Reversible
  • 60. Burrows-Wheeler Transform • T(string) = ANNB$AA • Suffix Array: 6 5 3 1 0 4 2
  • 61. Burrows-Wheeler Transform • FM-Index = T(string) (ANNB$AA) + Suffix Array (6, 5, 3, 1, 0, 4, 2) + Character Count Tables
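The BANANA$ example can be reproduced with a short Python sketch that builds the transform from sorted rotations and then uses character count tables to count exact matches by backward search. This is a toy illustration with illustrative names; real aligners such as BWA and Bowtie use compressed, sampled versions of these tables, and this code is not their implementation.

```python
def bwt(text):
    """Burrows-Wheeler Transform: last column of the sorted rotations."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_count_tables(bwt_string):
    """Build the two character count tables used for backward search.

    first[c]  = number of characters in the text that sort before c
    occ[c][i] = number of occurrences of c in bwt_string[:i]
    """
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt_string) + 1) for ch in counts}
    for i, ch in enumerate(bwt_string):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def count_matches(query, first, occ, n):
    """Count exact occurrences of `query`, reading it right to left."""
    lo, hi = 0, n                       # current range of matrix rows
    for ch in reversed(query):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    text = "BANANA$"
    transformed = bwt(text)
    print(transformed)
    first, occ = build_count_tables(transformed)
    print(count_matches("ANA", first, occ, len(text)))
```

The transform of BANANA$ is ANNB$AA, and the backward search finds two occurrences of ANA without ever scanning the original string.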
  • 62. Short Read Aligners • BLAT: BLAST-Like Alignment Tool • MAQ: First to take quality scores into account • Bowtie: One of the first to use BWT; ungapped alignment only • BWA: One of the first to use BWT and the first with gapped BWT alignment; incredibly fast and memory efficient • Bowtie2: Allows indels • SOAP, SOAP2: Also use BWT • … and many more
  • 63. Next-Gen Sequencing Experiments • Whole Genome Sequencing • Targeted Exome Sequencing • RNA-Seq • ChIP-Seq • CLIP-Seq
  • 65. Transcriptomics: RNA-Seq • Sequence the actively transcribed genes in a cell line or tissue – Only about 20% of genes are transcribed in particular cell types • Two types: – Poly-A selection – Total RNA + ribodepletion • Many experimental questions can be addressed
  • 69. RNA-Seq: Gene Fusion Events Exon1 Exon 2 Exon 3 Gene 2 Exon 4
  • 70. RNA-Seq • Important to take into account biological variability. A sample of cells is a mixed population – Replicates! • Not suited for discovering polymorphisms due to higher error rates introduced by the reverse transcription step (RNA -> cDNA) • High false positive rates for discovery of fusion genes and novel exons when expression levels are low
  • 73. LECTURE 2 Identifying and Annotating Genomic Variation for Disease Gene Discovery
  • 74. Overview/Objectives • Genetic Variation – Types • Identifying Genetic Variation – Methods • Annotation of Genes and Variants – Methods – Sources • Gene/Variant Prioritization – Methods
  • 75. Mapping Alone is Insufficient
  • 76. Need Information on Variation
  • 79. Types of Genetic Variation
  • 80. Genetic Variation • dbSNP (NCBI) build 142 – Catalogs Single Nucleotide Variants (SNV) – 365 Million Submitted – 113 Million Validated – 54 Million in Genes – 36 Million With Frequency in Populations • 50-80% of mutations involved in inherited disease are caused by SNVs – May be an overestimate due to lack of knowledge
  • 81. SNP vs SNV • Technically a polymorphism is a variation that doesn’t cause disease and is common in a population • What is common? – Greater than 5% in a population is a typical definition – Definitions of rare range from < 0.1% to < 1.0%
  • 82. FREQUENCY OF GENETIC VARIANTS Studies and Populations
  • 83. Frequency of Polymorphisms: Common vs Rare • Mendelian disorders are caused by rare variation, < 1% frequency in the relevant population • Leverage large projects aimed at assessing genetic diversity in populations around the world
  • 85. Exome Sequencing Project • Multi-Institutional • Total possible patient pool of > 250,000 individuals, well phenotyped – Includes healthy individuals and diseased • Currently 6700 exomes sequenced – 4420 European descent – 2312 African American • 1.2 million coding variations – Most extremely rare/unique – Many population specific
  • 87. Other Resources and Projects • Exome Aggregation Consortium: 60,000 Exomes • Personal Genome Project (Ongoing) • 100,000 Genomes Project (UK, Ongoing) • BGI (Announced, China): 1 Million Genomes • Precision Medicine Initiative (US, Announced): 1 Million Genomes
  • 91. Population Matters • Most variations in protein-coding genes occurred fairly recently (last 20,000 years) – Adaptation to agriculture and diet changes, pathogen exposure and urban living • Monogenic diseases have different prevalence in different populations – Cystic fibrosis in European population – Hereditary Hemochromatosis in Northern Europeans – Tay-Sachs in Ashkenazi Jews – Sickle-Cell Anemia in Sub-Saharan African populations
  • 94. Finding All Needles • [Figure: short reads aligned to the reference genome; columns where reads disagree with the reference mark candidate SNPs, and gaps within reads mark candidate INDELs] • All regions with mismatches are potential variants
  • 95. Genotype Calling: Determining the Type of Needle, The Absurdly Simple Way (Stupid) • [Figure: pileup of reads aligned to the reference genome; at one position some reads carry T where the reference has A] • Read depth at base: 10 – T: 4 – A: 6 • Genotype: Heterozygous A/T
  • 96. Genotype Calling: The Absurdly Simple Way (Stupid) • Doesn’t account for sequencing error • Doesn’t account for sequencing bias • Doesn’t account for bias in short-read mapping process • Doesn’t account for mapping error • Doesn’t consider any external source of information regarding populations or known genetic variations
  • 97. Genotype Calling: The Absurdly Simple Way (Slightly less Stupid) • Algorithm: – Count all aligned bases that pass a quality threshold (e.g. >Q20) – If the fraction of reads with the alternative base is > lower bound (20%) and < upper bound (80%), call heterozygous – Else if > upper bound, call homozygous alternative – Else call homozygous reference • …But what about using base qualities for more than deciding which reads to keep?
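The threshold algorithm above is short enough to write out directly. This is a sketch under stated assumptions: the pileup is a list of (base, Phred quality) pairs, only the most frequent non-reference base is considered, and a fraction exactly equal to the upper bound is treated as homozygous alternative (the slide leaves that boundary unspecified). The function name is illustrative.

```python
from collections import Counter

def call_genotype(pileup, ref_base, min_qual=20, lower=0.2, upper=0.8):
    """Naive genotype call from a list of (base, phred_quality) pairs."""
    # Keep only bases that pass the quality threshold (e.g. > Q20).
    bases = [base for base, qual in pileup if qual > min_qual]
    if not bases:
        return None  # no usable evidence at this site
    counts = Counter(bases)
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return f"{ref_base}/{ref_base}"
    # Assumption: only the most common alternative base is considered.
    alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
    alt_fraction = alt_n / len(bases)
    if lower < alt_fraction < upper:
        return f"{ref_base}/{alt_base}"   # heterozygous
    if alt_fraction >= upper:
        return f"{alt_base}/{alt_base}"   # homozygous alternative
    return f"{ref_base}/{ref_base}"       # homozygous reference

if __name__ == "__main__":
    # Depth 10 with A: 6 and T: 4, as in the earlier pileup example.
    site = [("A", 30)] * 6 + [("T", 30)] * 4
    print(call_genotype(site, "A"))
```

Note that quality is used only as a keep-or-discard filter here, which is exactly the limitation the last bullet points at.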
  • 98. Improving Genotype Calling: Local Realignment
  • 104. What’s Missing • No estimate of the confidence (stats) of variant and genotype calls • Doesn’t account robustly for known sources of error • Doesn’t make use of any sources of external information • Doesn’t include base qualities
  • 105. Improving Genotype Calling: Bayes Theorem
  • 106. Improving Genotype Calling: Bayes Theorem • Terms labelled in the formula: – Prior Probability – Error Model – All Possible Genotypes
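Written out, Bayes' theorem for genotype calling ties the three labelled terms together. The symbols are notation assumed here rather than taken from the slide: G is a candidate genotype and D is the observed data (bases and qualities) at the site.

```latex
P(G \mid D) \;=\;
\frac{\overbrace{P(D \mid G)}^{\text{error model}}\;
      \overbrace{P(G)}^{\text{prior probability}}}
     {\underbrace{\sum_{G'} P(D \mid G')\, P(G')}_{\text{all possible genotypes}}}
```

The genotype with the highest posterior is called, and the posterior itself serves as the confidence estimate that the simple threshold method lacks.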
  • 107. Improved Genotype Calling: Prior Probability • Known Polymorphic Site? – Allele Frequencies • Global rate of polymorphisms • Other samples • Substitution Type
  • 108. Substitution Type • Transition: – Purine to Purine (A <-> G) – Pyrimidine to Pyrimidine (C <-> T) • Transversion: – Purine to Pyrimidine, or Pyrimidine to Purine • Transition/Transversion ratio – Transitions about 2x as common as transversions (genome-wide) – About 3x when looking only at exons – Random error: 0.5
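A short sketch makes the ratio concrete, including why purely random substitutions give 0.5: of the 12 possible single-base changes, 4 are transitions and 8 are transversions. The function names are illustrative.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ts_tv_ratio(snvs):
    """Transition/transversion ratio for a list of (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    return transitions / transversions if transversions else float("inf")

if __name__ == "__main__":
    # Every possible substitution once: the expectation for random errors.
    every_substitution = list(permutations("ACGT", 2))
    print(ts_tv_ratio(every_substitution))
```

A call set whose ratio drifts from the expected value towards 0.5 is therefore likely to contain many random errors.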
  • 109. Prior Probability Example • Assume: – Heterozygous SNP rate of 0.001 – Homozygous SNP rate of 0.0005 – Reference: G – Transition/Transversion ratio: 2
  • 110. Prior Probability Example • Assume: – Heterozygous SNP rate of 0.001 – Homozygous SNP rate of 0.0005 – Reference: G – Transition/Transversion ratio: 2 • Genotype priors (upper triangle of the A/C/G/T table): – AA: 3.33x10^-4, AC: 1.11x10^-7, AG: 6.67x10^-4, AT: 1.11x10^-7 – CC: 8.33x10^-5, CG: 1.67x10^-4, CT: 2.78x10^-8 – GG: 0.9985, GT: 1.67x10^-4 – TT: 8.33x10^-5
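The table values can be reproduced from the four stated assumptions. The sketch below is one way to do so, not necessarily how the slide's numbers were produced: with a Ts/Tv ratio of 2, the single possible transition gets 2/3 of the rate and each of the two transversions gets 1/6. The product rule used for genotypes with two non-reference alleles is an assumption chosen because it matches the slide's values.

```python
from itertools import combinations_with_replacement

TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}

def genotype_priors(ref, het_rate=0.001, hom_rate=0.0005, ts_tv=2.0):
    """Prior probability of each of the 10 diploid genotypes at a site."""
    def alt_weight(alt):
        # Share of the SNP rate assigned to this particular alternative base.
        if alt == TRANSITION_PARTNER[ref]:
            return ts_tv / (ts_tv + 1)
        return 0.5 / (ts_tv + 1)  # two transversions split the remainder

    priors = {}
    for a, b in combinations_with_replacement("ACGT", 2):
        if a == ref and b == ref:
            p = 1 - het_rate - hom_rate                  # homozygous reference
        elif a == b:
            p = hom_rate * alt_weight(a)                 # homozygous alternative
        elif ref in (a, b):
            alt = b if a == ref else a
            p = het_rate * alt_weight(alt)               # ref/alt heterozygote
        else:
            # Assumed rule: two independent heterozygous events.
            p = (het_rate * alt_weight(a)) * (het_rate * alt_weight(b))
        priors[a + b] = p
    return priors

if __name__ == "__main__":
    for genotype, prior in genotype_priors("G").items():
        print(f"{genotype}: {prior:.3g}")
```

Because the two-alternative genotypes are added on top of the three stated rates, the priors sum to very slightly more than 1; a caller that normalizes posteriors is unaffected by this.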
  • 111. Improved Genotype Calling: Error Rates • If a base was miscalled, what is it most likely to be called as instead? (rows: actual base; values: % of miscalls assigned to each predicted base) – Actual A: C 57.7, G 17.1, T 25.2 – Actual C: A 34.9, G 11.3, T 53.9 – Actual G: A 31.9, C 5.1, T 63.0 – Actual T: A 45.9, C 22.1, G 32.0
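Putting prior and error model together gives a small Bayesian caller. This is a simplified sketch, not any particular tool's method: the error model assumes a miscalled base is equally likely to be any of the other three bases (the table above shows real miscalls are not uniform), reads are treated as independent, and the priors in the example are illustrative values for a reference A site.

```python
import math

def phred_to_error(quality):
    """Convert a Phred quality score to a probability of a wrong base call."""
    return 10 ** (-quality / 10)

def base_likelihood(observed, allele, error):
    # Simplifying assumption: uniform miscalls across the 3 other bases.
    return 1 - error if observed == allele else error / 3

def genotype_posteriors(pileup, priors):
    """Posterior over genotypes given (base, phred_quality) observations."""
    log_scores = {}
    for genotype, prior in priors.items():
        allele1, allele2 = genotype
        log_p = math.log(prior)
        for base, quality in pileup:
            error = phred_to_error(quality)
            # Each read comes from either chromosome with probability 1/2.
            log_p += math.log(
                0.5 * base_likelihood(base, allele1, error)
                + 0.5 * base_likelihood(base, allele2, error)
            )
        log_scores[genotype] = log_p
    # Normalize over all candidate genotypes (the denominator in Bayes' theorem).
    best = max(log_scores.values())
    unnormalized = {g: math.exp(s - best) for g, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

# Illustrative priors for a site with reference base A (A -> T is a transversion).
EXAMPLE_PRIORS = {"AA": 0.9985, "AT": 1.67e-4, "TT": 8.33e-5}

if __name__ == "__main__":
    site = [("A", 30)] * 6 + [("T", 30)] * 4
    posteriors = genotype_posteriors(site, EXAMPLE_PRIORS)
    print(max(posteriors, key=posteriors.get))
```

Unlike the threshold method, every base quality contributes to the result, and the output is a probability for each genotype rather than a bare call.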
  • 112. Variant Calling • SNP Calls infested with False Positives – Machine artifacts – Mis-mapped reads – Mis-aligned indels • 5 – 20% false positive rate
  • 117. Decisions and Trade-Offs • Option 1: Use stringent program options for calling variants and hard filtering early to produce only highly-confident call set. – Pro: Few false positives – Con: Will miss real variants • Option 2: Use less stringent (but reasonable) options and filtering. Produce high-confidence call set. Progressive filtering at later stage – Con: False positives – Pro: Won’t miss real variants
  • 118. How Good Are My Calls? • How many called SNPs? – Human average of 1 heterozygous SNP / 1000 bases • Fraction of variants already in dbSNP – ~90% • Transition/Transversion ratio – Transitions 2x as common – 3x when looking only at exons
  • 120. Identifying Genetic Variation Causing Genetic Disease
  • 121. Discovering Genetic Variants Causing Mendelian Disease • 4 million genetic variants -> 2 million associated with protein-coding genes -> 10,000 possibly of disease-causing type -> 1,500 at <1% frequency in population -> single causal genetic variant
  • 122. If a problem cannot be solved, enlarge it. --Dwight D. Eisenhower Supreme Commander Allied Forces: Second World War 34th President USA
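  • 130. Transcript Effects: Impact • [Diagram: reference and patient transcripts (Exon 1, Intron 1, Exon 2) with the start codon, splice sites and stop codon marked on the mRNA coding for protein] • Codons shown: TAA (Stop), TAC (Tyr) • Effects illustrated: – Splice Site Loss – Missense/Frameshift – Stop Gain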
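The effect labels on this slide follow mechanically from the genetic code. Here is a minimal sketch that classifies a single-codon substitution using the standard codon table; frameshifts and splice-site changes need transcript coordinates and are out of scope. The function name and label strings are illustrative.

```python
# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def classify_codon_change(ref_codon, alt_codon):
    """Label the protein-level effect of replacing ref_codon with alt_codon."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "stop gain"
    if ref_aa == "*":
        return "stop loss"
    return "missense"

if __name__ == "__main__":
    # The two codons shown on the slide: TAC (Tyr) and TAA (Stop).
    print(classify_codon_change("TAC", "TAA"))
    print(classify_codon_change("TAA", "TAC"))
```

The direction of the change matters: the same pair of codons is a stop gain one way and a stop loss the other.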
  • 132. Example: SIFT Algorithm • Input Query Sequence -> Psi-BLAST -> Homologs -> Alignment -> Multiple Sequence Alignment -> PSSM -> Normalize by most frequent AA -> Score
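The last two steps of that pipeline (normalize by the most frequent amino acid, then score) can be illustrated on a single alignment column. This is a deliberately simplified sketch and not the actual SIFT implementation, which builds its matrix with pseudocounts and sequence weighting; the 0.05 cutoff and the function names are assumptions for illustration.

```python
from collections import Counter

def normalized_scores(column):
    """Score each residue in an alignment column relative to the most frequent one."""
    residues = [aa for aa in column if aa != "-"]  # ignore alignment gaps
    if not residues:
        raise ValueError("column has no residues")
    counts = Counter(residues)
    most_common = max(counts.values())
    return {aa: count / most_common for aa, count in counts.items()}

def predict(column, substitution, cutoff=0.05):
    """Return (label, score) for substituting this residue at this column."""
    score = normalized_scores(column).get(substitution, 0.0)
    label = "tolerated" if score > cutoff else "deleterious"
    return label, score

if __name__ == "__main__":
    # Column from a hypothetical alignment of ten homologs.
    column = "LLLLLLLLIV"
    print(predict(column, "I"))
    print(predict(column, "D"))
```

A residue never seen among the homologs scores 0 here, which is why real implementations add pseudocounts: with few homologs, absence is weak evidence.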
  • 133. Prediction Take-Away • The more conserved a site is, the more likely any substitution is to be deleterious • However: current methods have pretty poor performance and are not suitable for clinical-level diagnosis
  • 135. Classifying Genetic Variants • [Diagram: classification tree for 4 million variants] • Node labels, in slide order: Intronic; Unknown; Splice Site (Potential Disease Causing); Exonic; Amino Acid Changing; Known Genetic Disease Variant; Stop Loss / Stop Gain; Missense Mutation; Known Polymorphism in Population; Silent Mutation; Splice Site (Potential Disease Causing); Intergenic
  • 138. Annotating Genes and Variants • Is the variant in a known protein-coding gene? – What does the gene do? – What molecular pathways is it in? – What protein-protein interactions does it have? – What tissues is it expressed in? – When in development? • Filtering funnel: 4 million genetic variants -> 2 million associated with protein-coding genes -> 10,000 possibly of disease-causing type -> 1,500 at <1% frequency in population
  • 140. GENETIC REGIONS OF INTEREST
  • 143. Number of Genes in Genomic Regions of Interest
  • 145. IGNITE Project: Local Controls • IGNITE: Tasked with studying rare monogenic diseases identified in Atlantic Canada • Atlantic Canada harbours several non-represented population groups and sub-groups… – Acadians – Native American – Non-Acadian/European Descent
  • 146. Population Frequency • Mendelian disorders are rare • If a variation is in a database, is it associated with disease? • Causal variation also needs to be rare – Frequency cutoff somewhere between < 0.1% and < 1% – Should appear rarely or not at all in local controls – Should track with disease in the family members under study
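These three criteria translate directly into a filter. The sketch below assumes each variant is a dictionary; the field names (`population_frequency`, `local_control_carriers`, `segregates_with_disease`) and the example variant identifiers are hypothetical, and the 1% cutoff is one choice from the range given above.

```python
def is_candidate(variant, max_frequency=0.01, max_control_carriers=0):
    """Apply the rarity, local-control and segregation criteria to one variant."""
    frequency = variant.get("population_frequency")  # None: absent from databases
    if frequency is not None and frequency >= max_frequency:
        return False  # too common to cause a rare Mendelian disorder
    if variant.get("local_control_carriers", 0) > max_control_carriers:
        return False  # seen in unaffected local controls
    # Must track with disease in the family under study.
    return variant.get("segregates_with_disease", False)

if __name__ == "__main__":
    # Hypothetical variants for illustration only.
    variants = [
        {"id": "var1", "population_frequency": 0.12,
         "local_control_carriers": 3, "segregates_with_disease": True},
        {"id": "var2", "population_frequency": None,
         "local_control_carriers": 0, "segregates_with_disease": True},
        {"id": "var3", "population_frequency": 0.0004,
         "local_control_carriers": 0, "segregates_with_disease": False},
    ]
    print([v["id"] for v in variants if is_candidate(v)])
```

Setting `max_control_carriers` above zero relaxes the "rarely or not at all" criterion for recessive disorders, where unaffected carriers are expected.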
  • 147. CASE STUDIES IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa
  • 148. IGNITE Data Pipeline and Integration Mapped Region(s) Known Genes Gene Definitions Pathway and Interactions Annotated Genomic Variants Filter Sort Prioritize Gene Annotations
  • 150. Brain Calcification • 84 genes in chromosome 5 region • No likely homozygous or compound heterozygous variants within region shared between two patients • 29 genes with at least one targeted region with little or no sequencing coverage • Many only lacked coverage in 5’ and 3’ UTRs • Collaborators performed statistical tests for possible copy-number variations of targeted regions using exome sequencing data
  • 152. Charcot-Marie-Tooth: Genetic Mapping Chromosome 9: 120,962,282 -133,033,431
  • 153. Cutis Laxa: Genetic Mapping Chromosome 17: 79,596,811-81,041,077
  • 154. Charcot-Marie-Tooth • 143 genes in region • 13 known causative genes – MPZ – PMP22 – GDAP1 – KIF1B – MFN2 – SOX – EGR2 – DNM2 – RAB7 – LITAF (SIMPLE) – GARS – YARS – LMNA • Cutis Laxa • 52 genes in region • 5 known causative genes – ATP6V0A2 – ELN – FBLN5 – EFEMP2 – SCYL1BP1 – ALDH18A1
  • 156. Pathway and Interaction Data • Charcot-Marie-Tooth: 37 pathways – Clathrin-derived vesicle budding – Lysosome vesicle biogenesis – Endocytosis – Golgi-associated vesicle biogenesis – Membrane trafficking – Trans-Golgi network vesicle budding – Primarily LMNA or DNM2 • Cutis Laxa: 10 pathways – Phagosome – Collecting duct acid secretion – Lysosome – Protein digestion and absorption – Metabolic pathways – Oxidative phosphorylation – Arginine and proline metabolism – Primarily ATP6V0A2
  • 157. Simple Prioritization • Pathways and Protein-Protein Interactions of Known Genes • Pathways and Protein-Protein Interactions of Variant Genes
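One simple reading of this comparison is a set overlap: a variant gene is prioritized when it shares a pathway with, or interacts with, a known disease gene. The sketch below assumes that reading; the scoring (one point per shared item), the data layout and the placeholder gene `GENE_X` are illustrative, and the two real gene entries echo rows from the Charcot-Marie-Tooth results.

```python
def prioritize(candidates, known_pathways, known_genes):
    """Rank variant genes by overlap with known genes' pathways and interactors."""
    ranked = []
    for gene, info in candidates.items():
        shared_pathways = info["pathways"] & known_pathways
        shared_partners = info["interactors"] & known_genes
        score = len(shared_pathways) + len(shared_partners)
        if score > 0:
            ranked.append(
                (gene, score, sorted(shared_pathways), sorted(shared_partners))
            )
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda row: (-row[1], row[0]))

if __name__ == "__main__":
    known_genes = {"DNM2", "LMNA"}
    known_pathways = {"Endocytosis", "Membrane trafficking"}
    candidates = {
        "DNM1": {"pathways": set(), "interactors": {"DNM2"}},
        "SH3GLB2": {"pathways": {"Endocytosis"}, "interactors": set()},
        "GENE_X": {"pathways": {"Unrelated pathway"}, "interactors": set()},
    }
    for row in prioritize(candidates, known_pathways, known_genes):
        print(row)
```

Genes with no overlap are dropped rather than ranked last, which mirrors the short prioritized lists reported for the case studies.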
  • 158. Results: Charcot-Marie-Tooth • 8 Genes Prioritized (Gene: Interactions; Pathway) – LRSAM1: Multiple; Endocytosis – DNM1: DNM2; - – FNBP1: DNM2; - – TOR1A: LMNA; - – STXBP1: Multiple; Five – SH3GLB2: -; Endocytosis – PIP5KL1: -; Endocytosis – FAM125B: -; Endocytosis • For more information – Guernsey et al (2010) PLoS Genetics. 6(8): e1001081
  • 159. Results: Cutis Laxa • 10 genes prioritized (Gene: Interactions; Pathway) – HEXDC: Multiple; Phagosome – HG5: -; Phagosome – HG5: Multiple; Lysosome, Protein digestion – SIRT7: Multiple; Metabolic Pathways – FASN: -; Metabolic Pathways – DCXR: -; Metabolic Pathways – PYCR1: -; Metabolic Pathways, Arginine/Proline – PCYT2: -; Metabolic Pathways – ARHGDIA: -; Oxidative Phosphorylation • For more information – Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9

Editor's Notes

  1. Create a visual workflow of NGS based precision medicine in Oncology