Bioc4010 lectures 1 and 2


Introduction to NGS and Bioinformatics for Human Disease Applications


  1. Next-Generation Sequence Analysis for Biomedical Applications
     BIOC 4010/5010, Lecture 1
     Dr. Dan Gaston, Postdoctoral Fellow, Department of Pathology, Dr. Karen Bedard Lab
     Bioinformatician, IGNITE Project
  2. Introduction to Next-Gen Sequencing (LECTURE 1)
  3. Overview: Lecture 1
     • Introduction AKA “Why does this matter?”
     • “Next-Gen” Sequencing
     • Bioinformatics Workflows
     • Types of Next-Gen Experiments
     • Working with the Human Genome
     • Slides available on slideshare:
       – http://www.slideshare.net/DanGaston
  4. Major Areas in Human Disease Genomics
     • Complex diseases
       – Genome Wide Association Studies (GWAS)
     • Cancer
       – Tumour genomics (Driver mutations)
       – Transcriptomics
     • Mendelian disease
       – Whole Genome/Exome Sequencing
       – Transcriptomics
       – Genetic Linkage
  5. Diagnosing Genetic Diseases
     • Genetic counselors/physicians order individual testing of genes based on patient phenotype
     • For rare diseases or unusual phenotypes, they may run tens to hundreds of tests
     • …EXPENSIVE (easily thousands of dollars)
  6. Genetic Disease Research
  7. Genetic Disease Research: Cutis Laxa
     Chromosome 9: 120,962,282-133,033,431
  8. Cutis Laxa
     • Linked genomic region ~13 Mb in size
     • Contains 143 genes
     • Prioritize and select genes for individual Sanger sequencing
     • …Slow
     • …Laborious
     • …Can be expensive
  9. Personalized Medicine
  10. Human Genomics
      • $5,000 - $10,000 to sequence a whole genome
      • $1,000 to sequence only the protein-coding portion (exome, later)
  11. Clinical Genomics
      • Rapid diagnosis of genetic disease in NICU cases
      • Quicker and cheaper than sequential genetic testing (traditional method)
  12. Cancer Genomics
      Welch JS, et al. JAMA 2011;305:1577
  13. Cancer Chemotherapy Resistance
  14. Human Disease Genomics at Dalhousie
      • IGNITE: Identifying genetic mutations causing rare Mendelian diseases in Atlantic Canada
        – 3 year, $2.5 million Genome Canada project
        – Currently working on >10 different diseases, including two inherited cancers
        – Sequenced >20 individual exomes, 4 whole genomes, and several transcriptomes
        – More on Thursday…
      • Dr. Graham Dellaire: Transcriptome sequencing and analysis on multiple cancer cell lines
  15. Short Reads
      Millions of paired “short reads”, 75-150 bp each
  16. FastQ Format
      Each record: Read ID, Sequence, Quality line
  17. FastQ Quality Scores
      Quality Score (Q)   Probability of incorrect base call   Base call accuracy
      10                  1 in 10                              90%
      20                  1 in 100                             99%
      30                  1 in 1,000                           99.9%
      40                  1 in 10,000                          99.99%
      50                  1 in 100,000                         99.999%
      Q = -10 log10(P), where P is the probability of an incorrect base call
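The relation Q = -10 log10(P) in the table above can be checked with a short Python sketch. The function names are illustrative, and the ASCII offset of 33 used to decode a quality line is an assumption (the common Sanger-style encoding); the slides do not state which encoding the course data uses.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def decode_quality_line(quality: str, offset: int = 33) -> list:
    """Turn a FastQ quality string into integer Phred scores.

    offset=33 is an assumption (Sanger-style encoding).
    """
    return [ord(ch) - offset for ch in quality]

if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.3f}%")
```

For example, under the offset-33 assumption the quality character `I` decodes to Q40 and `5` decodes to Q20.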
  18. Quality Scores of Sequencing Reads
  19. General Genomics Workflow
      Raw Data Analysis:     Quality control of raw data
      Whole Genome Mapping:  Alignment to reference genome
      Variant Calling:       Detection of genetic variation (SNPs, indels, SV)
      Annotation:            Linking variants to biological information
  20. Short Read Mapping
      [Figure: overlapping short reads tiled above the reference sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
      1) Report the location in the genome where the read matches best
      2) Minimize mismatches
      3) Mismatches at lower-quality bases are better than mismatches at higher-quality bases
  21. Discovering Genetic Variation
      [Figure: reads aligned above and below the reference genome ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG. SNPs: several reads carry a T where the reference has an A. INDELs: several reads show a gap relative to the reference.]
  22. Next-Gen Sequencing Experiments
      • Whole Genome Sequencing
      • Targeted Exome Sequencing
      • RNA-Seq
      • ChIP-Seq
      • CLIP-Seq
  24. Composition of Human Genome
      Size: 3.2 Gb
  25. Genomic Content
      Chromosome  Base pairs   Variations  Confirmed proteins  Putative proteins  Pseudogenes  miRNA  rRNA  Misc ncRNA
      1           249,250,621  4,401,091   2,012               31                 1,130        134    66    106
      2           243,199,373  4,607,702   1,203               50                 948          115    40    93
      3           198,022,430  3,894,345   1,040               25                 719          99     29    77
      4           191,154,276  3,673,892   718                 39                 698          92     24    71
      5           180,915,260  3,436,667   849                 24                 676          83     25    68
      6           171,115,067  3,360,890   1,002               39                 731          81     26    67
      7           159,138,663  3,045,992   866                 34                 803          90     24    70
      8           146,364,022  2,890,692   659                 39                 568          80     28    42
      9           141,213,431  2,581,827   785                 15                 714          69     19    55
      10          135,534,747  2,609,802   745                 18                 500          64     32    56
      11          135,006,516  2,607,254   1,258               48                 775          63     24    53
      12          133,851,895  2,482,194   1,003               47                 582          72     27    69
      13          115,169,878  1,814,242   318                 8                  323          42     16    36
      14          107,349,540  1,712,799   601                 50                 472          92     10    46
      15          102,531,392  1,577,346   562                 43                 473          78     13    39
      16          90,354,753   1,747,136   805                 65                 429          52     32    34
      17          81,195,210   1,491,841   1,158               44                 300          61     15    46
      18          78,077,248   1,448,602   268                 20                 59           32     13    25
      19          59,128,983   1,171,356   1,399               26                 181          110    13    15
      20          63,025,520   1,206,753   533                 13                 213          57     15    34
      21          48,129,895   787,784     225                 8                  150          16     5     8
      22          51,304,566   745,778     431                 21                 308          31     5     23
      X           155,270,560  2,174,952   815                 23                 780          128    22    52
      Y           59,373,566   286,812     45                  8                  327          15     7     2
      mtDNA       16,569       929         13                  0                  0            0      2     22
  26. Exome Sequencing
  27. Transcriptomics: RNA-Seq
      • Sequence the actively transcribed genes in a cell line or tissue
        – Only about 20% of genes are transcribed in a particular cell type
      • Two types:
        – Poly-A selection
        – Total RNA + ribodepletion
      • Many experimental questions can be addressed
  28. RNA-Seq: Gene Expression
      [Figure: read coverage over a gene in Condition 1 vs Condition 2]
  29. RNA-Seq: Differential Splicing
      [Figure: Exon 1, Exon 2, Exon 3]
  30. RNA-Seq: Novel/Non-Canonical Exon Discovery
      [Figure: Exon 1, Exon 2, Exon X, Exon 3]
  31. RNA-Seq: Gene Fusion Events
      [Figure: Exon 1, Exon 2, Exon 3, Gene 2 Exon 4]
  32. RNA-Seq
      • Important to take biological variability into account: a sample of cells is a mixed population
        – Replicates!
      • Not suited for discovering polymorphisms, due to higher error rates introduced by the reverse transcription step (RNA -> cDNA)
      • High false positive rates for fusion gene and novel exon discovery when expression levels are low
  33. ChIP-Seq
  35. Short Read Mapping: Placing Millions of Reads on Human Reference
      • Problem: Efficiently and accurately place millions of reads (75 bp – 200 bp) within 3.2 Gb of reference genome
      • Problem: A read may match equally well at more than one location (pseudogenes, copy number variation, repetitive elements)
      • Problem: Sequencing reads may be paired
  36. Short Read Mapping: Brute Force Method
      Simple conceptually: compare each query k-mer to all k-mers of the genome
      Genome size (N): 3.2 billion bases
      K-mer length (M): 7
      Number of comparisons ((N - M + 1) × M): about 22 billion
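To make the cost of the brute-force approach concrete, here is a minimal Python sketch. The function names are illustrative and positions are 0-based (an assumption, since the slides number positions from 1). For N = 3.2 billion and M = 7, the formula (N - M + 1) × M evaluates to roughly 22.4 billion character comparisons for a single query.

```python
def brute_force_find(genome: str, query: str) -> list:
    """Return every 0-based position where query occurs in genome."""
    n, m = len(genome), len(query)
    hits = []
    # Slide a window of length m across the genome and compare at each offset.
    for start in range(n - m + 1):
        if genome[start:start + m] == query:
            hits.append(start)
    return hits

def worst_case_comparisons(n: int, m: int) -> int:
    """Upper bound on character comparisons: (N - M + 1) windows of M characters."""
    return (n - m + 1) * m

if __name__ == "__main__":
    print(brute_force_find("TGATTACAGATTACC", "ATTAC"))
    print(worst_case_comparisons(3_200_000_000, 7))
```

This is per query; with millions of reads the total work multiplies accordingly, which is the motivation for indexing the reference.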
  37. Solution: Index the Reference Genome
      Indexing the reference is like constructing a phonebook: quickly move towards the relevant portion of the genome and ignore the rest.
  38. Short Read Alignment: Suffix Array
      Split the genome into all suffixes (substrings that run to the end of the sequence) and sort them alphabetically
      Allows a query to be searched against an alphabetical reference, skipping 96% of the genome
      Ex: banana
      Suffixes   Sorted
      banana     a
      anana      ana
      nana       anana
      ana        banana
      na         na
      a          nana
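A suffix array for the banana example can be built in a few lines of Python. This is a naive sketch for illustration only: it stores the 0-based start position of each suffix and sorts by comparing whole suffixes, which is far too slow and memory-hungry for a real genome.

```python
def build_suffix_array(text: str) -> list:
    """Return the start positions of all suffixes of text, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

if __name__ == "__main__":
    text = "banana"
    for rank, start in enumerate(build_suffix_array(text), 1):
        print(rank, text[start:], start)
```

Storing only the start positions, rather than the suffixes themselves, is what keeps the structure linear in the size of the text.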
  39. Short Read Alignment: Binary Search
      • Searching the index efficiently is still a problem…
      Search for GATTACA… (the following four slides step through the search on the same index)
      Index
      #   Sequence            Pos
      1   ACAGATTACC…         6
      2   ACC…                13
      3   AGATTACC…           8
      4   ATTACAGATTACC…      3
      5   ATTACC…             10
      6   C…                  15
      7   CAGATTACC…          7
      8   CC…                 14
      9   GATTACAGATTACC…     2
      10  GATTACC…            9
      11  TACAGATTACC…        5
      12  TACC…               12
      13  TGATTACAGATTACC…    1
      14  TTACAGATTACC…       4
      15  TTACC…              11
  44. Binary Search
      • Initialize search range to entire list
        – mid = (hi+lo)/2; middle = suffix[mid]
        – if query matches middle: done
        – else if query < middle: pick low range
        – else if query > middle: pick hi range
      • Repeat until done or empty range
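The pseudocode above translates directly into Python. This sketch rebuilds the naive suffix array inside the example so it is self-contained, compares the query against only the first len(query) characters of each suffix, and reports a 0-based position (the slide's index table counts from 1). It returns one matching position, not all of them.

```python
from typing import Optional

def find_in_suffix_array(text: str, suffix_array: list, query: str) -> Optional[int]:
    """Binary search a sorted suffix array; return one 0-based hit or None."""
    lo, hi = 0, len(suffix_array) - 1
    while lo <= hi:                      # stop when the range is empty
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        middle = text[start:start + len(query)]
        if query == middle:
            return start                 # query matches the middle suffix: done
        elif query < middle:
            hi = mid - 1                 # continue in the low range
        else:
            lo = mid + 1                 # continue in the high range
    return None

if __name__ == "__main__":
    text = "TGATTACAGATTACC"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    print(find_in_suffix_array(text, suffix_array, "GATTACA"))
```

With 15 suffixes the search needs at most 4 probes; for a 3.2 billion base genome it needs about 32, since each probe halves the remaining range.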
  45. Applied to Human Genome
      • In practice, simple methods of indexing the genome can create very large data structures
        – Suffix Array: > 12 GB
      • Solution: Apply more complex procedures that allow you to index and compress the data:
        – Burrows-Wheeler Transform
        – FM-Index
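As a rough illustration of the Burrows-Wheeler Transform named above, the following Python sketch builds it by sorting all rotations of the text and reading off the last column, then inverts it to show that no information is lost. This is the textbook construction, not how production aligners build their index; the `$` terminator is an assumption and must not occur in the text.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (textbook construction)."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Recover the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]

if __name__ == "__main__":
    encoded = bwt("banana")
    print(encoded, inverse_bwt(encoded))
```

The transform tends to group identical characters into runs, which is what makes the result compressible while still supporting search through an FM-index.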
  46. Short Read Mapping: Mapping Quality
      • Have also ignored quality scores of reads
      • Mapping Quality (for a read): sum the quality scores at mismatched bases for the best alignment (SUM_BASE_Q(best)), and also consider all other possible alignments i
      $\mathrm{MQ} = -\log_{10}\left(1 - \dfrac{10^{-\mathrm{SUM\_BASE\_Q}(\mathrm{best})}}{\sum_i 10^{-\mathrm{SUM\_BASE\_Q}(i)}}\right)$
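A minimal Python sketch of the mapping quality formula as written on the slide. The function name and the `cap` parameter are hypothetical; the cap is only there because a read with a single candidate alignment would otherwise get an infinite score. The exponents follow the slide literally; real aligners typically report a Phred-scaled value and may scale the sums differently.

```python
import math

def mapping_quality(sum_base_q: list, cap: float = 60.0) -> float:
    """MQ = -log10(1 - 10^-S_best / sum_i 10^-S_i), following the slide.

    sum_base_q holds one value per candidate alignment: the summed base
    qualities at mismatched positions. The best alignment has the smallest sum.
    """
    best = min(sum_base_q)
    # Divide numerator and denominator by 10^-best to avoid floating-point underflow.
    weights = [10 ** -(s - best) for s in sum_base_q]
    p_best = 1.0 / sum(weights)
    p_wrong = 1.0 - p_best
    if p_wrong <= 0.0:
        return cap                     # effectively unique placement
    return min(cap, -math.log10(p_wrong))

if __name__ == "__main__":
    print(mapping_quality([30.0, 30.0]))   # two equally good placements
    print(mapping_quality([0.0, 2.0]))     # second placement is clearly worse
```

Two equally good placements give a probability of 0.5 that the chosen one is wrong, so the score is close to zero, which matches the intuition that such a read cannot be placed confidently.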
  47. Short Read Aligners
      • BLAT: BLAST-Like Alignment Tool
      • MAQ: First to take quality scores into account
      • BWA: First to use Burrows-Wheeler Transform
      • Bowtie: Ungapped alignment only
      • Bowtie2: Allows indels
      • … and many more
  48. Identifying and Annotating Genomic Variation for Disease Gene Discovery (LECTURE 2)
  49. Genetic Variation
      • dbSNP (NCBI) catalogues > 53 million Single Nucleotide Variations (SNVs) in humans
        – 38 million validated
        – 22 million in genes
        – 36 million with frequencies
      • 50-80% of mutations involved in inherited disease are SNVs
  50. SNP vs SNV
      • Technically, a polymorphism is a variation that doesn’t cause disease and is common in a population
      • What is common?
        – Greater than 5% in a population is a typical definition
        – Definitions of rare range from < 0.5% to < 1.5%
  51. Discovering Genetic Variation
      [Figure: reads aligned above and below the reference genome ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG. SNPs: several reads carry a T where the reference has an A. INDELs: several reads show a gap relative to the reference.]
  52. Variant Calling: The Absurdly Simple Way
      [Figure: reads aligned above the reference genome, with one base position highlighted]
      Read depth at base: 10
      T: 4
      A: 6
      Genotype: Heterozygous A/T
  53. Variant Calling: The Absurdly Simple Way
      • Algorithm:
        – Count all aligned bases that pass the quality threshold (e.g. >Q20)
        – If the fraction of reads with the alternative base is > lower bound (20%) and < upper bound (80%), call heterozygous alt
        – Else if > upper bound, call homozygous alternative
        – Else call homozygous reference
      • …But what about using base qualities for more than deciding which reads to keep?
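The counting algorithm above fits in a short Python function. The function name, argument layout, and returned genotype strings are illustrative choices, not part of any tool. One detail is an assumption: a fraction exactly equal to the upper bound is treated as homozygous alternative, because the slide leaves that boundary case unspecified.

```python
from collections import Counter

def call_genotype(bases, quals, ref, min_q=20, lower=0.2, upper=0.8):
    """Call a genotype at one position from aligned bases and their qualities."""
    # Keep only base calls that pass the quality threshold (> Q20 by default).
    kept = [b for b, q in zip(bases, quals) if q > min_q]
    if not kept:
        return "no call"
    counts = Counter(kept)
    alt_counts = {b: n for b, n in counts.items() if b != ref}
    if not alt_counts:
        return f"{ref}/{ref}"
    alt = max(alt_counts, key=alt_counts.get)   # most frequent non-reference base
    alt_fraction = alt_counts[alt] / len(kept)
    if lower < alt_fraction < upper:
        return f"{ref}/{alt}"                    # heterozygous
    if alt_fraction >= upper:
        return f"{alt}/{alt}"                    # homozygous alternative
    return f"{ref}/{ref}"                        # homozygous reference

if __name__ == "__main__":
    # Depth 10 with A: 6 and T: 4 against reference A, all bases at Q30.
    print(call_genotype("AAAAAATTTT", [30] * 10, "A"))
```

With depth 10, six A and four T against a reference A, the alternative fraction is 0.4 and the call is heterozygous A/T.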
  54. Improving Variant Calling
      • MAQ (Mapping and Assembling with Quality):
        – Short read mapper and genotype caller
        – First to use base qualities for either
        – Introduced mapping quality
  55. Improving Variant Calling
      ① Base quality cannot be more reliable than the mapping quality of the read
      ② At most, an individual can have two real nucleotides at a position (two alleles)
         ① Only consider the two most frequent nucleotides
         ② Simplify to two states: A and B
  56. Improving Variant Calling
      • Three possible genotypes:
        – AA, BB, AB
      • Construct a model that includes base quality to estimate the probability of error
      • Calculate the probability of each genotype given the data and error rate
      • The genotype with the highest probability is called
  57. The Model
  58. The Model
      g = genotype
      e = error probability
      m = ploidy (2)
      k = number of reads
  59. The Model
      Reads that match reference
  60. The Model
      Reads that don’t match reference
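The model's equation is not recoverable from this transcript, so the following Python sketch uses one common formulation as an assumption: g is taken to be the number of reference alleles in the genotype (0, 1, or 2 for a diploid), each read matching the reference contributes (m - g)·e + g·(1 - e), each non-matching read contributes (m - g)·(1 - e) + g·e, and the product is divided by m^k. No genotype prior is applied, so this picks the maximum-likelihood genotype only.

```python
def genotype_likelihood(g, ref_errors, alt_errors, m=2):
    """Likelihood of the reads given g reference alleles out of ploidy m.

    ref_errors / alt_errors: per-read error probabilities e for reads that
    match / do not match the reference base. The formula is an assumed
    formulation, not taken from the slides.
    """
    k = len(ref_errors) + len(alt_errors)
    likelihood = 1.0
    for e in ref_errors:                       # reads that match reference
        likelihood *= (m - g) * e + g * (1 - e)
    for e in alt_errors:                       # reads that don't match reference
        likelihood *= (m - g) * (1 - e) + g * e
    return likelihood / m ** k

def most_likely_genotype(ref_errors, alt_errors, m=2):
    """Return (g, likelihoods) where g maximizes the likelihood."""
    likelihoods = {g: genotype_likelihood(g, ref_errors, alt_errors, m)
                   for g in range(m + 1)}
    return max(likelihoods, key=likelihoods.get), likelihoods

if __name__ == "__main__":
    # 6 reads match the reference, 4 do not, all at Q30 (e = 0.001).
    best, likelihoods = most_likely_genotype([0.001] * 6, [0.001] * 4)
    print(best, likelihoods)
```

With 6 matching and 4 non-matching reads at Q30, g = 1 (heterozygous, AB) is by far the most likely genotype, agreeing with the simple threshold call on the same data.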
  61. Improving Variant Calling
      • Two widely used tool sets for calling variants
        – samtools (uses MAQ-type calculation)
        – Genome Analysis Toolkit (GATK) UnifiedGenotyper
      • UnifiedGenotyper: Capable of calling both indels and single nucleotide polymorphisms (SNPs), and of estimating allele frequencies given multiple samples
  62. UnifiedGenotyper
      Apply filters to discard poor reads and remove biases:
      ① Duplicate reads
      ② Malformed reads (i.e. mismatch between number of bases and number of base qualities)
      ③ Bad mate (paired-end sequencing, paired reads map to different chromosomes)
      ④ Mapping quality zero (maps to multiple locations equally well)
      ⑤ Require fewer than 10% mismatches on the read within 20 bp on either side of the position
  63. Remove Duplicate Reads
      Application   Avg #Molecules/Library   Read Length   Avg #Molecules Sampled   Molecules Sampled > 1
      30x Genome    5bn                      2x100         450m                     4.4%
      4x Genome     5bn                      2x100         60m                      0.6%
      100x Exome    500m                     2x75          20m                      2.0%
      Duplicate reads break the assumption of independent sampling from the library
      Identify reads with identical start/stop positions
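The rule "identify reads with identical start/stop positions" can be sketched in Python as follows. The `Read` record and its fields are hypothetical, and keeping the read with the highest mean base quality from each duplicate group is an assumption about how to choose the survivor, not something the slides specify.

```python
from typing import NamedTuple

class Read(NamedTuple):
    """Hypothetical minimal aligned-read record for illustration."""
    name: str
    chrom: str
    start: int
    end: int
    mean_quality: float

def remove_duplicates(reads):
    """Keep one read per (chrom, start, end), preferring higher mean quality."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.end)
        if key not in best or read.mean_quality > best[key].mean_quality:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.end))

if __name__ == "__main__":
    reads = [
        Read("r1", "chr1", 100, 200, 30.0),
        Read("r2", "chr1", 100, 200, 35.0),   # same coordinates as r1
        Read("r3", "chr1", 150, 250, 20.0),
    ]
    print([r.name for r in remove_duplicates(reads)])
```

A fuller treatment would also take strand and mate position into account for paired-end data.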
  64. Sequencer-Specific Error Models
      If a base was miscalled, what is it most likely to be called as instead?
                     Predicted Base
      Actual Base    A      C      G      T
      A              -      57.7   17.1   25.2
      C              34.9   -      11.3   53.9
      G              31.9   5.1    -      63.0
      T              45.9   22.1   32.0   -
  65. Variant Calling
      • SNP calls infested with false positives
        – Machine artifacts
        – Mis-mapped reads
        – Mis-aligned indels
      • 5-20% false positive rate
  70. Decisions and Trade-Offs
      • Option 1: Use stringent program options for calling variants and hard filtering early to produce only a highly confident call set.
        – Pro: Few false positives
        – Con: Will miss real variants
      • Option 2: Use less stringent (but reasonable) options and filtering. Produce a high-confidence call set. Progressive filtering at a later stage
        – Pro: Won’t miss real variants
        – Con: Many more false positives
  71. How Good Are My Calls?
      • How many called SNPs?
        – Human average of 1 heterozygous SNP / 1000 bases
      • Fraction of variants already in dbSNP
      • Transition/Transversion ratio
        – Transitions 2x as common
        – 2.8x when looking only at exons
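The transition/transversion check above is easy to compute from a list of called SNVs. In this Python sketch the input format, a list of (reference base, alternative base) pairs, is an assumption for illustration. Transitions are purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) changes; everything else is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)

def ti_tv_ratio(snvs) -> float:
    """Transition/transversion ratio for (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        return float("inf")
    return transitions / transversions

if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
    print(ti_tv_ratio(calls))
```

A call set whose ratio falls well below the expected value (about 2 genome-wide, higher in exons) is a hint that it contains many false positives, since random errors would push the ratio towards 0.5.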
  72. ANNOTATING VARIANTS
  73. Identifying Genetic Variation Causing Genetic Disease
  75. Discovering Genetic Variants Causing Mendelian Disease
      4 million genetic variants
      2 million associated with protein-coding genes
      10,000 possibly of disease-causing type
      1,500 with <1% frequency in population
      Single causal genetic variant
  76. If a problem cannot be solved, enlarge it. --Dwight D. Eisenhower
  77. TYPES OF SINGLE NUCLEOTIDE VARIANTS
  84. Disease Genomics: Hunting Down Pathogenic Genetic Variation
      [Figure: Reference gene (Start, Exon 1, Intron 1, Exon 2, Stop codon TAA) with splice sites marked and the mRNA coding for protein, compared with the Patient gene (Exon 1, Intron 1, Exon 2). Labels on the codon example: TAA (Stop), TAC (Tyr).]
      Variant types shown: Splice Site Loss, Missense/Frameshift, Stop Gain
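The coding consequences named on these slides (missense, stop gain) can be illustrated by translating a reference codon and a patient codon and comparing the results. This Python sketch uses the standard genetic code; the function name and the returned labels are illustrative, and splice-site and frameshift effects are outside its scope because they cannot be judged from a single codon.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gain"
    if ref_aa == "*":
        return "stop loss"
    return "missense"

if __name__ == "__main__":
    print(classify_codon_change("TAC", "TAA"))   # Tyr codon becomes a stop codon
    print(classify_codon_change("GAG", "GTG"))   # Glu becomes Val
```

A single base change from TAC (Tyr) to TAA (Stop) truncates the protein, which is why stop-gain variants are prioritized over most missense changes.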
  85. GENETIC REGIONS OF INTEREST
  86. Identifying Genetic Regions of Interest
  87. Number of Genes in Genomic Regions of Interest
  88. FREQUENCY OF GENETIC VARIANTS
  89. Frequency of Polymorphisms: Common vs Rare
      • Mendelian disorders are caused by rare variation, < 1-2% frequency in the relevant population
      • Leverage large projects aimed at assessing genetic diversity in populations around the world
        – 1000 Genomes
        – NHLBI Exome Sequencing Project
  90. Human Populations
  91. 91. Population Matters• Most variations in protein-coding genes occurred fairly recently (last 20,000 years) – Adaptation to agriculture and diet changes, pathogen exposure and urban living
  92. 92. Population Matters• Most variations in protein-coding genes occurred fairly recently (last 20,000 years) – Adaptation to agriculture and diet changes, pathogen exposure and urban living• Monogenic diseases have different prevalence in different populations – Cystic fibrosis in European populations – Hereditary hemochromatosis in Northern Europeans – Tay-Sachs in Ashkenazi Jews – Sickle-cell anemia in sub-Saharan African populations
  93. 93. Population Matters• Most variations in protein-coding genes occurred fairly recently (last 20,000 years) – Adaptation to agriculture and diet changes, pathogen exposure and urban living• Monogenic diseases have different prevalence in different populations – Cystic fibrosis in European populations – Hereditary hemochromatosis in Northern Europeans – Tay-Sachs in Ashkenazi Jews – Sickle-cell anemia in sub-Saharan African populations• Polygenic disorders
  94. 94. 1000 Genomes Project
  95. 95. Exome Sequencing Project• Multi-Institutional• Total possible patient pool of > 250,000 individuals, well phenotyped – Includes healthy individuals and diseased• Currently 6700 exomes sequenced – 4420 European descent – 2312 African American• 1.2 million coding variations – Most extremely rare/unique – Many population specific
  96. 96. IGNITE Project: Local Controls• IGNITE: Tasked with studying rare monogenic diseases identified in Atlantic Canada• Atlantic Canada harbours several non-represented population groups and sub-groups…
  97. 97. IGNITE Project: Local Controls• IGNITE: Tasked with studying rare monogenic diseases identified in Atlantic Canada• Atlantic Canada harbours several non-represented population groups and sub-groups… – Acadians – Native American – Non-Acadian/European Descent
  98. 98. Population Frequency• Mendelian disorders are rare• If variation is in database, is it associated with disease?• Causal variation also needs to be rare – Cutoff somewhere in the <0.5% to <1.5% range – Should appear rarely or not at all in local controls – Should track with disease in family members under study
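The three criteria on this slide can be sketched as a simple filter. This is a minimal illustration, not the IGNITE implementation: the dictionary keys (`pop_af`, `local_control_count`, `carriers`), the function name, and the default thresholds are all assumptions chosen for the example, and the segregation check uses the simplest possible model (every affected family member carries the variant).

```python
def passes_frequency_filters(variant, affected, max_af=0.01, max_local_controls=1):
    """Return True if a variant is rare enough and tracks with disease.

    variant  -- dict with hypothetical keys:
                "pop_af": allele frequency in a reference database, or None if absent
                "local_control_count": times seen in local control samples
                "carriers": sample IDs in the family that carry the variant
    affected -- sample IDs of affected family members
    """
    pop_af = variant.get("pop_af")
    # Absent from the database counts as rare; otherwise apply the cutoff.
    if pop_af is not None and pop_af >= max_af:
        return False
    # Should appear rarely or not at all in local controls.
    if variant.get("local_control_count", 0) > max_local_controls:
        return False
    # Simplest segregation model: every affected member carries the variant.
    return set(affected) <= set(variant["carriers"])


candidate = {"pop_af": 0.002, "local_control_count": 0, "carriers": ["P1", "P2"]}
print(passes_frequency_filters(candidate, affected=["P1", "P2"]))
```

The `max_af` default of 0.01 sits inside the range given on the slide; the right value depends on the disorder and on how well the reference database represents the local population.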
  99. 99. Predicting the Impact of Missense Mutations• Most use some level of evolutionary conservation to determine how severe a mutation is – SIFT – PolyPhen – GERP++ – EvoD
  100. 100. Example: SIFT Algorithm: Input Query Sequence → Psi-BLAST → Homologs → Multiple Sequence Alignment → PSSM → Normalize by most frequent AA → Score
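The last step of this pipeline, normalizing by the most frequent amino acid, can be illustrated on a toy alignment. This is a deliberately simplified sketch and not the SIFT implementation: it uses raw column counts with no pseudocounts or sequence weighting, and the function names and the example alignment are hypothetical.

```python
from collections import Counter


def normalized_column_scores(column):
    """Score each amino acid in an alignment column relative to the most frequent one."""
    counts = Counter(aa for aa in column if aa != "-")  # ignore gaps
    if not counts:
        return {}
    top = max(counts.values())
    return {aa: n / top for aa, n in counts.items()}


def substitution_score(alignment, position, alt_aa):
    """Score for substituting alt_aa at a 0-based alignment position.

    Values near 1 mean the residue is as common as the consensus residue;
    0 means it was never observed among the aligned homologs.
    """
    column = [seq[position] for seq in alignment]
    return normalized_column_scores(column).get(alt_aa, 0.0)


# Toy alignment of a query and three homologs (equal-length, already aligned).
alignment = ["MKV", "MKV", "MRV", "MKI"]
print(substitution_score(alignment, 1, "R"))  # R is seen once, K three times
print(substitution_score(alignment, 0, "L"))  # L never seen at a fully conserved site
```

On a fully conserved column every unseen substitution scores 0, which matches the intuition that the more conserved a site is, the more likely any change there is deleterious.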
  101. 101. Predicting Impact• Other approaches include additional features: – Protein structure information – Site level annotation (active sites, binding sites, etc) – Protein domain information – Biophysical properties of amino acids in that position and of the substituted amino acid
  102. 102. Prediction Take-Away: The more conserved a site is, the more likely any substitution is to be deleterious. However: current methods have fairly poor performance and are not suitable for clinical-level diagnosis
  103. 103. Classifying Genetic Variants: 4 million variants → Intronic; Exonic; Intergenic. Categories shown: Unknown; Splice Site; Silent Mutation; Amino Acid Changing; Splice Site. Potential Disease Causing: Known Genetic Disease Variant; Stop Loss / Stop Gain; Missense Mutation; Known Polymorphism in Population
  104. 104. GENE LEVEL ANNOTATION
  105. 105. Annotating Genes and Variants• Is variant in a known protein-coding gene? – What does the gene do? – What molecular pathways? – What protein-protein interactions? – What tissues is it expressed in? – When in development?• Variant funnel: 4 million genetic variants → 2 million associated with protein-coding genes → 10,000 possibly of disease-causing type → 1,500 at <1% frequency in population
  106. 106. Gene Level Annotations
  107. 107. ADDING ANNOTATIONS TO VARIANTS
  108. 108. Genomic Intervals, Searching, and Annotation• Most common way of describing genomic features is as an interval• Multiple formats (BED, WIG, VCF, etc)• In common for all is location: – Chromosome – Start Position of Feature – End Position of Feature – Annotations/Info (Optional)
  109. 109. Searching and Annotating: Interval Trees• Interval Trees allow efficient searching of all overlapping intervals• Easiest to make one tree per chromosome• Given a set of intervals (n) on a number line (chromosome) construct a tree
  110. 110. Interval Trees• Left subtree: all intervals to left• Right subtree: all intervals to right• Node contains: – Centre point – Intervals sorted by start – Intervals sorted by end
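The node layout above corresponds to a centred interval tree, which can be sketched as follows. This is a minimal illustration under stated assumptions: intervals are closed `[start, end]` integer ranges (BED files use 0-based half-open coordinates and would need converting), features are `(chrom, start, end, label)` tuples, and the names `IntervalNode`, `query_range`, and `build_trees` are hypothetical.

```python
class IntervalNode:
    """One node of a centred interval tree over closed (start, end, label) intervals."""

    def __init__(self, intervals):
        # Use the median endpoint as the centre so at least one interval
        # always contains it and both subtrees are strictly smaller.
        endpoints = sorted(p for start, end, _ in intervals for p in (start, end))
        self.centre = endpoints[len(endpoints) // 2]
        left = [iv for iv in intervals if iv[1] < self.centre]
        right = [iv for iv in intervals if iv[0] > self.centre]
        here = [iv for iv in intervals if iv[0] <= self.centre <= iv[1]]
        self.by_start = sorted(here, key=lambda iv: iv[0])              # ascending start
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)  # descending end
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query_range(self, q_start, q_end):
        """Return every stored interval overlapping the closed range [q_start, q_end]."""
        hits = []
        if q_end < self.centre:
            # Intervals here end at or after the centre, so only their start matters.
            for iv in self.by_start:
                if iv[0] > q_end:
                    break
                hits.append(iv)
            if self.left:
                hits += self.left.query_range(q_start, q_end)
        elif q_start > self.centre:
            # Intervals here start at or before the centre, so only their end matters.
            for iv in self.by_end:
                if iv[1] < q_start:
                    break
                hits.append(iv)
            if self.right:
                hits += self.right.query_range(q_start, q_end)
        else:
            # The query spans the centre: everything here overlaps; search both sides.
            hits.extend(self.by_start)
            if self.left:
                hits += self.left.query_range(q_start, q_end)
            if self.right:
                hits += self.right.query_range(q_start, q_end)
        return hits


def build_trees(features):
    """Build one tree per chromosome from (chrom, start, end, label) tuples."""
    by_chrom = {}
    for chrom, start, end, label in features:
        by_chrom.setdefault(chrom, []).append((start, end, label))
    return {chrom: IntervalNode(ivs) for chrom, ivs in by_chrom.items()}
```

To annotate a single-base variant, query with the same position as start and end, for example `trees["chr9"].query_range(pos, pos)`; keeping the two sorted lists per node is what lets a query stop scanning as soon as the first non-overlapping interval is reached.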
  111. 111. IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa CASE STUDIES
  112. 112. IGNITE Data Pipeline and Integration (diagram labels): Gene Annotations; Annotated Genomic Variants; Mapped Region(s); Gene Definitions; Filter, Sort, Prioritize; Known Genes; Pathway and Interactions
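The filter, sort, and prioritize steps can be sketched as ranking genes in the mapped region by their links to known causative genes. This is only an illustration of the idea, not the IGNITE pipeline: the function name, the scoring rule (one point per shared pathway or direct interaction), and the placeholder gene and pathway names are all assumptions.

```python
def prioritize(region_genes, known_genes, pathways, interactions):
    """Rank region genes by shared pathways and interactions with known genes.

    pathways     -- dict: gene -> iterable of pathway names
    interactions -- dict: gene -> iterable of interacting gene names
    Returns (gene, interacting_known_genes, shared_pathways) tuples, best first.
    """
    known = set(known_genes)
    known_pathways = {p for gene in known for p in pathways.get(gene, ())}
    ranked = []
    for gene in region_genes:
        shared = sorted(set(pathways.get(gene, ())) & known_pathways)
        partners = sorted(set(interactions.get(gene, ())) & known)
        if shared or partners:  # filter: keep only genes with some link
            ranked.append((gene, partners, shared))
    # Sort by number of links (descending), then by gene name for stable output.
    ranked.sort(key=lambda row: (-(len(row[1]) + len(row[2])), row[0]))
    return ranked


# Toy input with placeholder names; real input would come from annotation databases.
toy_pathways = {
    "KNOWN_1": ["Endocytosis"],
    "GENE_A": ["Endocytosis"],
    "GENE_B": ["Endocytosis", "Other"],
    "GENE_C": ["Other"],
}
toy_interactions = {"GENE_B": ["KNOWN_1"]}
for row in prioritize(["GENE_A", "GENE_B", "GENE_C"], ["KNOWN_1"],
                      toy_pathways, toy_interactions):
    print(row)
```

Equal weighting of pathways and interactions is the simplest possible choice; a real prioritization would likely weight evidence types differently and combine them with the variant-level filters.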
  113. 113. Brain Calcification
  114. 114. Brain Calcification• 84 genes in chromosome 5 region• No likely homozygous or compound heterozygous variants within region shared between two patients• 29 genes with at least one targeted region with little or no sequencing coverage• Many only lacked coverage in 5’ and 3’ UTRs• Collaborators performed statistical tests for possible copy-number variations of targeted regions using exome sequencing data
  115. 115. Brain Calcification
  116. 116. Charcot-Marie-Tooth: Genetic Mapping Chromosome 9: 120,962,282-133,033,431
  117. 117. Cutis Laxa: Genetic Mapping Chromosome 17: 79,596,811-81,041,077
  118. 118. Charcot-Marie-Tooth• 143 genes in region• 13 known causative genes – MPZ – PMP22 – GDAP1 – KIF1B – MFN2 – SOX – EGR2 – DNM2 – RAB7 – LITAF (SIMPLE) – GARS – YARS – LMNA
       Cutis Laxa• 52 genes in region• 5 known causative genes – ATP6V0A2 – ELN – FBLN5 – EFEMP2 – SCYL1BP1 – ALDH18A1
  119. 119. Pathway and Interaction Data
       Charcot-Marie-Tooth• 37 pathways – Clathrin-derived vesicle budding – Lysosome vesicle biogenesis – Endocytosis – Golgi-associated vesicle biogenesis – Membrane trafficking – Trans-Golgi network vesicle budding• Primarily LMNA or DNM2
       Cutis Laxa• 10 pathways – Phagosome – Collecting duct acid secretion – Lysosome – Protein digestion and absorption – Metabolic pathways – Oxidative phosphorylation – Arginine and proline metabolism• Primarily ATP6V0A2
  120. 120. Results: Charcot-Marie-Tooth• 8 genes prioritized
       Gene      Interactions  Pathway
       LRSAM1    Multiple      Endocytosis
       DNM1      DNM2          -
       FNBP1     DNM2          -
       TOR1A     LMNA          -
       STXBP1    Multiple      Five
       SH3GLB2   -             Endocytosis
       PIP5KL1   -             Endocytosis
       FAM125B   -             Endocytosis
       • For more information – Guernsey et al (2010) PLoS Genetics. 6(8): e1001081
  121. 121. Results: Cutis Laxa• 10 genes prioritized
       Gene      Interactions  Pathway
       HEXDC     Multiple      Phagosome
       HG5       -             Phagosome
       HG5       Multiple      Lysosome, Protein digestion
       SIRT7     Multiple      Metabolic Pathways
       FASN      -             Metabolic Pathways
       DCXR      -             Metabolic Pathways
       PYCR1     -             Metabolic Pathways, Arginine/Proline
       PCYT2     -             Metabolic Pathways
       ARHGDIA   -             Oxidative Phosphorylation
       • For more information – Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9