Bioc4010 lectures 1 and 2
Introduction to NGS and Bioinformatics for Human Disease Applications

Presentation Transcript

  • Next-Generation Sequence Analysis for Biomedical Applications
      BIOC 4010/5010, Lecture 1
      Dr. Dan Gaston, Postdoctoral Fellow, Department of Pathology, Dr. Karen Bedard Lab
      Bioinformatician, IGNITE Project
  • Introduction to Next-Gen Sequencing (LECTURE 1)
  • Overview: Lecture 1
      • Introduction, AKA “Why does this matter?”
      • “Next-Gen” Sequencing
      • Bioinformatics Workflows
      • Types of Next-Gen Experiments
      • Working with the Human Genome
      • Slides available on SlideShare:
          – http://www.slideshare.net/DanGaston
  • Major Areas in Human Disease Genomics
      • Complex diseases
          – Genome-Wide Association Studies (GWAS)
      • Cancer
          – Tumour genomics (driver mutations)
          – Transcriptomics
      • Mendelian disease
          – Whole Genome/Exome Sequencing
          – Transcriptomics
          – Genetic Linkage
  • Diagnosing Genetic Diseases
      • Genetic counselors/physicians order individual testing of genes based on patient phenotype
      • For rare diseases or unusual phenotypes, they may run tens to hundreds of tests
      • …EXPENSIVE (easily thousands of dollars)
  • Genetic Disease Research
  • Genetic Disease Research: Charcot-Marie-Tooth, Chromosome 9: 120,962,282-133,033,431
  • Charcot-Marie-Tooth
      • Linked genomic region ~12 Mb in size
      • Contains 143 genes
      • Prioritize and select genes for individual Sanger sequencing
      • …Slow
      • …Laborious
      • …Can be expensive
  • Personalized Medicine
  • Human Genomics
      • $5,000 - $10,000 to sequence a whole genome
      • $1,000 to sequence only the protein-coding portion (the exome, covered later)
  • Clinical Genomics
      • Rapid diagnosis of genetic disease in NICU cases
      • Quicker and cheaper than sequential genetic testing (the traditional method)
  • Cancer Genomics Welch JS, et al. JAMA, 2011;305, 1577
  • Cancer Chemotherapy Resistance
  • Human Disease Genomics at Dalhousie
      • IGNITE: Identifying genetic mutations causing rare Mendelian diseases in Atlantic Canada
          – 3-year, $2.5 million Genome Canada project
          – Currently working on >10 different diseases, including two inherited cancers
          – Sequenced >20 individual exomes, 4 whole genomes, and several transcriptomes
          – More on Thursday…
      • Dr. Graham Dellaire: Transcriptome sequencing and analysis on multiple cancer cell lines
  • Short Reads
      Millions of paired “short reads”, 75-150 bp each
  • FastQ Format
      Each record contains a read ID, the sequence, and a quality line
  • FastQ Quality Scores
      Q = -10 log10 P, where P is the probability of an incorrect base call

      Quality Score (Q)   Probability of incorrect base call   Base call accuracy
      10                  1 in 10                              90%
      20                  1 in 100                             99%
      30                  1 in 1,000                           99.9%
      40                  1 in 10,000                          99.99%
      50                  1 in 100,000                         99.999%
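The relationship Q = -10 log10 P can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the function names are illustrative, and the quality-line offset of 33 (the common Sanger / Phred+33 encoding) is an assumption, since the slides do not say which encoding the data uses.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong (Q = -10 log10 P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


def decode_quality_line(quality_line: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores.

    The offset of 33 is an assumption (Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in quality_line]


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_prob(q)
        print(f"Q{q}: 1 in {round(1 / p)}, accuracy {100 * (1 - p):.3f}%")
    print(decode_quality_line("II5+!"))
```

Note that Q50 corresponds to 99.999% accuracy, not 100%: a Phred score never implies a zero error probability.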
  • Quality Scores of Sequencing Reads
  • General Genomics Workflow
      • Raw Data Analysis: quality control of raw data
      • Whole Genome Mapping: alignment to the reference genome
      • Variant Calling: detection of genetic variation (SNPs, indels, SVs)
      • Annotation: linking variants to biological information
  • Short Read Mapping
      [Figure: short reads stacked above the reference sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
      1) Report the location in the genome where the read matches best
      2) Minimize mismatches
      3) Mismatches at lower-quality bases are better than mismatches at higher-quality bases
  • Discovering Genetic Variation
      [Figure: reads aligned above and below the reference genome ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG. Reads carrying T where the reference has A (…CGACGTTCC… vs …CGACGATCC…) mark a SNP; reads with a gap (…GATCG CGATC…) mark an indel.]
  • Next-Gen Sequencing Experiments
      • Whole Genome Sequencing
      • Targeted Exome Sequencing
      • RNA-Seq
      • ChIP-Seq
      • CLIP-Seq
  • Composition of Human Genome Size: 3.2 Gb
  • Genomic Content

      Chromosome   Base pairs    Variations   Confirmed proteins   Putative proteins   Pseudogenes   miRNA   rRNA   Misc ncRNA
      1            249,250,621   4,401,091    2,012                31                  1,130         134     66     106
      2            243,199,373   4,607,702    1,203                50                  948           115     40     93
      3            198,022,430   3,894,345    1,040                25                  719           99      29     77
      4            191,154,276   3,673,892    718                  39                  698           92      24     71
      5            180,915,260   3,436,667    849                  24                  676           83      25     68
      6            171,115,067   3,360,890    1,002                39                  731           81      26     67
      7            159,138,663   3,045,992    866                  34                  803           90      24     70
      8            146,364,022   2,890,692    659                  39                  568           80      28     42
      9            141,213,431   2,581,827    785                  15                  714           69      19     55
      10           135,534,747   2,609,802    745                  18                  500           64      32     56
      11           135,006,516   2,607,254    1,258                48                  775           63      24     53
      12           133,851,895   2,482,194    1,003                47                  582           72      27     69
      13           115,169,878   1,814,242    318                  8                   323           42      16     36
      14           107,349,540   1,712,799    601                  50                  472           92      10     46
      15           102,531,392   1,577,346    562                  43                  473           78      13     39
      16           90,354,753    1,747,136    805                  65                  429           52      32     34
      17           81,195,210    1,491,841    1,158                44                  300           61      15     46
      18           78,077,248    1,448,602    268                  20                  59            32      13     25
      19           59,128,983    1,171,356    1,399                26                  181           110     13     15
      20           63,025,520    1,206,753    533                  13                  213           57      15     34
      21           48,129,895    787,784      225                  8                   150           16      5      8
      22           51,304,566    745,778      431                  21                  308           31      5      23
      X            155,270,560   2,174,952    815                  23                  780           128     22     52
      Y            59,373,566    286,812      45                   8                   327           15      7      2
      mtDNA        16,569        929          13                   0                   0             0       2      22
  • Exome Sequencing
  • Transcriptomics: RNA-Seq
      • Sequence the actively transcribed genes in a cell line or tissue
          – Only about 20% of genes are transcribed in a particular cell type
      • Two types:
          – Poly-A selection
          – Total RNA + ribodepletion
      • Many experimental questions can be addressed
  • RNA-Seq: Gene Expression [Figure: Condition 1 vs Condition 2]
  • RNA-Seq: Differential Splicing [Figure: Exon 1, Exon 2, Exon 3]
  • RNA-Seq: Novel/Non-Canonical Exon Discovery [Figure: Exon 1, Exon 2, Exon X, Exon 3]
  • RNA-Seq: Gene Fusion Events [Figure: Exon 1, Exon 2, Exon 3, Gene 2 Exon 4]
  • RNA-Seq
      • Important to take into account biological variability. A sample of cells is a mixed population
          – Replicates!
      • Not suited for discovering polymorphisms, due to higher error rates introduced by the reverse transcription step (RNA -> cDNA)
      • High false positive rates for fusion gene and novel exon discovery when expression levels are low
  • ChIP-Seq
  • Short Read Mapping: Placing Millions of Reads on Human Reference
      • Problem: Efficiently and accurately place millions of reads (75 bp – 200 bp) within the 3.2 Gb reference genome
      • Problem: A read may match equally well at more than one location (pseudogenes, copy number variation, repetitive elements)
      • Problem: Sequencing reads may be paired
  • Short Read Mapping: Brute Force Method
      Conceptually simple: compare each query k-mer to all k-mers of the genome
      • Genome size (N): 3.2 billion bases
      • K-mer length (M): 7
      • Number of comparisons ((N - M + 1) * M): about 22.4 billion
  • Solution: Index the Reference Genome
      Indexing the reference is like constructing a phonebook: quickly move towards the relevant portion of the genome and ignore the rest.
  • Short Read Alignment: Suffix Array
      Split the genome into all suffixes (substrings) and sort them alphabetically
      Allows a query to be searched against an alphabetical reference, skipping 96% of the genome
      Ex: banana

      Suffixes   Sorted
      banana     a
      anana      ana
      nana       anana
      ana        banana
      na         na
      a          nana
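The banana example can be reproduced in a few lines. This is a naive sketch for illustration only: it sorts full suffix strings, which is fine for short text but far too slow and memory-hungry for a 3.2 Gb genome. The function name is hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the 0-based start positions of all suffixes of text, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    text = "banana"
    for pos in build_suffix_array(text):
        # Prints each suffix with its starting position, in alphabetical order.
        print(pos, text[pos:])
```

Storing only the start positions, rather than the suffix strings themselves, is what keeps the structure usable: each suffix can be read back from the original text when needed.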
  • Short Read Alignment: Binary Search
      • Searching the index efficiently is still a problem…
      Search for GATTACA…

      Index #   Sequence            Pos
      1         ACAGATTACC…         6
      2         ACC…                13
      3         AGATTACC…           8
      4         ATTACAGATTACC…      3
      5         ATTACC…             10
      6         C…                  15
      7         CAGATTACC…          7
      8         CC…                 14
      9         GATTACAGATTACC…     2
      10        GATTACC…            9
      11        TACAGATTACC…        5
      12        TACC…               12
      13        TGATTACAGATTACC…    1
      14        TTACAGATTACC…       4
      15        TTACC…              11
  • Binary Search
      • Initialize search range to entire list
          – mid = (hi+lo)/2; middle = suffix[mid]
          – if query matches middle: done
          – else if query < middle: pick low range
          – else if query > middle: pick hi range
      • Repeat until done or empty range
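The binary search steps can be sketched against the GATTACA example from the slides. This is an illustrative sketch, not how production aligners are implemented; the function names are hypothetical, and positions here are 0-based, whereas the slide table is 1-based.

```python
from typing import Optional


def build_suffix_array(text: str) -> list[int]:
    """Naive suffix array: 0-based suffix start positions in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrence(text: str, suffix_array: list[int], query: str) -> Optional[int]:
    """Binary search for one exact occurrence of query; returns a 0-based position or None."""
    lo, hi = 0, len(suffix_array) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        # Compare the query against the first len(query) characters of the middle suffix.
        middle = text[start:start + len(query)]
        if middle == query:
            return start
        if query < middle:
            hi = mid - 1  # pick the low range
        else:
            lo = mid + 1  # pick the high range
    return None


if __name__ == "__main__":
    genome = "TGATTACAGATTACC"
    sa = build_suffix_array(genome)
    print(find_occurrence(genome, sa, "GATTACA"))
```

Each step halves the search range, so a query needs roughly log2(N) suffix comparisons instead of a scan over every position in the genome.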
  • Applied to Human Genome
      • In practice, simple methods of indexing the genome can create very large data structures
          – Suffix Array: > 12 GB
      • Solution: Apply more complex procedures that allow you to index and compress the data:
          – Burrows-Wheeler Transform
          – FM-Index
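To make the Burrows-Wheeler Transform less abstract, here is a deliberately naive sketch that builds it from sorted rotations and then inverts it. It only illustrates the definition; real aligners derive the transform from a suffix array and query it through an FM-index rather than materializing rotations. The `$` terminator is an assumed end-of-text marker that must not occur in the input.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (naive, for short strings only)."""
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Recover the original text by repeatedly prepending the transform and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("banana")
    print(encoded)
    print(inverse_bwt(encoded))
```

The transform tends to group identical characters into runs, which is what makes the indexed genome compressible, and it is reversible, so no information is lost.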
  • Short Read Mapping: Mapping Quality
      • So far we have also ignored the quality scores of reads
      • Mapping Quality (for a read): sum the quality scores at mismatched bases for the best alignment, SUM_BASE_Q(best), and also consider all other possible alignments i:

        $$ MQ = -\log_{10}\left(1 - \frac{10^{-\mathrm{SUM\_BASE\_Q}(best)}}{\sum_i 10^{-\mathrm{SUM\_BASE\_Q}(i)}}\right) $$
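A small numerical sketch can show how the mapping quality behaves. Note the assumptions: this version is Phred-scaled, meaning it divides each quality sum by 10 in the exponent and multiplies the logarithm by 10, to stay consistent with Q = -10 log10 P. Those scaling factors, the cap value, and the function name are choices made here for illustration and are not taken from the slide.

```python
import math


def mapping_quality(mismatch_quality_sums: list[float], cap: float = 60.0) -> float:
    """Phred-scaled mapping quality from per-alignment sums of mismatched base qualities.

    mismatch_quality_sums holds one value per candidate alignment; the best
    alignment is the one with the smallest sum. The cap (an assumed value)
    avoids taking log of zero when only one alignment exists.
    """
    weights = [10 ** (-s / 10) for s in mismatch_quality_sums]
    posterior_best = max(weights) / sum(weights)
    p_wrong = 1.0 - posterior_best
    if p_wrong <= 0.0:
        return cap
    return min(cap, -10 * math.log10(p_wrong))


if __name__ == "__main__":
    print(mapping_quality([20, 60]))  # one clearly better alignment
    print(mapping_quality([30, 30]))  # two equally good alignments
    print(mapping_quality([0]))       # a single, unique alignment
```

Two equally good placements give a mapping quality near 3, reflecting a 50% chance that the reported location is wrong, while a clearly better best alignment gives a high score.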
  • Short Read Aligners
      • BLAT: BLAST-Like Alignment Tool
      • MAQ: First to take into account quality scores
      • BWA: First to use Burrows-Wheeler Transform
      • Bowtie: Ungapped alignment only
      • Bowtie2: Allows indels
      • … and many more
  • Identifying and Annotating Genomic Variation for Disease Gene Discovery (LECTURE 2)
  • Genetic Variation
      • dbSNP (NCBI) catalogues > 53 million Single Nucleotide Variations (SNVs) in humans
          – 38 million validated
          – 22 million in genes
          – 36 million with frequencies
      • 50-80% of mutations involved in inherited disease are caused by SNVs
  • SNP vs SNV
      • Technically, a polymorphism is a variation that doesn’t cause disease and is common in a population
      • What is common?
          – Greater than 5% in a population is a typical definition
          – Definitions of rare range from < 0.5% to < 1.5%
  • Discovering Genetic Variation
      [Figure: reads aligned above and below the reference genome ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG. Reads carrying T where the reference has A (…CGACGTTCC… vs …CGACGATCC…) mark a SNP; reads with a gap (…GATCG CGATC…) mark an indel.]
  • Variant Calling: The Absurdly Simple Way
      [Figure: reads aligned above the reference genome, with one column where some reads show T instead of the reference A]
      • Read depth at base: 10
      • T: 4
      • A: 6
      • Genotype: Heterozygous A/T
  • Variant Calling: The Absurdly Simple Way
      • Algorithm:
          – Count all aligned bases that pass a quality threshold (e.g. >Q20)
          – If the fraction of reads with the alternative base is > lower bound (20%) and < upper bound (80%), call heterozygous alt
          – Else if > upper bound, call homozygous alternative
          – Else call homozygous reference
      • …But what about using base qualities for more than deciding which reads to keep?
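The counting algorithm above is short enough to write out directly. This is a sketch of the slide's thresholds only; the function name, the input representation (one base and one quality per read at the position), and the "no call" outcome for an empty pileup are assumptions added for illustration.

```python
from collections import Counter


def call_genotype(bases: str, quals: list[int], ref_base: str,
                  min_q: int = 20, lower: float = 0.2, upper: float = 0.8) -> str:
    """Call a genotype at one position by counting bases that pass a quality threshold."""
    kept = [b for b, q in zip(bases, quals) if q > min_q]
    if not kept:
        return "no call"  # assumption: the slide does not cover empty pileups
    counts = Counter(kept)
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return "homozygous reference"
    # Use the most frequent non-reference base as the alternative allele.
    alt_base = max(alt_counts, key=alt_counts.get)
    alt_fraction = alt_counts[alt_base] / len(kept)
    if lower < alt_fraction < upper:
        return f"heterozygous {ref_base}/{alt_base}"
    if alt_fraction > upper:
        return f"homozygous alternative {alt_base}/{alt_base}"
    return "homozygous reference"


if __name__ == "__main__":
    # The pileup from the slide: depth 10, six A and four T, reference base A.
    print(call_genotype("AAAAAATTTT", [30] * 10, ref_base="A"))
```

Base qualities are used here only as a pass/fail filter, which is exactly the limitation the last bullet points at.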
  • Improving Variant Calling
      • MAQ (Mapping and Assembling with Quality):
          – Short read mapper and genotype caller
          – First to use base qualities for either
          – Introduced mapping quality
  • Improving Variant Calling
      ① Base quality cannot be more reliable than the mapping quality of the read
      ② An individual can have at most two real nucleotides at a position (two alleles)
          ① Only consider the two most frequent nucleotides
          ② Simplify to two states: A and B
  • Improving Variant Calling
      • Three possible genotypes:
          – AA, BB, AB
      • Construct a model that includes base quality to estimate the probability of error
      • Calculate the probability of each genotype given the data and error rate
      • Genotype with highest probability is called
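As a rough illustration of this idea, the sketch below scores the three genotypes with a simple binomial model. It is a simplification and not the MAQ model itself: it assumes a single error probability for every read (rather than per-base qualities), equal prior probabilities for the three genotypes, and that a heterozygote yields each allele with probability 0.5. All names are hypothetical.

```python
from math import comb


def genotype_posteriors(n_a: int, n_b: int, error: float = 0.01) -> dict[str, float]:
    """Posterior probability of AA, AB, BB given counts of reads showing A and B.

    Assumes one shared error probability and equal priors (both simplifications).
    """
    # Probability that a single read shows allele B under each genotype.
    p_b = {"AA": error, "AB": 0.5, "BB": 1.0 - error}
    k = n_a + n_b
    likelihood = {
        g: comb(k, n_b) * p ** n_b * (1.0 - p) ** n_a
        for g, p in p_b.items()
    }
    total = sum(likelihood.values())
    return {g: value / total for g, value in likelihood.items()}


def most_probable_genotype(n_a: int, n_b: int, error: float = 0.01) -> str:
    """Return the genotype with the highest posterior probability."""
    posteriors = genotype_posteriors(n_a, n_b, error)
    return max(posteriors, key=posteriors.get)


if __name__ == "__main__":
    print(most_probable_genotype(6, 4))  # mixed reads
    print(most_probable_genotype(9, 1))  # a single discordant read
```

Unlike fixed 20%/80% thresholds, the model treats a single discordant read among ten as most likely a sequencing error, and the posterior tells you how confident that call is.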
  • The Model
      [Equation images not preserved in the transcript. Slide annotations: “Reads that match reference”, “Reads that don’t match reference”]
      • g = genotype
      • e = error probability
      • m = ploidy (2)
      • k = number of reads
  • Improving Variant Calling
      • Two widely used tool sets for calling variants
          – samtools (uses MAQ-type calculation)
          – Genome Analysis Toolkit (GATK) UnifiedGenotyper
      • UnifiedGenotyper: capable of calling both indels and single nucleotide polymorphisms (SNPs), and of estimating allele frequencies given multiple samples
  • UnifiedGenotyper
      Apply filters to discard poor reads and remove biases:
      ① Duplicate reads
      ② Malformed reads (i.e. mismatch between number of bases and number of base qualities)
      ③ Bad mate (paired-end sequencing, paired reads map to different chromosomes)
      ④ Mapping quality zero (maps to multiple locations equally well)
      ⑤ Reads must have fewer than 10% mismatches within 20 bp to either side of the position
  • Remove Duplicate Reads

      Application    Avg #Molecules/Library   Read Length   Avg Molecules Sampled   #Molecules Sampled > 1
      30X Genome     5bn                      2x100         450m                    4.4%
      4x Genome      5bn                      2x100         60m                     0.6%
      100x Exome     500m                     2x75          20m                     2.0%

      • Duplicate reads break the assumption of independent sampling from the library
      • Identify reads with identical start/stop positions
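The percentages in the table can be reproduced with a simple sampling model. The slide does not say how they were derived, so the model below is an assumption: molecules are drawn uniformly at random with replacement from the library, and the Poisson approximation gives the expected number of distinct molecules seen. The function name is hypothetical.

```python
import math


def expected_duplicate_fraction(library_size: float, molecules_sampled: float) -> float:
    """Expected fraction of sampled reads that duplicate an already-sampled molecule.

    Assumes uniform sampling with replacement (Poisson approximation).
    """
    rate = molecules_sampled / library_size
    # Expected number of distinct library molecules observed at least once.
    distinct = library_size * (1.0 - math.exp(-rate))
    return 1.0 - distinct / molecules_sampled


if __name__ == "__main__":
    for name, library, sampled in [
        ("30X Genome", 5e9, 450e6),
        ("4x Genome", 5e9, 60e6),
        ("100x Exome", 500e6, 20e6),
    ]:
        fraction = expected_duplicate_fraction(library, sampled)
        print(f"{name}: {100 * fraction:.1f}% duplicates")
```

Under this model the duplicate fraction grows with the ratio of reads sampled to library complexity, which is why deep sequencing of a small library produces many duplicates.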
  • Sequencer-Specific Error Models
      If a base was miscalled, what is it most likely to be called as instead?

                      Predicted base (%)
      Actual base     A       C       G       T
      A               -       57.7    17.1    25.2
      C               34.9    -       11.3    53.9
      G               31.9    5.1     -       63.0
      T               45.9    22.1    32.0    -
  • Variant Calling
      • SNP calls infested with false positives
          – Machine artifacts
          – Mis-mapped reads
          – Mis-aligned indels
      • 5-20% false positive rate
  • Decisions and Trade-Offs
      • Option 1: Use stringent program options for calling variants and hard filtering early, to produce only a highly confident call set.
          – Pro: Few false positives
          – Con: Will miss real variants
      • Option 2: Use less stringent (but reasonable) options and filtering. Produce high-confidence call set. Progressive filtering at later stage
          – Pro: Won’t miss real variants
          – Con: Many more false positives
  • How Good Are My Calls?
      • How many called SNPs?
          – Human average of 1 heterozygous SNP / 1000 bases
      • Fraction of variants already in dbSNP
      • Transition/Transversion ratio
          – Transitions 2x as common
              • 2.8x when looking only at exons
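The transition/transversion check is easy to compute from a list of called SNVs. This is a minimal sketch assuming each call is a (reference base, alternative base) pair with two different bases; the function names and the toy input are illustrative, not real data.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition swaps purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs: list[tuple[str, str]]) -> float:
    """Ratio of transitions to transversions in a list of (ref, alt) calls."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    return transitions / transversions if transversions else float("inf")


if __name__ == "__main__":
    toy_calls = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
    print(ti_tv_ratio(toy_calls))
```

A call set whose ratio falls well below the expected values is a hint that it contains many false positives, since random errors do not favour transitions.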
  • ANNOTATING VARIANTS
  • Identifying Genetic Variation Causing Genetic Disease
  • Discovering Genetic Variants Causing Mendelian Disease
      4 million genetic variants
      → 2 million associated with protein-coding genes
      → 10,000 possibly of disease-causing type
      → 1,500 with <1% frequency in population
      → Single causal genetic variant
  • If a problem cannot be solved, enlarge it. --Dwight D. Eisenhower
  • TYPES OF SINGLE NUCLEOTIDE VARIANTS
  • Disease Genomics: Hunting Down Pathogenic Genetic Variation
      [Figure: reference gene model (Exon 1, Intron 1, Exon 2) with start codon, splice sites, and TAA stop codon, giving the mRNA coding for protein; the patient gene model below shows example variants: splice site loss, missense/frameshift, stop gain, and TAA to TAC (stop to Tyr)]
  • GENETIC REGIONS OF INTEREST
  • Identifying Genetic Regions of Interest
  • Number of Genes in Genomic Regions of Interest
  • FREQUENCY OF GENETIC VARIANTS
  • Frequency of Polymorphisms: Common vs Rare
      • Mendelian disorders are caused by rare variation, < 1-2% frequency in the relevant population
      • Leverage large projects aimed at assessing genetic diversity in populations around the world
          – 1000 Genomes
          – NHLBI Exome Sequencing Project
  • Human Populations
  • Population Matters
      • Most variations in protein-coding genes occurred fairly recently (last 20,000 years)
          – Adaptation to agriculture and diet changes, pathogen exposure and urban living
      • Monogenic diseases have different prevalence in different populations
          – Cystic fibrosis in European populations
          – Hereditary hemochromatosis in Northern Europeans
          – Tay-Sachs in Ashkenazi Jews
          – Sickle-cell anemia in sub-Saharan African populations
      • Polygenic disorders
  • 1000 Genomes Project
  • Exome Sequencing Project
      • Multi-institutional
      • Total possible patient pool of > 250,000 individuals, well phenotyped
          – Includes healthy individuals and diseased
      • Currently 6700 exomes sequenced
          – 4420 European descent
          – 2312 African American
      • 1.2 million coding variations
          – Most extremely rare/unique
          – Many population specific
  • IGNITE Project: Local Controls
      • IGNITE: Tasked with studying rare monogenic diseases identified in Atlantic Canada
      • Atlantic Canada harbours several non-represented population groups and sub-groups…
          – Acadians
          – Native American
          – Non-Acadian/European Descent
  • Population Frequency
      • Mendelian disorders are rare
      • If variation is in database, is it associated with disease?
      • Causal variation also needs to be rare
          – Cutoff somewhere in the < 0.5% to < 1.5% range
          – Should appear rarely or not at all in local controls
          – Should track with disease in family members under study
  • Predicting the Impact of Missense Mutations
      • Most use some level of evolutionary conservation to determine how severe a mutation is
          – SIFT
          – PolyPhen
          – GERP++
          – EvoD
  • Example: SIFT Algorithm
      [Figure: input query sequence → Psi-BLAST → homologs → multiple sequence alignment → PSSM → normalize by most frequent AA → score]
  • Predicting Impact
      • Other approaches include additional features:
          – Protein structure information
          – Site-level annotation (active sites, binding sites, etc.)
          – Protein domain information
          – Biophysical properties of amino acids in that position and of the substituted amino acid
  • Prediction Take-Away
      The more conserved a site is, the more likely any substitution is to be deleterious
      However: current methods have pretty poor performance and are not suitable for clinical-level diagnosis
  • Classifying Genetic Variants
      [Figure: decision tree classifying 4 million variants. Top level: intronic, exonic, intergenic. Lower levels: splice site, unknown, silent mutation, amino acid changing, stop loss / stop gain, missense mutation, known genetic disease variant, known polymorphism in population. Outcome labels: potential disease causing]
  • GENE LEVEL ANNOTATION
  • Annotating Genes and Variants
      • Is the variant in a known protein-coding gene?
          – What does the gene do?
          – What molecular pathways?
          – What protein-protein interactions?
          – What tissues is it expressed in?
          – When in development?
      [Figure: variant filtering funnel: 4 million genetic variants → 2 million associated with protein-coding genes → 10,000 possibly of disease-causing type → 1,500 with <1% frequency in population]
  • Gene Level Annotations
  • ADDING ANNOTATIONS TO VARIANTS
  • Genomic Intervals, Searching, and Annotation
      • The most common way of describing genomic features is as an interval
      • Multiple formats (BED, WIG, VCF, etc.)
      • Common to all is location:
          – Chromosome
          – Start position of feature
          – End position of feature
          – Annotations/info (optional)
  • Searching and Annotating: Interval Trees
      • Interval trees allow efficient searching for all overlapping intervals
      • Easiest to make one tree per chromosome
      • Given a set of n intervals on a number line (chromosome), construct a tree
  • Interval Trees
      • Each node contains:
          – Centre point
          – Intervals sorted by start
          – Intervals sorted by end
      • Left subtree: all intervals to the left of the centre point
      • Right subtree: all intervals to the right of the centre point
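The node layout described above can be sketched as a small centred interval tree. This is an illustrative implementation, not the IGNITE pipeline's code: it assumes closed coordinates with start <= end, picks the median endpoint as the centre point, and uses made-up feature names. One tree per chromosome is kept in a plain dictionary.

```python
from collections import defaultdict


class IntervalNode:
    """Centred interval tree node; intervals are (start, end, label) with closed ends."""

    def __init__(self, intervals):
        points = sorted(p for start, end, _ in intervals for p in (start, end))
        self.center = points[len(points) // 2]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        here = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        # Intervals overlapping the centre point, stored in both sort orders.
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def overlapping(self, q_start, q_end):
        """Return all stored intervals that overlap the query [q_start, q_end]."""
        hits = []
        if q_end < self.center:
            for iv in self.by_start:  # stop at the first interval starting after the query
                if iv[0] > q_end:
                    break
                hits.append(iv)
        elif q_start > self.center:
            for iv in self.by_end:  # stop at the first interval ending before the query
                if iv[1] < q_start:
                    break
                hits.append(iv)
        else:
            hits.extend(self.by_start)  # query spans the centre point
        if self.left and q_start < self.center:
            hits.extend(self.left.overlapping(q_start, q_end))
        if self.right and q_end > self.center:
            hits.extend(self.right.overlapping(q_start, q_end))
        return hits


def build_index(features):
    """Build one tree per chromosome from (chrom, start, end, label) records."""
    per_chrom = defaultdict(list)
    for chrom, start, end, label in features:
        per_chrom[chrom].append((start, end, label))
    return {chrom: IntervalNode(ivs) for chrom, ivs in per_chrom.items()}


if __name__ == "__main__":
    # Hypothetical features, for illustration only.
    index = build_index([
        ("chr9", 100, 200, "featA"), ("chr9", 150, 300, "featB"),
        ("chr9", 400, 500, "featC"), ("chr17", 50, 60, "featD"),
    ])
    print(index["chr9"].overlapping(180, 190))
```

Keeping the node's intervals in both sort orders lets a query stop scanning as soon as it reaches the first interval that cannot overlap, so annotating a variant does not require checking every feature on the chromosome.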
  • IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa (CASE STUDIES)
  • IGNITE Data Pipeline and Integration
      [Figure: inputs are gene annotations, annotated genomic variants, mapped gene region(s) definitions, known genes, and pathway and interactions; steps are filter, sort, prioritize]
  • Brain Calcification
  • Brain Calcification
      • 84 genes in chromosome 5 region
      • No likely homozygous or compound heterozygous variants within the region shared between the two patients
      • 29 genes with at least one targeted region with little or no sequencing coverage
      • Many only lacked coverage in 5’ and 3’ UTRs
      • Collaborators performed statistical tests for possible copy-number variations of targeted regions using exome sequencing data
  • Charcot-Marie-Tooth: Genetic Mapping Chromosome 9: 120,962,282 -133,033,431
  • Cutis Laxa: Genetic Mapping Chromosome 17: 79,596,811-81,041,077
  • Charcot-Marie-Tooth
      • 143 genes in region
      • 13 known causative genes
          – MPZ
          – PMP22
          – GDAP1
          – KIF1B
          – MFN2
          – SOX
          – EGR2
          – DNM2
          – RAB7
          – LITAF (SIMPLE)
          – GARS
          – YARS
          – LMNA
    Cutis Laxa
      • 52 genes in region
      • 5 known causative genes
          – ATP6V0A2
          – ELN
          – FBLN5
          – EFEMP2
          – SCYL1BP1
          – ALDH18A1
  • Pathway and Interaction Data
    Charcot-Marie-Tooth
      • 37 pathways
          – Clathrin-derived vesicle budding
          – Lysosome vesicle biogenesis
          – Endocytosis
          – Golgi-associated vesicle biogenesis
          – Membrane trafficking
          – Trans-Golgi network vesicle budding
      • Primarily LMNA or DNM2
    Cutis Laxa
      • 10 pathways
          – Phagosome
          – Collecting duct acid secretion
          – Lysosome
          – Protein digestion and absorption
          – Metabolic pathways
          – Oxidative phosphorylation
          – Arginine and proline metabolism
      • Primarily ATP6V0A2
  • Results: Charcot-Marie-Tooth
      • 8 genes prioritized

      Gene       Interactions   Pathway
      LRSAM1     Multiple       Endocytosis
      DNM1       DNM2           -
      FNBP1      DNM2           -
      TOR1A      MNA            -
      STXBP1     Multiple       Five
      SH3GLB2    -              Endocytosis
      PIP5KL1    -              Endocytosis
      FAM125B    -              Endocytosis

      • For more information
          – Guernsey et al (2010) PLoS Genetics. 6(8): e1001081
  • Results: Cutis Laxa
      • 10 genes prioritized

      Gene       Interactions   Pathway
      HEXDC      Multiple       Phagosome
      HG5        -              Phagosome
      HG5        Multiple       Lysosome, Protein digestion
      SIRT7      Multiple       Metabolic Pathways
      FASN       -              Metabolic Pathways
      DCXR       -              Metabolic Pathways
      PYCR1      -              Metabolic Pathways, Arginine/Proline
      PCYT2      -              Metabolic Pathways
      ARHGDIA    -              Oxidative Phosphorylation

      • For more information
          – Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9