Sifting the human genome for functional polymorphisms Pauline C. Ng, PhD
From genotype to phenotype
  • Humans are ~99.9% identical to each other
  • Genetic variation causes different phenotypes
Variation around genes is most likely to contribute to phenotype
  • Gene regions shown: upstream, 5’UTR, coding, 3’UTR
  • Coding: nonsynonymous SNPs, variation that causes an amino acid substitution
  • Change in protein function?
Amino acid substitutions can cause disease
  • Gene lesions responsible for disease: amino acid substitutions account for ~50% (Human Mutation 15:45-51)
  • Example: hemoglobin E6V causes sickle-cell anemia
1 SNP / 1000 bp
  • In proteins, synonymous:nonsynonymous is 1:1 (1:2 expected)
  • Conservative:nonconservative is 1:1 (1:2 expected)
  • References: Nat. Genetics 22:231-238, 22:239-247; Science 293:489-93
  • Are nsSNPs in humans selected against?
  • Some of the observed nsSNPs may be involved in disease
Predicting the effect of an amino acid substitution
  • Applications: nonsynonymous SNPs; large-scale random mutagenesis projects
  • Cheap and quick for suggesting experiments
Computational Tools for Predicting AA Substitution Effects
  1) SIFT (Sorts Intolerant From Tolerant): uses sequence. Genome Research 11:863-874, 12:436-446
  2) EMBL: uses structure + sequence + annotation. Human Mol. Gen. 10:591-597
  3) Variagenics: uses structure + sequence. J. Mol. Biol. 307:683-706
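Sequence conservation correlated with intolerance to substitutions
  • $\text{Conservation} = \log_2 20 + \sum_{aa} f_{aa} \log_2 f_{aa}$, where $f_{aa}$ is the frequency of amino acid $aa$ at an alignment position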
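The conservation measure on this slide can be tried on a single alignment column. The following is a minimal sketch, assuming the column is given as a plain string of amino acid letters with no gaps; the function name `conservation` and the example columns are illustrative and not part of SIFT itself.

```python
import math
from collections import Counter

def conservation(column: str) -> float:
    """Conservation in bits: log2(20) + sum over aa of f_aa * log2(f_aa).

    `column` is one alignment position, one letter per sequence (assumed gap-free).
    """
    counts = Counter(column)
    total = sum(counts.values())
    # Each term f * log2(f) is <= 0, so more diversity lowers the score.
    entropy_term = sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(20) + entropy_term

# A fully invariant column reaches the maximum, log2(20) (about 4.32 bits).
invariant = conservation("LLLLLLLL")
# A column with all 20 amino acids equally represented scores about 0 bits.
variable = conservation("ACDEFGHIKLMNPQRSTVWY")
```

High values mark positions where few amino acids are seen, which are the positions expected to be intolerant to substitution.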
SIFT
  1. Choosing sequences: a) database search with the query protein; b) choose closely related sequences
  2. Obtain alignment with related proteins
  3. For each position, calculate scaled probabilities for each amino acid substitution
  4. Scaled probability < cutoff: affects function; > cutoff: tolerated
SIFT: Choosing sequences (figure: # of sequences, 1 to 14)
SIFT: Calculating probabilities (figure: per-position amino acid counts)
  • p_x / p_max < 0.05 => x affects function
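The cutoff rule p_x / p_max < 0.05 can be illustrated with a simplified sketch. This is not SIFT's actual probability model: it assumes raw column frequencies plus an optional uniform pseudocount, whereas the editor's notes say SIFT derives pseudocounts from amino acid distributions observed in a database of alignments. The names `scaled_probabilities` and `predict` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def scaled_probabilities(column: str, pseudocount: float = 0.0) -> dict:
    """Return p_x / p_max for all 20 amino acids at one alignment position.

    Assumption: probabilities are plain counts plus a uniform pseudocount.
    """
    counts = Counter(column)
    raw = {aa: counts.get(aa, 0) + pseudocount for aa in AMINO_ACIDS}
    p_max = max(raw.values())
    return {aa: value / p_max for aa, value in raw.items()}

def predict(column: str, substitution: str, cutoff: float = 0.05) -> str:
    """Apply the slide's rule: scaled probability below the cutoff affects function."""
    scaled = scaled_probabilities(column)
    return "affects function" if scaled[substitution] < cutoff else "tolerated"

# Hypothetical column: 7 Leu, 2 Ile, 1 Val across ten aligned sequences.
column = "LLLLLLLIIV"
ile_call = predict(column, "I")   # 2/7 is well above the cutoff
asp_call = predict(column, "D")   # never observed, so the scaled probability is 0
```

With a zero pseudocount any unobserved amino acid is called damaging, which is why the real method needs pseudocounts and a diverse alignment before the prediction is trusted.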
SIFT output
  Substitution   Probability   Prediction        Confidence
  M24S           0.04          Affect Function   Low
  S82T           0.36          Tolerated         High
  V247A          0.03          Affect Function   High !!!
Confidence is determined by the diversity of sequences in the alignment
  • Ideal case: diverse set of orthologous proteins
  • Low confidence examples: few sequences available; many highly identical sequences
Case Study: LacI
  • (figure: LacI and the lac operon, repressed in the normal state and expressed when lactose is present)
  • 4000 single amino acid substitutions assayed: throughout entire protein; both neutral and affected phenotypes (TIBS 22:334-339)
Prediction on LacI substitutions
  Observed phenotype                           Predicted to affect function   Predicted to be tolerated
  Substitutions that affect protein function   63%                            37% (false -)
  Substitutions that give no phenotype         28% (false +)                  72%
  • Total prediction accuracy: 68% (2726/4004)
  • Pr(observe affected phenotype | predicted to be damaging) = 63%
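The LacI figures can be cross-checked against each other. The sketch below assumes the class sizes given in the editor's notes (1764 substitutions that affect function, 2240 that give no phenotype) and the rounded per-class rates from this slide; because the rates are rounded, the results only approximate the reported 2726/4004.

```python
# Class sizes from the editor's notes; per-class rates from the slide (rounded).
n_affect = 1764        # substitutions that affect function
n_neutral = 2240       # substitutions that give no phenotype
sensitivity = 0.63     # affecting substitutions predicted to affect function
specificity = 0.72     # neutral substitutions predicted to be tolerated

true_pos = sensitivity * n_affect
true_neg = specificity * n_neutral
false_pos = n_neutral - true_neg

# Overall accuracy: correct calls over all assayed substitutions.
accuracy = (true_pos + true_neg) / (n_affect + n_neutral)

# Pr(observe affected phenotype | predicted to be damaging).
ppv = true_pos / (true_pos + false_pos)

print(f"accuracy ~ {accuracy:.2f}, Pr(affected | predicted damaging) ~ {ppv:.2f}")
```

Both values land within a percentage point of the 68% and 63% reported on the slide, which is the agreement one would expect from rounded inputs.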
False negative error: positions not conserved among paralogues
  • Dimer and sugar interfaces are not conserved
False positive error in LacI: surface with unknown function?
SIFTing human variant databases
  Dataset                         Source                Size                         Predicted on               Affect function   Tolerated
  Substitutions in disease        SWISS-PROT            7397 subst., 606 proteins    76% proteins, 71% subst.   69%               31%
  Putative polymorphisms          dbSNP                 5780 nsSNPs, 3005 proteins   60% proteins, 53% subst.   25%               75%
  nsSNPs in normal individuals    Whitehead Institute   185 nsSNPs, 69 proteins      77% proteins, 62% subst.   19%               81%
On functionally neutral substitutions, expected false positive error ~20%
  • dbSNP (putative polymorphisms): 25% predicted to affect function
  • Whitehead Institute (nsSNPs in normal individuals): 19% predicted to affect function
  • Suggests that most nsSNPs are functionally neutral
  • What accounts for the 5% difference?
Account for 5% difference in dbSNP
  • 16 genes with a high fraction of dbSNP variants predicted to affect function
      1) Substitutions found in patients
      2) Substitutions mapped to nonfunctional genes/regions
      3) Substitutions detected in error
  • Supports SIFT as a prediction tool
Mutations in MSHR increase skin cancer
  • Mutations associated with cutaneous malignant melanoma (CMM) [1]; mutations not associated with CMM [1-3]
  • Table (columns: Substitution; Prediction: Tolerated / Affect function): R151C, D294H, R160W
  • Table (same columns): L60V, R163Q, D84E
  • References: [1] Am. J. Hum. Genet. 66:176-186; [2] J. Invest. Dermatol. 116:224-229; [3] J. Invest. Dermatol. 112:512-513
Mutations in PPARα, a candidate gene for diabetes
  • Mutations in diabetics [1]: R127Q, R409T, D304N (table columns: Substitution; Prediction: Tolerated / Affect function)
  • Mutations in nondiabetics [1-5]: V227A, A268V, L162V*** (same columns)
  • *** In diabetics and controls, but increases cholesterol levels in diabetics and perhaps nondiabetics [2-4]
  • SIFT will detect what has been selected against in evolution; an inappropriate assay may fail to detect it
  • References: [1] Am. J. Hum. Genet. 63:abs997; [2] Diabetologia 43:673-680; [3] Diabetes Metab. 26:393-401; [4] J. Lipid Res. 41:945-952; [5] J. Hum. Genet. 46:285-288
Mutations in MTHFR
  • Mutations with diminished enzyme activity [1-5]: A222V, E429A (table columns: Substitution; Prediction: Tolerated / Affect function)
  • Unknown effect: R68Q
  • Annotations on the slide: common; under balancing selection; increases neural tube defects; reduces risk for some types of leukemia; found by contig comparison
  • References: [1] Nat. Genet. 10:111-113; [2] PNAS 96:12810-12815; [3] PNAS 98:4004-4009; [4] Cancer Res. 57:1098-1102; [5] Mol. Genet. Metab. 64:169-172
dbSNP variants from patients
  • Can distinguish patients from controls
      Individuals with disease: 18/22 predicted to be damaging
      Control individuals: 9/10 predicted to be functionally neutral
  • SIFT detects what’s selected against in evolution and is independent of assay (example: PPARα)
  • Detects substitutions that are deleterious in the context of the protein, not the organism
      Can detect nsSNPs with minor effects on phenotype: genes that increase risk of skin cancer, diabetes, cholesterol levels
      The protein need not be essential because SIFT predicts on the substitution
  • Can detect nsSNPs under balancing selection (example: MTHFR)
16 genes with a high fraction of dbSNP variants predicted to affect function
  1) Substitutions found in patients: changes found in patients confirm SIFT prediction and its sensitivity
  2) Substitutions mapped to nonfunctional genes/regions: unlikely to affect human health
  3) Substitutions detected in error: irrelevant to human health
Comparison of Prediction Tools
  • (figure: bar charts comparing SIFT, Variagenics and EMBL on disease substitutions, LacI, SNP databases and normal individuals; panels: substitutions that affect function, substitutions that do not affect function, polymorphisms)
  • SIFT has similar prediction accuracy to tools that use structure
SIFT, a prediction tool for the effect of substitutions
  • http://blocks.fhcrc.org/sift/SIFT.html
  • Prediction is based only on sequence
  • Detect damaging nsSNPs on a large scale
Association studies for finding disease loci
  • Direct approach
      SNPs likely to affect gene function
      Association leads directly to candidate gene
      Fewer SNPs to genotype
  • Indirect approach using haplotypes
      tagSNPs to identify common haplotypes in a region
      Relies on LD with causal variant
      Genotype 200K-1 million SNPs
  • Example haplotypes (figure): AATACGAT (x3), GATACAAC (x3)
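The tagSNP idea behind the indirect approach can be shown with the two haplotype sequences on this slide. The sketch below is a toy illustration only, assuming phased, equal-length, biallelic haplotypes; the function names are hypothetical, and real tagSNP selection uses LD statistics over far larger samples.

```python
# The six example haplotypes from the slide: three copies of each sequence.
haplotypes = ["AATACGAT"] * 3 + ["GATACAAC"] * 3

def variable_sites(haps):
    """Positions (0-based) where more than one allele is observed."""
    return [i for i in range(len(haps[0])) if len({h[i] for h in haps}) > 1]

def allele_pattern(haps, site):
    """Which haplotypes carry the same allele as the first haplotype at `site`."""
    reference = haps[0][site]
    return tuple(h[site] == reference for h in haps)

def tag_snps(haps):
    """Group sites with identical patterns (perfect LD) and keep one tag per group."""
    groups = {}
    for site in variable_sites(haps):
        groups.setdefault(allele_pattern(haps, site), []).append(site)
    return [sites[0] for sites in groups.values()], groups

sites = variable_sites(haplotypes)
tags, groups = tag_snps(haplotypes)
```

Here the three variable positions always travel together, so genotyping one of them identifies the haplotype; a causal variant among the untyped positions would still be detected through LD.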
Feasibility of direct approach
  • Have we identified all the causative variants?
      Common variant, common disease hypothesis
      dbSNP contains 80% of common SNPs in Europeans and 50% of common SNPs in Africans (Nat Genet. 33:518-21)
  • What types of variants are involved in disease?
      nsSNPs and splicing variants account for a large proportion of Mendelian disease
      Regulatory variation has a role in disease
        ~50% of genes show allele-specific expression (Science 297:1143; Hum. Genet. 113:149–153)
        ~1/3 of promoter variants may alter gene expression (Hum. Mol. Genet. 12:2249–2254)
SNPs in and near genes
  1. nsSNPs: protein function
  2. SNPs near intron/exon boundary: splicing
  3. UTRs and promoter region: regulation
  4. Synonymous SNPs: possible effect? In LD with causative variant?
Double-hit and known-frequency SNPs in genes
  • Covers 20,024 genes
Non-genic regions could potentially harbor disease variants
  • ~70% of bases in conserved sequences are noncoding (Genome Res. 13:2507-18): regulatory elements, noncoding RNAs, unknown genes
  • 41,193 SNPs in noncoding conserved regions (>= 80% identity with mouse)
Adding SNPs in conserved regions improves SNP density
Focusing on variation in functional regions
  • Large # of SNPs makes direct approach possible
  • If causative variant is not in set, it may be in LD with another SNP in the functional region
  • Concentrating on functional regions allows interesting experiments: genotyping, DNA copy number, allele-expression differences
  • Complementary to the indirect approach using haplotypes
Acknowledgments (FHCRC): Steve Henikoff, Jorja Henikoff, Henikoff Lab


Editor's Notes

  • #5 Hemoglobin is a tetramer of 2 alpha and 2 beta subunits. Structure on left from: Daniel J. Harrington, Kazuhiko Adachi, William E. Royer, Jr. The High Resolution Crystal Structure of Deoxyhemoglobin S. The Journal of Molecular Biology 272(3):398-407, September 1997. http://web.wi.mit.edu/proteins/pub/BOA-2000/left.htm
  • #6 Start off with this slide by defining SNPs. 529 SNPs/Mb in exons, 921 SNPs/Mb in introns. Nucleotide diversity.
  • #8 40% of proteins belong to a family; 70% have at least one other match.
  • #9 (information content without correction) “In general,” “there are some exceptions”
  • #12 Pseudocounts are based on prior knowledge of the most common amino acid distributions observed in a database of many protein alignments. Probabilities are calculated for every amino acid at every position. Example: position 5Y, aa allowed: all 20.
  • #14 Ideal case: a variety of amino acids have had the time to evolve at positions not important for function. Many highly identical sequences: e.g. viral proteins, Ig’s (can be fixed by going to a smaller database).
  • #16 1764 substitutions that affect function; 2240 substitutions that give no phenotype. Mention that intermediate phenotypes are grouped with null. 15% better total prediction accuracy; 10% increase in experimental prediction accuracy.
  • #18 White: tolerate >= 6 substitutions in assay. Red: positions with high false positive error.
  • #19 Which genes are in the Whitehead SNPs? Candidate genes for coronary artery disease, type II diabetes, schizophrenia.
  • #22 Substitutions were first identified in patients and then deposited into dbSNP. Thus it makes sense that the substitutions should be predicted as damaging.
  • #30 When purine repressor, a LacI paralogue, was used for prediction on LacI, Variagenics correctly predicted only 19% of the substitutions that have an effect as damaging.
  • #32 There are two genetic approaches that make use of the variation around genes to find disease loci. Haplotypes may be stronger predictors of phenotype (mirvana, chakravarti). A haplotype is a set of alleles grouped together, i.e. a group of SNPs that are linked together. tagSNPs are the most informative. Neil Risch: reduced positive with direct approach.
  • #33 Is the direct approach possible? Hoogendoorn, Bastiaan used reporter gene assays in cell lines. We have used denaturing high performance liquid chromatography to screen the first 500 bp of the 5' flanking region of 170 opportunistically selected genes identified from the Eukaryotic Promoter Database (EPD) for common polymorphisms. Using a screening set of 16 chromosomes, single-nucleotide polymorphisms were found in approximately 35% of genes. It was attempted to clone each of these promoters into a T-vector constructed from the reporter gene vector pGL3. The relative ability of each promoter haplotype to promote transcription of the luciferase gene was tested in each of three human cell lines (HEK293, JEG and TE671) using a co-transfected SEAP-CMV plasmid as a control. The findings suggest that around a third of promoter variants may alter gene expression to a functionally relevant extent.
  • #34 Causal variant may not have been identified. 80% of common SNPs identified in Europeans, 50% in Africans (Nickerson). Rare variants. Some genes have no coverage: there may be no nsSNP, or it has not yet been identified and deposited in dbSNP.
  • #35 dbSNP 120: 3.5 million double-hit SNPs and SNPs with frequencies.