Next Generation Sequencing Bioinformatics:
From Data To Precision Medicine
April 11, 2017
Gabe Rudy, Vice President of
Product and Engineering
My Background
 Golden Helix
- Founded in 1998
- Genetic association software
- Clinical lab variant analysis software
- Thousands of users worldwide
- Over 1000 customer citations in journals
 Products I Build with My Team
- VarSeq
- Annotate and filter variants in gene panels, exomes and
genomes for clinical labs and researchers.
- SNP & Variation Suite (SVS)
- SNP, CNV, NGS tertiary analysis
- Import and deal with all flavors of upstream data
- GenomeBrowse (Free!)
- Visualization of everything with genomic coordinates.
All standardized file formats.
Sequencers: Versatile tools for science
Genomics is Big Data
 5,000 public data repositories
 Broad Institute:
- Process 40K samples/year
- 1000 people
- 51 High Throughput Sequencers
- 10+ PB of storage
 1 Genome in Data
- ~300GB Compressed Sequence Data
- ~150MB Compressed Variant Data
- Seq data went through 5-6 steps
Next Generation Sequencing Analysis
Primary Analysis
 Analysis of hardware generated data, software built by vendors
 Use FPGAs and GPUs to handle real-time optical or electrical signals
from sequencing hardware
Secondary Analysis
 Filtering/clipping of “reads” and their qualities
 Alignment/Assembly of reads
 Recalibrating, de-duplication, variant calling on aligned reads
Tertiary Analysis (“Sense Making”)
 QA and filtering of variant calls
 Annotation (querying) variants to databases, filtering on results
 Merging/comparing multiple samples (multiple files)
 Visualization of variants in genomic context
 Statistics on matrices
BWA+GATK Best Practices Pipeline
Agenda
 Bioinformatics 101: Pipelines and File Formats
 Data Access Patterns: Databases or Flat Files?
 Big Data Tables: Tricks from Data Warehousing
 A Genomic Index: Specialized R-Trees, Bins, NC-Lists
 Questions
Genomic Data Lives in 1D Coordinate Space
FASTQ
 Contains 3 things per read:
- Sequence identifier (unique)
- Sequence bases [len N]
- Base quality scores [len N]
 Often “gzip” compressed (fq.gz)
 If not demultiplexed, first 4 or 6bp
is the “barcode” index. Used to
split lanes out by sample.
 Filtering may include:
- Removing adapters & primers
- Clip poor quality bases at ends
- Remove flagged low-quality reads
@HWI-ST845:4:1101:16436:2254#0/1
CAAACAGGCATGCGAGGTGCCTTTGGAAAGCCCCAGGGCACTGTGGCCAG
+
Y[SQORPMPYRSNP_][_babBBBBBBBBBBBBBBBBBBBBBBBBBB
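The record layout above (identifier, bases, one quality character per base) can be sketched in a few lines of Python. This is a minimal illustration, not a production parser: the names `FastqRecord`, `read_fastq` and `clip_low_quality_tail` are made up for this example, it assumes strict 4-line records, and the Phred offset is a parameter because it depends on the instrument software (33 for current Illumina output, 64 for some older pipelines).

```python
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    bases: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a text handle holding strict 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) ends the stream
            return
        bases = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality_text = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        if len(bases) != len(quality_text):
            raise ValueError(f"bases/quality length mismatch in {header!r}")
        scores = [ord(ch) - phred_offset for ch in quality_text]
        yield FastqRecord(header[1:], bases, scores)


def clip_low_quality_tail(record: FastqRecord, min_quality: int = 20) -> FastqRecord:
    """Drop bases from the 3' end while their quality is below min_quality."""
    end = len(record.bases)
    while end > 0 and record.qualities[end - 1] < min_quality:
        end -= 1
    return record._replace(bases=record.bases[:end], qualities=record.qualities[:end])
```

For gzip-compressed input, `gzip.open(path, "rt")` from the standard library returns a text handle that can be passed to `read_fastq` directly.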
Aligners
 1981 Smith and Waterman
- Dynamic programming algorithm
- Finds optimal local alignment of two sequences
- Seq of length m and n, O(mn) time required
 Hashing-Based Aligners (2008)
- SOAP, Eland, MAQ
- ~14GB RAM to use with human
 Burrows Wheeler Transform Aligners (2009)
- BWA, Bowtie, SOAP2
- Order of magnitude less RAM and Time
 Hybrid Aligners (2012/13)
- RTG, BWA-MEM, Bowtie2, Isaac
- Seed and expand
- Handle longer reads (>100bp) with larger gaps
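To make the O(mn) cost of Smith-Waterman concrete, here is a small sketch of the scoring recurrence. It returns only the best local alignment score (no traceback), uses a linear gap penalty, and the scoring values are arbitrary defaults chosen for illustration rather than anything from the original paper or from a real aligner.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b, computed in O(len(a) * len(b)) time."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a fresh local alignment
                previous[j - 1] + substitution,   # align a[i-1] with b[j-1]
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best
```

Every cell of the m x n matrix is visited once, which is why aligning millions of reads against a 3 billion base reference this way is impractical and why the index-based aligners listed above exist.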
BWT
Backtracking – query ‘ggta’ with 1 mismatch
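The backtracking idea can be sketched with a toy FM-index: search the query right to left, and at each step try every base, spending one unit of a mismatch budget when the base differs from the query. The reference string used in the usage note is hypothetical (the slide's figure is not reproduced here), and the naive suffix sorting and full occurrence table are only suitable for tiny inputs; real aligners use sampled, compressed versions of these structures.

```python
from typing import Dict, List, Tuple


def build_fm_index(text: str) -> Tuple[List[int], Dict[str, int], Dict[str, List[int]]]:
    """Return (suffix array, C table, Occ table) for text plus a '$' terminator."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)  # i == 0 wraps to '$'
    alphabet = sorted(set(text))
    # C[c] = number of characters in text that sort strictly before c
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, c_table, occ


def search(query: str, index, max_mismatches: int = 0) -> List[int]:
    """0-based start positions where query occurs with at most max_mismatches."""
    suffix_array, c_table, occ = index
    hits = set()

    def backtrack(i: int, lo: int, hi: int, budget: int) -> None:
        if lo >= hi:          # empty suffix-array interval: dead end
            return
        if i < 0:             # whole query consumed: every row in [lo, hi) is a hit
            hits.update(suffix_array[lo:hi])
            return
        for c in c_table:
            if c == "$":
                continue
            cost = 0 if c == query[i] else 1
            if cost <= budget:
                backtrack(i - 1, c_table[c] + occ[c][lo],
                          c_table[c] + occ[c][hi], budget - cost)

    backtrack(len(query) - 1, 0, len(suffix_array), max_mismatches)
    return sorted(hits)
```

With the made-up reference `acggtacgcta`, `search("ggta", index)` finds the exact occurrence at position 2, and allowing one mismatch also reports position 7, where the reference reads `gcta`.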
SAM/BAM
 Spec defined by samtools author
Heng Li, aka Li H, aka lh3.
 SAM is text version (easy for any
program to output)
 BAM is binary/compressed version
with indexing support
 Alignment encoded (as a CIGAR string) in terms of
matches, insertions, deletions,
gaps and clipping
 Can have any custom flags set by
analysis program (and many do)
Key Fields
 Chr, position
 Mapping quality
 CIGAR
 Name/position of mate
 Total template length
 Sequence
 Quality
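As a rough sketch of how the key fields fit together, the snippet below splits one SAM text line into its mandatory columns and uses the CIGAR string to work out how many reference bases the alignment covers. The function names are illustrative only; the column order and the set of CIGAR operators follow the SAM specification, and optional tags are kept as raw strings.

```python
import re
from typing import Dict, List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operators that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operators that advance along the read


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '8M2I4M' into [(8, 'M'), (2, 'I'), (4, 'M')]; '*' means unavailable."""
    if cigar == "*":
        return []
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(count), op) for count, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar: str) -> int:
    """Number of read bases implied by the CIGAR (should equal len(SEQ))."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment line into the 11 mandatory fields plus raw tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return {
        "name": cols[0], "flag": int(cols[1]), "chrom": cols[2],
        "pos": int(cols[3]),          # 1-based leftmost mapped position
        "mapq": int(cols[4]), "cigar": cols[5],
        "mate_chrom": cols[6], "mate_pos": int(cols[7]),
        "template_length": int(cols[8]),
        "sequence": cols[9], "quality": cols[10],
        "tags": cols[11:],
    }
```

The last aligned reference base of a record is at `pos + reference_span(cigar) - 1`, which is the interval a genome browser needs when it draws the read.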
Variant Callers
 Samtools
- “mpileup” command computes BAQ, performs local realignment
- Many filters can be applied to get high-quality variants
 GATK
- More than just a variant caller, but UnifiedGenotyper is widely used
- Also provides pre-calling tools like local InDel realignment and quality
score recalibration
 FreeBayes
 Custom tools specific to platform:
- CASAVA includes a variant caller for Illumina whole-genome data
- Ion Torrent has a caller that handles InDels better for their tech
 Commercial:
- Real Time Genomics
- Arpeggi
VCF
 Specification defined by the 1000 genomes
group (now v4.1)
 Commonly compressed and indexed with
bgzip/tabix (allows for reading directly by a
Genome Browser)
 Contains arbitrary data per “site” (INFO
fields) and per sample
 Single-Sample VCF:
- Contains only the variants for the sample.
 Multi-Sample VCF:
- Whenever one sample has a variant, all samples get
a “genotype” (often “ref”)
 Caveat:
- VCF requires a reference base to be specified, leaving
insertions “encoded” 1bp differently than they
are annotated
- Various opinions on how to encode CNV/SV
- gVCF is a VCF file with lines for tracts of ref match
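The insertion caveat is easiest to see with a small helper that strips the shared leading bases from a VCF REF/ALT pair. This is only a sketch of the idea: `trim_shared_prefix` is a hypothetical name, it does not left-align or otherwise normalize the variant, and annotation sources differ in exactly which coordinate they assign to an insertion.

```python
from typing import Tuple


def trim_shared_prefix(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Remove leading bases common to REF and ALT, shifting the position to match.

    VCF writes an insertion of 'TG' as REF='A', ALT='ATG' at the position of the
    anchoring reference base; trimming exposes the inserted bases themselves.
    """
    shared = 0
    while shared < len(ref) and shared < len(alt) and ref[shared] == alt[shared]:
        shared += 1
    return pos + shared, ref[shared:], alt[shared:]
```

When matching VCF records against an annotation catalog, both sides need to be brought to the same representation first, otherwise the same insertion or deletion looks like two different variants one base apart.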
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Sa
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequen
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral All
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membershi
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 members
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have d
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Q
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype
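A data line under a header like this one has eight fixed columns, an optional FORMAT column, and then one column per sample. The sketch below splits such a line into a dictionary; it is deliberately simplified (the function names are made up, values are kept as strings rather than converted using the header's Type declarations, and flags such as DB become `True`).

```python
from typing import Dict, List


def parse_info(text: str) -> Dict[str, object]:
    """Parse 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    info: Dict[str, object] = {}
    if text == ".":
        return info
    for entry in text.split(";"):
        key, has_value, value = entry.partition("=")
        info[key] = value if has_value else True  # bare keys are flags
    return info


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Split one tab-delimited VCF data line (not a '#' header line)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    record: Dict[str, object] = {
        "chrom": cols[0],
        "pos": int(cols[1]),       # 1-based position of the first REF base
        "id": cols[2],
        "ref": cols[3],
        "alts": cols[4].split(","),
        "qual": None if cols[5] == "." else float(cols[5]),
        "filter": cols[6],
        "info": parse_info(cols[7]),
        "samples": [],
    }
    if len(cols) > 9:
        keys = cols[8].split(":")  # FORMAT column names the per-sample fields
        samples: List[Dict[str, str]] = [
            dict(zip(keys, sample.split(":"))) for sample in cols[9:]
        ]
        record["samples"] = samples
    return record
```

In a multi-sample file the `samples` list has one entry per sample column, so a site where only one sample carries the variant still yields a genotype (often `0/0` or `0|0`) for every other sample.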
Visualization
 Genome browsers:
- Validate variant calls
- Look at gene annotations,
problematic regions, population
catalogs
- Compare samples where no
variant called
 Free Genome Browsers:
- IGV
- Popular desktop by Broad
- UCSC
- Web-based, most extensive
annotations
- GenomeBrowse
- Designed to be publication ready
- Smooth zoom and navigation
Sample Variant Analysis Workflow
VCF file goes in
→ Filter out common and low-quality variants
→ Filter by inheritance or zygosity state
→ Reduce to non-synonymous
→ Prioritize remaining variants?
 Many NGS tertiary analysis
workflows follow a system of
annotation-based filtering
 Common to have a long list of
candidate variants
 Variants need to be prioritized
for validation experiments
 Prioritizing those candidates
is extremely important, but can
be a very difficult process
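The annotation-based filtering chain can be sketched as a sequence of simple predicates over annotated variants. Everything here is illustrative: the `Variant` fields, the effect labels and the thresholds (quality 30, 1% population frequency) are assumptions for the example, not values recommended by any particular tool or guideline.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence


@dataclass
class Variant:
    name: str
    quality: float               # caller-reported quality
    population_frequency: float  # highest frequency seen in any catalog
    zygosity: str                # e.g. "het" or "hom_alt"
    effect: str                  # e.g. "missense", "synonymous", "stop_gained"


def filter_candidates(
    variants: Iterable[Variant],
    min_quality: float = 30.0,
    max_frequency: float = 0.01,
    zygosity: Optional[str] = None,
    keep_effects: Sequence[str] = ("missense", "stop_gained", "frameshift"),
) -> List[Variant]:
    """Apply the workflow's filters in order and return the surviving candidates."""
    # Step 1: drop common and low-quality variants
    survivors = [v for v in variants
                 if v.quality >= min_quality
                 and v.population_frequency <= max_frequency]
    # Step 2: optionally keep only one zygosity state
    if zygosity is not None:
        survivors = [v for v in survivors if v.zygosity == zygosity]
    # Step 3: reduce to protein-changing effects
    survivors = [v for v in survivors if v.effect in keep_effects]
    # Step 4 (prioritization) is left open; rarest-first is one simple ordering
    return sorted(survivors, key=lambda v: v.population_frequency)
```

The sort at the end is only a placeholder for the hard part of the workflow: deciding which of the remaining candidates deserve a validation experiment.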
Annotation with Public Data
 Public Annotation Data and Tools
- Most produced through academic research or consortia
- Centralized hosting on NCBI, Ensembl, UCSC
 Important categories:
- Population catalogs: how common is a variant?
- Gene/Transcript: is a variant in a gene and how does it change the gene?
- In-silico predictions: How likely is a variant to impair the gene's function?
- Knowledgebase: What do we know about particular variants/genes in human diseases?
Population Catalogs
 1000 Genomes (WGS, Exome, SNP Array)
- Many releases, most recent now
standardized, still incrementally updated
- 2,500 genomes – Phase3
 “ESP” (NHLBI 6,500 Exomes) (a.k.a EVS)
- Had many releases, now V2-SSA137 0.0.30
- European American / African American only
 ExAC (Broad 61,486 Exomes v0.3)
- Many sub-populations
 Supercentenarians (110+ yo, 17 WGS)
- Available as raw Complete Genomics data
- Requires normalizing to match Illumina NGS
In-Silico Predictions
 Non-synonymous functional
predictions
- SIFT, PolyPhen-2, LRT, MutationTaster,
MutationAssessor, FATHMM
 Conservation
- GERP++, PhyloP, phastCons
 All-In-One Scores
- CADD, VAAST, VEST3, DANN, FATHMM-MKL,
MetaSVM and MetaLR
- Use machine learning, “feature selection”,
train and predict on public databases
- Can predict effects of synonymous and intergenic variants
 dbNSFP 3.0 – 82M precomputed scores
- N of 6 Voting on prediction algorithms
 RNA Splicing Effect (dbscSNV)
- 5+ splice algorithms, can pre-compute
- −3 to +8 at the 5’, −12 to +2 at the 3’
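One way to read "N of 6 voting" is as a simple tally: count how many of the functional prediction algorithms call a variant damaging, out of those that produced a call at all. The sketch below implements that reading; which six algorithms take part and how their raw scores are turned into damaging/tolerated calls are assumptions here, and the example algorithm names are borrowed from the list above purely for illustration.

```python
from typing import Dict, Optional, Tuple


def vote_damaging(predictions: Dict[str, Optional[bool]]) -> Tuple[int, int]:
    """Return (damaging votes, algorithms that made a call).

    predictions maps an algorithm name to True (damaging), False (tolerated)
    or None (no prediction available for this variant).
    """
    called = {name: call for name, call in predictions.items() if call is not None}
    damaging = sum(1 for call in called.values() if call)
    return damaging, len(called)
```

For example, a variant with calls `{"SIFT": True, "PolyPhen-2": True, "LRT": False, "MutationTaster": True, "MutationAssessor": None, "FATHMM": False}` tallies as 3 damaging votes out of 5 calls; keeping the denominator visible avoids treating a missing prediction as a "tolerated" vote.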
Disease Knowledgebases
 ClinVar
- Voluntary submissions from labs
- Use 5-tier classification (variant + phenotype
pairs)
- Star-rating of variants
- Lab owns submission, can revoke and
monitor status
 ClinVitae (Invitae curated, not updated)
 OMIM
- Gene to Phenotype documentation
- Expert curation of literature, hand updated
- Changes dynamically
- Small list of cited / implicated variants
 HGMD
- Commercially supported
- Best linkage of (possible) publication to
variant/genes
- Classifications not directly trusted
 Your own Lab (more later)
My Exome Case Study 1:
Hemizygous OTC Pathogenic
X:38226614 - G/A
• Novel in all Population Catalogs… except ExAC’s ~60K exomes!
X:38226614 - G/A
• Recent Addition to ClinVar:
• 2013-05-09 G/A - Untested with Disease Unspecified
• 2014-03-03 G/A – Changed to “Pathogenic with not_provided”
1 Citation:
X:38226614 - G/A
• Cited PubMed article was on ResearchGate, Hiroki Morizono contacted
• Provided full text and lots of interesting backstory on OTC
• “If you are able to eat all the steak you want, you may have the mutation; it would
appear to be a hypomorphic allele (and a very mild one at that)”
• “Is possible that the late onset case that [was] identified may have been someone
who was having a very bad day, and several things went poorly for them.”
• “The R40H mutation, there was a grandfather or granduncle who was affected who
ate whatever he wanted, and seemed unaffected while the proband had several
episodes.”
X:38226614 - G/A
• Most likely partial penetrance, with potential risk of triggering with shock event
• The Glycine is conserved down to Opossum (Platypus and Zebrafish have an Alanine)
Live Analysis
Questions?
The Central Dogma of Molecular Biology
“The central dogma of molecular biology deals with
the detailed residue-by-residue transfer of
sequential information. It states that such
information cannot be transferred back from protein
to either protein or nucleic acid.”
-- Francis Crick, 1958
 In other words:
- DNA is transcribed to RNA
- RNA is translated to create proteins
- Unidirectional process
 Protein is where damaging effects of a DNA
mutation will be observed
 Functional prediction algorithms are based
almost entirely on protein sequences
Image from Wikimedia Commons, Dhorspool
Transcription
 Transcription is the process by which an RNA transcript is created from DNA
within the cell nucleus before moving to the cytoplasm
 Includes splicing exons
together to create
meaningful
transcripts
 The complete collection
of mRNA transcripts in
a given cell or tissue is
often called the
“transcriptome”
Image from genome.gov
Translation
 mRNA transcripts are converted to
amino acid sequences via the
translation process
 Think of it as a different language;
nucleic acids versus amino acids
Images from genome.gov and WikiMedia Commons, by ladyofhats
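The two "languages" can be made concrete with a short sketch that transcribes a coding-strand DNA sequence to mRNA and translates it codon by codon using the standard genetic code. It ignores splicing, strand orientation and alternative codon tables, and the codons in the usage note are hypothetical examples rather than the actual sequence of any gene discussed in this talk.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order; '*' marks a stop
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA in frame 0 until the first stop codon or the end."""
    dna_like = mrna.upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna_like) - 2, 3):
        amino_acid = CODON_TABLE[dna_like[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

A single G to A change in the middle base of an arginine codon such as `CGT` gives `CAT`, which translates to histidine: `translate(transcribe("CGT"))` returns `"R"` and `translate(transcribe("CAT"))` returns `"H"`. This is the kind of non-synonymous change that tertiary analysis tries to interpret.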
Amino Acid Properties
 Amino Acids are distinguished by their
respective residues (aka side-chains
or R-groups)
 Residues are classified by polarity,
volume, hydrophobicity and other
physicochemical properties
Images from WikiMedia Commons,
by YassineMrabet and DanCojocari
Levels of Protein Structure
 Primary Structure
- Linear sequence of amino acids
 Secondary Structure
- Interaction between amino acids via hydrogen
bonding results in regular substructures called
alpha helices and beta sheets
 Tertiary Structure
- The final three-dimensional form of an amino
acid chain
- Is influenced by attractions between
secondary structures
 Quaternary Structure
- Several tertiary structures may interact to
form quaternary structures
Image from WikiMedia Commons, ladyofhats
From Structure to Function
 Proteins include various types of functional domains, binding sites and other
surface features
- This determines how the protein interacts with other molecules
 Replacing certain amino acids may have drastic effects on the protein structure
- Thereby affecting the protein function
http://www.vanderbilt.edu/vicb/DiscoveriesArchives/g_protein_receptor.html
 If we know how the protein
structure is affected by an
amino acid substitution, we
can make a good guess
about functional
consequences.
 The problem is that we
don’t know the wild-type
3D structure of most
proteins.
Using Primary Structure as Proxy for Tertiary
 83% of disease-causing mutations affect
stability of proteins (Wang and Moult, 2001)
 90% of disease-causing mutations can be
detected using structure and stability
 Many human proteins have numerous
homologs:
- Paralogs: Separated by a gene duplication
event
- Orthologs: Separated by speciation
 Don’t know the exact structure of most
proteins, but we can compare amino acid
sequences to identify domains and motifs
conserved by evolution
 Disease causing mutations are
overrepresented at conserved sites in the
primary structure (Miller and Kumar, 2001)
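Comparing amino acid sequences across homologs comes down to looking at columns of a multiple sequence alignment. The sketch below scores each column by the fraction of sequences that share its most common residue. It is a deliberately crude stand-in for the conservation methods named earlier (GERP++, PhyloP, phastCons), which model evolutionary rates on a phylogeny; the aligned sequences in the usage note are invented.

```python
from collections import Counter
from typing import List, Sequence


def column_conservation(aligned: Sequence[str]) -> List[float]:
    """Per-column fraction of sequences carrying the most common residue.

    aligned holds equal-length rows of a multiple alignment; '-' marks a gap
    and never counts as a conserved residue.
    """
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    scores = []
    for column in zip(*aligned):
        residues = [residue for residue in column if residue != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores
```

For the toy alignment `["MKGA", "MKGS", "MRGA", "MKGT"]` the scores are `[1.0, 0.75, 1.0, 0.5]`: a substitution in the first or third column hits a fully conserved site and is a more plausible candidate for a damaging change than one in the last column.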

CS Lecture 2017 04-11 from Data to Precision Medicine
