Statistical Analysis of Protein Mutations - Copy

Developing Metrics to Discern
Apparent Study Power in
Protein Mutation Distributions
Anna Blendermann
Mentor: Arlin Stoltzfus

Deep Mutational Scanning
• Deep mutational scanning: a technique that uses high-throughput
DNA sequencing to analyze protein mutations
• Last month, an article appeared in Genetics with results
on 2000 mutants of the BRCA1 gene, which is linked to
breast cancer (http://www.genetics.org/content/200/2/413)

The Inadequacy of Recent Studies
Understanding the effects of mutations is a major
challenge in genomics, evolution, and medicine
Recent studies show…
• An unprecedented amount of data on the
effects of mutations in proteins
• Unexplained differences in the power of studies
to discern the effects of mutations
For example, Lind’s analysis of ribosomal protein mutations shows
little difference between missense and synonymous mutations
(http://www.sciencemag.org/content/330/6005/825.abstract).
[Figure: Lind study distribution, annotated "Undiscernible graph?!"]

Missense Mutations
• Missense mutations change amino acids
• Most frequent of the three effect types
• Expected to cause a wide variety of effects
Example: GCC = alanine → GUC = valine (different amino acid)

Nonsense & Synonymous
Mutations
Synonymous mutations…
- Change codons but not amino acids
- Are expected to have very small effects
Example: GUU = valine → GUA = valine (different codon, same amino acid)
Nonsense mutations…
- Truncate proteins
- Have strong, deleterious effects
Example: CAG = glutamine → UAG = stop codon (truncated protein)
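The three effect types can be told apart mechanically by translating the codon before and after the mutation. The project itself used R; the sketch below is in Python purely for illustration, and the function name `classify_mutation` is hypothetical. It assumes the standard genetic code and single-codon substitutions.

```python
# Standard genetic code, built from the conventional U, C, A, G base ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_mutation(wild_type: str, mutant: str) -> str:
    """Classify a codon substitution as synonymous, nonsense, or missense.

    Accepts RNA or DNA codons (T is treated as U). A wild-type stop codon
    that mutates to an amino acid is reported as missense in this sketch.
    """
    wt_aa = CODON_TABLE[wild_type.upper().replace("T", "U")]
    mut_aa = CODON_TABLE[mutant.upper().replace("T", "U")]
    if mut_aa == wt_aa:
        return "synonymous"  # different codon, same amino acid
    if mut_aa == "*":
        return "nonsense"    # premature stop codon, truncated protein
    return "missense"        # different amino acid


print(classify_mutation("GCC", "GUC"))  # alanine to valine
print(classify_mutation("CAG", "UAG"))  # glutamine to stop
print(classify_mutation("GUU", "GUA"))  # valine to valine
```

The three calls correspond to a missense, a nonsense, and a synonymous change, in that order.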

Learning and Implementing R
• My project required learning R and writing code for the
development of analytical metrics

Using Rstudio for Data Analysis
• RStudio was used to compile distribution graphs of missense, nonsense,
and synonymous mutations, in stacked histogram form
[Figure: stacked histogram distributions for the Firnberg study]
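A stacked histogram reduces to counting mutations per fitness bin, split by effect type. The original graphs were made in RStudio; the following Python sketch only illustrates the counting step. The bin width, the function name `stacked_histogram`, and the fitness values are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

BIN_WIDTH = 0.25  # hypothetical bin width on the fitness axis


def stacked_histogram(records, bin_width=BIN_WIDTH):
    """Count mutations per fitness bin, split by effect type.

    records: iterable of (effect_type, fitness) pairs.
    Returns {bin_index: Counter({effect_type: count})}, where bin k covers
    fitness values in [k * bin_width, (k + 1) * bin_width).
    """
    counts = defaultdict(Counter)
    for effect_type, fitness in records:
        counts[math.floor(fitness / bin_width)][effect_type] += 1
    return counts


# Made-up fitness values, not taken from any study.
records = [
    ("nonsense", 0.02), ("nonsense", 0.10), ("missense", 0.15),
    ("missense", 0.60), ("missense", 0.90), ("missense", 0.95),
    ("synonymous", 0.97), ("synonymous", 1.02), ("synonymous", 1.05),
]
histogram = stacked_histogram(records)

# Each printed line is one bar; the per-type counts are its stacked segments.
for bin_index in sorted(histogram):
    low = bin_index * BIN_WIDTH
    segments = ", ".join(
        f"{kind}={n}" for kind, n in sorted(histogram[bin_index].items())
    )
    print(f"[{low:.2f}, {low + BIN_WIDTH:.2f}): {segments}")
```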

Visualizing DMS Data with Fitness
Fitness distribution graphs are based on:
• Y-axis: frequency of protein mutations
• X-axis: fitness level of the resulting cell
Frequency – number of mutations
Fitness – how fast the cell grows
[Figure: fitness distributions of nonsense, missense, and synonymous mutations]

Visualizing DMS Data with Quantiles
Quantile distribution graphs are based on:
• Y-axis: frequency of mutations relative
to the total number of protein mutations
• X-axis: fitness level of effect types
relative to the overall fitness of the cell
Frequency – number of mutations
Fitness – how fast the cell grows
[Figure: quantile distributions of nonsense, synonymous, and missense mutations]
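One plausible reading of the quantile graph is that each mutation's fitness is replaced by its rank among all mutations (scaled to 0-1), and counts are divided by the total number of mutations. The slides do not spell out the exact transformation, so the Python sketch below is an assumption, with hypothetical function names and made-up data. Ties are not treated specially.

```python
def to_quantiles(fitness_values):
    """Map each fitness value to its rank-based quantile in [0, 1].

    The lowest value maps to 0.0 and the highest to 1.0. Tied values
    receive distinct, adjacent ranks in this simple sketch.
    """
    n = len(fitness_values)
    order = sorted(range(n), key=lambda i: fitness_values[i])
    quantiles = [0.0] * n
    for rank, i in enumerate(order):
        quantiles[i] = rank / (n - 1) if n > 1 else 0.0
    return quantiles


def relative_frequencies(labels, quantiles, n_bins=4):
    """Fraction of all mutations falling in each (effect type, quantile bin)."""
    total = len(labels)
    freq = {}
    for label, q in zip(labels, quantiles):
        bin_index = min(int(q * n_bins), n_bins - 1)  # keep q == 1.0 in last bin
        key = (label, bin_index)
        freq[key] = freq.get(key, 0.0) + 1 / total
    return freq


# Made-up example: effect type and fitness for six mutations.
labels = ["nonsense", "nonsense", "missense", "missense", "synonymous", "synonymous"]
fitness = [0.05, 0.10, 0.40, 0.85, 0.98, 1.02]

quantiles = to_quantiles(fitness)
frequencies = relative_frequencies(labels, quantiles)
for (label, bin_index), share in sorted(frequencies.items()):
    print(f"{label:<11} bin {bin_index}: {share:.3f}")
```

Because both axes are relative, distributions from studies with different fitness scales become directly comparable, which appears to be the purpose of the quantile view.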

Developing Metrics for
Quality Analysis
Five metrics were developed to assess the
quality of fitness and quantile distributions:
• Standard deviation of synonymous mutations
• Difference of missense & nonsense averages
• Difference of synonymous & missense averages
• Difference of synonymous & nonsense averages
• Difference of nonsense average and min fitness
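The five metrics can be sketched as plain summary statistics over the fitness values of each effect type. This Python version is only an illustration (the project used R). It assumes the sample standard deviation, as R's `sd()` computes, and takes "min fitness" to mean the minimum over all mutations in the study. The function name and the input values are hypothetical; the dictionary keys follow the column names used in the deck's metric table.

```python
from statistics import mean, stdev


def compute_metrics(fitness_by_type):
    """Compute the five distribution metrics for one study.

    fitness_by_type: dict with keys "synonymous", "missense", "nonsense",
    each mapping to a list of fitness values.
    """
    syn = fitness_by_type["synonymous"]
    mis = fitness_by_type["missense"]
    non = fitness_by_type["nonsense"]
    lowest = min(syn + mis + non)  # assumed meaning of "min fitness"
    return {
        "Stan.dev": stdev(syn),               # spread of synonymous mutations
        "Mis.non": mean(mis) - mean(non),     # missense vs. nonsense
        "Syn.mis": mean(syn) - mean(mis),     # synonymous vs. missense
        "Syn.non": mean(syn) - mean(non),     # synonymous vs. nonsense
        "Non.fitness": mean(non) - lowest,    # nonsense vs. lowest fitness
    }


# Made-up fitness values for illustration only.
example = {
    "synonymous": [0.9, 1.0, 1.1],
    "missense": [0.4, 0.6, 0.8],
    "nonsense": [0.0, 0.1, 0.2],
}
metrics = compute_metrics(example)
for name, value in metrics.items():
    print(f"{name:<12}{value:.4f}")
```

By construction, Syn.non equals Mis.non plus Syn.mis, so the three difference metrics are not independent of one another.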

Completing Metric Analysis
with Five Steps
Metric Analysis – determining the
ability of each metric to evaluate
apparent study power
Apparent Study Power – how well
a distribution graph displays data
Our Five Steps
#1 • Compute metric values
#2 • Get R^2 values from cross validation
#3 • Plot metrics vs. R^2 values
#4 • Graph linear regressions (best fit lines)
#5 • Calculate P-values for each plot

Computing Metrics for Nine
Mutation Studies
• We had 25 studies, but only 9 studies contained the effect
types needed to calculate the metrics

Study      Stan.dev   Mis.non    Syn.mis    Syn.non    Non.fitness
Acevedo    0.250016   0.229262   0.234982   0.464244   0.080622
Carrasco   0.235649   0.275932   0.192387   0.46832    0
Dc_phi     NA         0.294643   NA         NA         0.101563
Firnberg   0.147203   0.159689   0.252223   0.411911   0.317862
Hietpas    0.051814   0.223672   0.419424   0.643095   0.268372
Peris      0.146945   0.306863   0.321471   0.628333   0
Roscoe     0.078261   0.379291   0.203929   0.58322    0.114418
Sanjuan    0.245534   0.263043   0.273641   0.536685   0
Wu_v1      0.199661   0.243483   0.279801   0.523284   0.17869

Getting R^2 Values from the
Cross Validation
• R^2 values – mean quantile (exchangeability) values from
each study that measure power on a 0-1 scale
• These R^2 values are compared against the five metric values
computed for the same nine studies

Study      R^2.values
Acevedo    0.14926564
Carrasco   0.15930005
Dc_phi     0.58369074
Firnberg   0.18015482
Hietpas    0.22684849
Peris      0.17337046
Roscoe     0.21835122
Sanjuan    0.18749149
Wu_v1      0.14267328

Plot - Standard Deviation of
Synonymous Mutations
• X-Axis Values: R^2 values
• Y-Axis Values: sd(synonymous)
• Linear Regression Slope: negative
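The regression behind this plot can be sketched with the values listed in the deck. The Python code below is an illustration, not the original R analysis. It assumes the R^2 values are listed in the same study order as the metric table, and it drops Dc_phi because its synonymous standard deviation is NA. The p-value is the usual two-sided t-test on the slope, obtained here by numerically integrating the Student t density with Simpson's rule so that only the standard library is needed.

```python
import math

# Eight studies with a defined sd(synonymous); Dc_phi (NA) is excluded.
# Pairing of R^2 values with studies by list order is an assumption.
r_squared = [0.14926564, 0.15930005, 0.18015482, 0.22684849,
             0.17337046, 0.21835122, 0.18749149, 0.14267328]
sd_synonymous = [0.250016, 0.235649, 0.147203, 0.051814,
                 0.146945, 0.078261, 0.245534, 0.199661]


def linear_regression(x, y):
    """Least-squares fit y = intercept + slope * x; also returns Pearson r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / math.sqrt(sxx * syy)


def t_pdf(t, df):
    """Probability density of Student's t distribution."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(r, n, steps=2000):
    """Two-sided p-value for a zero-slope null, from r and sample size n."""
    df = n - 2
    t_stat = abs(r) * math.sqrt(df / (1 - r * r))
    h = t_stat / steps  # Simpson's rule on [0, t_stat]; steps must be even
    total = t_pdf(0.0, df) + t_pdf(t_stat, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 1 - 2 * (total * h / 3)


slope, intercept, r = linear_regression(r_squared, sd_synonymous)
p_value = two_sided_p_value(r, len(r_squared))
print(f"slope={slope:.3f}  r={r:.3f}  p={p_value:.4f}")
```

Under these assumptions the fitted slope is negative, matching the slide, and the p-value comes out close to the 0.014 reported later in the deck for this metric.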

Plot - Difference of Missense &
Nonsense Averages
• Y-Axis Values: avg(mis) – avg(non)
• Linear Regression Slope: positive

Plot - Difference of Synonymous &
Missense Averages
• Y-Axis Values: avg(syn) – avg(mis)

Plot - Difference of Synonymous &
Nonsense Averages
• Y-Axis Values: avg(syn) – avg(non)

Plot - Difference of Average
Nonsense and Minimum Fitness
• Y-Axis Values: avg(non) – min(fitness)
• Linear Regression Slope: zero

Correlating Metrics with Apparent
Study Power
• P-values: calculated from each linear regression to measure the
significance of the correlation; lower values are better!

Metric                                           P-Value
Standard Deviation of Synonymous Mutations       0.014051
Difference of Missense & Nonsense Averages       0.53634
Difference of Synonymous & Missense Averages     0.304701
Difference of Synonymous & Nonsense Averages     0.128621
Difference of Nonsense Average and Min Fitness   0.975549

[Figure: bar chart "P-Value of Metrics" for Stan.dev, Mis.non, Syn.mis, Syn.non, and Non.fitness]

Our Conclusions Based on
the Metric Analysis
1. We developed one metric ideal for the quality analysis of mutation
distributions: Standard Deviation of Synonymous Mutations
2. There were not enough studies with available data to create linear
regressions that accurately evaluated the usability of each metric
3. We tested five metrics, so there was already a roughly 20% chance
(about 23% if the five tests were independent) that at least one
P-value would fall below 0.05 by chance alone
Future Work: develop MORE METRICS from the mutational data from
MORE STUDIES, to help researchers accurately assess the quality of their
studies and thus better discern the effects of mutations in proteins
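The multiple-testing caveat in conclusion 3 can be made concrete with a short calculation. This Python sketch assumes the five tests are independent, which is only an approximation here because some metrics are arithmetically related. The function name is hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests gives p < alpha
    when none of the metrics truly correlates with study power."""
    return 1 - (1 - alpha) ** n_tests


n_metrics = 5
print(f"Chance of at least one false positive: {familywise_error_rate(n_metrics):.3f}")

# A Bonferroni-style threshold keeps the overall error rate near alpha.
bonferroni_threshold = 0.05 / n_metrics
print(f"Per-test threshold after correction: {bonferroni_threshold:.3f}")
```

With five metrics the corrected per-test threshold is 0.01, so the best P-value in the deck (0.014051) would sit just above it. That is consistent with the call for more studies before any metric is relied upon.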

Thank You
Dr. Arlin Stoltzfus, Mentor
Dr. Mary Satterfield, MML Chief of Staff
Dr. Brandi Tolliver, NIST SURF Director
The SURF Program
