1. Developing Metrics to Discern
Apparent Study Power in
Protein Mutation Distributions
Anna Blendermann
Mentor: Arlin Stoltzfus
2. Deep Mutational Scanning
Deep Mutational Scanning: a technique that uses high-throughput
DNA sequencing to analyze protein mutations
Last month, an article appeared in Genetics with results
on 2000 mutants of the BRCA1 gene, which is linked to
breast cancer (http://www.genetics.org/content/200/2/413)
3. The Inadequacy of Recent Studies
Understanding the effects of mutations is a major
challenge in genomics, evolution, and medicine
Recent studies show…
An unprecedented amount of data on the
effects of mutations in proteins
Unexplained differences in the power of studies
to discern the effects of mutations
For example, Lind’s analysis of ribosomal protein mutations shows
little difference between missense and synonymous mutations
(http://www.sciencemag.org/content/330/6005/825.abstract)
Lind Study
INDISCERNIBLE GRAPH?!
4. Missense Mutations
GCC = alanine
GUC = valine
Different amino acid
Missense mutations change amino acids
Most frequent of the three effect types
Expected to cause a wide variety of effects
5. Nonsense & Synonymous Mutations
CAG = glutamine
UAG = stop codon
Truncated protein
GUC = valine
GUU = valine
Different codon, same amino acid
Synonymous mutations…
- Change codons but not amino acids
- Are expected to have very small effects
Nonsense mutations…
- Truncate proteins
- Have strong, deleterious effects
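The three effect types can be told apart mechanically by translating the codon before and after the mutation. The following is a minimal sketch in Python (the project itself used R); `classify_mutation` is a hypothetical helper name, and the lookup table is the standard genetic code written in the conventional UCAG order.

```python
# Standard genetic code, indexed by the conventional U, C, A, G base order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_mutation(wild_type: str, mutant: str) -> str:
    """Classify a single-codon change (hypothetical helper for illustration).

    Accepts RNA or DNA codons; T is treated as U.
    """
    before = CODON_TABLE[wild_type.upper().replace("T", "U")]
    after = CODON_TABLE[mutant.upper().replace("T", "U")]
    if before == after:
        return "synonymous"  # different codon, same amino acid (or stop)
    if after == "*":
        return "nonsense"    # new stop codon truncates the protein
    return "missense"        # different amino acid


print(classify_mutation("GCC", "GUC"))  # alanine to valine
print(classify_mutation("CAG", "UAG"))  # glutamine to stop
print(classify_mutation("GUC", "GUU"))  # valine to valine
```

Changes that turn a stop codon into an amino acid fall under "missense" in this sketch; a real pipeline would probably treat them separately.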
6. Learning and Implementing R
My project required learning R and writing code for the
development of analytical metrics
7. Using RStudio for Data Analysis
RStudio was used to compile distribution graphs of missense, nonsense,
and synonymous mutations as stacked histograms
Firnberg Study: Stacked Histogram Distributions
8. Visualizing DMS Data with Fitness
Fitness Distribution graphs are based on
Y-axis: frequency of protein mutations
X-axis: fitness level of the resulting cell
Frequency – number of mutations
Fitness – how fast the cell grows
Graph labels: Nonsense Mutations, Missense Mutations, Synonymous Mutations
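As a rough illustration of how such a fitness distribution is assembled, the sketch below bins mutations by fitness and counts each effect type per bin, which is the data behind a stacked histogram. It is written in Python rather than the R used in the project. The function name `stacked_histogram`, the bin width, and the example values are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def stacked_histogram(mutations, bin_width=0.25):
    """Count mutations per fitness bin, split by effect type.

    mutations: iterable of (effect_type, fitness) pairs.
    Returns a dict mapping bin index -> Counter of effect types, where
    bin index i covers fitness in [i * bin_width, (i + 1) * bin_width).
    """
    bins = defaultdict(Counter)
    for effect_type, fitness in mutations:
        bins[math.floor(fitness / bin_width)][effect_type] += 1
    return bins


# Hypothetical fitness values, not taken from any study.
example = [
    ("nonsense", 0.05), ("nonsense", 0.10),
    ("missense", 0.10), ("missense", 0.60),
    ("synonymous", 0.95), ("synonymous", 1.00),
]
histogram = stacked_histogram(example)
for index in sorted(histogram):
    low = index * 0.25
    print(f"fitness {low:.2f}-{low + 0.25:.2f}: {dict(histogram[index])}")
```

Each bin's total height is the frequency on the Y-axis; the per-type counts are the stacked segments.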
9. Visualizing DMS Data with Quantiles
Quantile Distribution graphs are based on
Y-axis: frequency of mutations relative
to the total number of protein
mutations
X-axis: fitness level of effect types
relative to the overall fitness of the cell
Frequency – number of mutations
Fitness – how fast the cell grows
Graph labels: Nonsense Mutations, Synonymous Mutations, Missense Mutations
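The slide does not spell out how the quantile scale is computed. One common reading, assumed here, is that each mutation's fitness is replaced by its rank among all mutations in the study, scaled to the 0-1 range. The Python sketch below shows that interpretation only; `fitness_quantiles` is a hypothetical name and the input values are invented.

```python
import bisect


def fitness_quantiles(fitness_values):
    """Map each fitness value to the fraction of all values that are <= it.

    This is an assumed definition of "quantile" (an empirical rank on a
    0-1 scale), not necessarily the one used in the original analysis.
    """
    ordered = sorted(fitness_values)
    n = len(ordered)
    return [bisect.bisect_right(ordered, f) / n for f in fitness_values]


# Hypothetical fitness values for four mutations.
print(fitness_quantiles([0.1, 0.9, 0.5, 0.5]))
```

On this scale studies with very different raw fitness units become comparable, since every study's values are spread over the same 0-1 interval.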
10. Developing Metrics for Quality Analysis
Five metrics were developed to assess the
quality of fitness and quantile distributions
- Standard deviation of synonymous mutations
- Difference of missense & nonsense averages
- Difference of synonymous & missense averages
- Difference of synonymous & nonsense averages
- Difference of nonsense average and min fitness
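The five metrics are simple summary statistics, so they can be sketched in a few lines. The Python version below is an illustration, not the original R code. The column names follow the table on slide 12. The sign conventions (synonymous minus missense, missense minus nonsense) and the use of the sample standard deviation are assumptions, chosen because they match the table, where Syn.non equals Mis.non plus Syn.mis. The input values are invented.

```python
from statistics import mean, stdev


def study_metrics(fitness_by_type):
    """Compute the five metrics for one study.

    fitness_by_type: dict with keys "synonymous", "missense", "nonsense",
    each mapping to a list of fitness values (hypothetical input layout).
    """
    syn = fitness_by_type["synonymous"]
    mis = fitness_by_type["missense"]
    non = fitness_by_type["nonsense"]
    lowest = min(syn + mis + non)  # minimum fitness over all mutations
    return {
        "Stan.dev": stdev(syn),               # sample SD (assumption)
        "Mis.non": mean(mis) - mean(non),
        "Syn.mis": mean(syn) - mean(mis),
        "Syn.non": mean(syn) - mean(non),
        "Non.fitness": mean(non) - lowest,
    }


# Hypothetical fitness values, not taken from any study.
example = {
    "synonymous": [0.9, 1.0, 1.1],
    "missense": [0.4, 0.6, 0.8],
    "nonsense": [0.0, 0.1, 0.2],
}
metrics = study_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A study missing one effect type cannot supply every metric, which is why some table entries are NA.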
11. Completing Metric Analysis with Five Steps
Metric Analysis – determining the
ability of each metric to evaluate
apparent study power
Apparent Study Power – how well
a distribution graph displays data
Our Five Steps
#1 • Compute metric values
#2 • Get R^2 values from cross validation
#3 • Plot metrics vs. R^2 values
#4 • Graph linear regressions (best fit lines)
#5 • Calculate P-values for each plot
12. Computing Metrics for Nine
Mutation Studies
We had 25 studies, but only 9 studies contained the effect
types needed to calculate the metrics
Study Stan.dev Mis.non Syn.mis Syn.non Non.fitness
Acevedo 0.250016 0.229262 0.234982 0.464244 0.080622
Carrasco 0.235649 0.275932 0.192387 0.46832 0
Dc_phi NA 0.294643 NA NA 0.101563
Firnberg 0.147203 0.159689 0.252223 0.411911 0.317862
Hietpas 0.051814 0.223672 0.419424 0.643095 0.268372
Peris 0.146945 0.306863 0.321471 0.628333 0
Roscoe 0.078261 0.379291 0.203929 0.58322 0.114418
Sanjuan 0.245534 0.263043 0.273641 0.536685 0
Wu_v1 0.199661 0.243483 0.279801 0.523284 0.17869
13. Getting R^2 Values from the
Cross Validation
R^2.values
0.14926564
0.15930005
0.58369074
0.18015482
0.22684849
0.17337046
0.21835122
0.18749149
0.14267328
R^2 values – mean quantile (exchangeability) values from
each study that measure power on 0-1 scale
VS
Study Stan.dev Mis.non Syn.mis Syn.non Non.fitness
Acevedo 0.250016 0.229262 0.234982 0.464244 0.080622
Carrasco 0.235649 0.275932 0.192387 0.46832 0
Dc_phi NA 0.294643 NA NA 0.101563
Firnberg 0.147203 0.159689 0.252223 0.411911 0.317862
Hietpas 0.051814 0.223672 0.419424 0.643095 0.268372
Peris 0.146945 0.306863 0.321471 0.628333 0
Roscoe 0.078261 0.379291 0.203929 0.58322 0.114418
Sanjuan 0.245534 0.263043 0.273641 0.536685 0
Wu_v1 0.199661 0.243483 0.279801 0.523284 0.17869
14. Plot - Standard Deviation of
Synonymous Mutations
X-Axis Values: r^2 values
Y-Axis Values: sd(synonymous)
Linear Regression Slope: negative
18. Plot - Difference of Average
Nonsense and Minimum Fitness
X-Axis Values: r^2 values
Y-Axis Values: avg(non) – min(fitness)
Linear Regression Slope: zero
19. Correlating Metrics with Apparent
Study Power
P-Values: values calculated from the linear regressions that measure
the significance of the correlation; lower values are better!
Metric                                          P-Value
Standard Deviation of Synonymous Mutations      0.014051
Difference of Missense & Nonsense Averages      0.53634
Difference of Synonymous & Missense Averages    0.304701
Difference of Synonymous & Nonsense Averages    0.128621
Difference of Nonsense Average and Min Fitness  0.975549
[Bar chart: P-Value of Metrics, one bar per metric]
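The regression and P-value steps can be sketched for one metric using the numbers shown on slides 12 and 13. The Python code below (the project used R) fits an ordinary least-squares line of Stan.dev against R^2 and computes the two-sided P-value for the slope from the t distribution. Two assumptions are made: the R^2 values are paired with the studies in row order, and Dc_phi is dropped because its Stan.dev is NA. The function names are hypothetical, and the t-distribution tail is integrated numerically so that only the standard library is needed.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided P-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t_stat)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)


def regress(xs, ys):
    """Least-squares slope, intercept, and P-value for slope = 0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    t_stat = r * math.sqrt((n - 2) / (1 - r * r))
    return slope, my - slope * mx, two_sided_p(t_stat, n - 2)


# (study, R^2, Stan.dev) from the slides; row-order pairing is an assumption.
rows = [
    ("Acevedo", 0.14926564, 0.250016), ("Carrasco", 0.15930005, 0.235649),
    ("Dc_phi", 0.58369074, None), ("Firnberg", 0.18015482, 0.147203),
    ("Hietpas", 0.22684849, 0.051814), ("Peris", 0.17337046, 0.146945),
    ("Roscoe", 0.21835122, 0.078261), ("Sanjuan", 0.18749149, 0.245534),
    ("Wu_v1", 0.14267328, 0.199661),
]
usable = [(r2, sd) for _, r2, sd in rows if sd is not None]
slope, intercept, p_value = regress([r2 for r2, _ in usable],
                                    [sd for _, sd in usable])
print(f"slope={slope:.3f} intercept={intercept:.3f} p={p_value:.4f}")
```

The printed P-value can be compared with the 0.014051 reported for this metric. The same call works for the other four columns after their NA rows are removed.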
20. Our Conclusions Based on the Metric Analysis
1. We developed one metric ideal for the quality analysis of mutation
distributions: Standard Deviation of Synonymous Mutations
2. There were not enough studies with available data to create linear
regressions that accurately evaluated the usability of each metric
3. We tested five metrics, so there was already a 15%-20% chance
that at least one P-value would fall below 0.05 by chance alone
Future Work: develop MORE METRICS from the mutational data of
MORE STUDIES, to help researchers accurately assess the quality of their
studies and thus better discern the effects of mutations in proteins
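The multiple-testing caveat in conclusion 3 can be made concrete with a short calculation. The Python sketch below assumes the five tests are independent, which is a simplification: in the metric table Syn.non is the sum of Mis.non and Syn.mis, so the metrics are correlated and the true chance may differ from this estimate. Under independence the figure comes out slightly above the 15%-20% quoted on the slide. The function name is hypothetical.

```python
def chance_of_false_positive(alpha, n_tests):
    """Probability that at least one of n independent tests gives P < alpha
    when no real correlation exists."""
    return 1 - (1 - alpha) ** n_tests


alpha, n_tests = 0.05, 5
family_rate = chance_of_false_positive(alpha, n_tests)
bonferroni_threshold = alpha / n_tests  # stricter per-test cutoff

print(f"chance of at least one P < {alpha}: {family_rate:.3f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold:.3f}")
```

With a Bonferroni-style threshold of 0.05 / 5 = 0.01, the smallest reported P-value (0.014051) would fall just short of significance, which fits the cautious tone of the conclusions.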
21. Thank You
Dr. Arlin Stoltzfus, Mentor
Dr. Mary Satterfield, MML Chief of Staff
Dr. Brandi Tolliver, NIST SURF Director
The SURF Program