Developing Metrics to Discern
Apparent Study Power in
Protein Mutation Distributions
Anna Blendermann
Mentor: Arlin Stoltzfus
Deep Mutational Scanning
• Deep Mutational Scanning: a technique that uses high-throughput DNA sequencing to analyze protein mutations
• Last month, an article appeared in Genetics with results on 2000 mutants of the BRCA1 gene, which is linked to breast cancer (http://www.genetics.org/content/200/2/413)
The Inadequacy of Recent Studies
Understanding the effects of mutations is a major challenge in genomics, evolution, and medicine
Recent studies show…
• An unprecedented amount of data on the effects of mutations in proteins
• Unexplained differences in the power of studies to discern the effects of mutations
For example, Lind’s analysis of ribosomal protein mutations shows little difference between missense and synonymous mutations (http://www.sciencemag.org/content/330/6005/825.abstract).
[Figure: Lind study distribution, labeled "Undiscernible graph?!"]
Missense Mutations
• Missense mutations change amino acids
• Most frequent of the three effect types
• Expected to cause a wide variety of effects
GCC = alanine
GUC = valine
Different amino acid
Nonsense & Synonymous Mutations
Synonymous mutations…
- Change codons but not amino acids
- Have very small effects (often treated as nearly neutral)
GUA = valine
GUU = valine
Different codon, same amino acid
Nonsense mutations…
- Truncate proteins
- Have strong, deleterious effects
CAG = glutamine
UAG = stop codon
Truncated protein
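The three effect types can be told apart mechanically from the standard genetic code. The sketch below is in Python rather than the R used for the project, and the helper name classify_codon_change is hypothetical. It is a simplification: any change that leaves the encoded amino acid unchanged is called synonymous, any change that creates a stop codon is called nonsense, and everything else (including loss of a stop codon) is lumped into missense.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref: str, alt: str) -> str:
    """Classify a single-codon change (hypothetical helper, DNA or RNA codons)."""
    ref_aa = CODON_TABLE[ref.upper().replace("U", "T")]
    alt_aa = CODON_TABLE[alt.upper().replace("U", "T")]
    if ref_aa == alt_aa:
        return "synonymous"  # different codon, same amino acid
    if alt_aa == "*":
        return "nonsense"    # new stop codon truncates the protein
    return "missense"        # different amino acid (simplified catch-all)


print(classify_codon_change("GCC", "GUC"))  # alanine to valine
print(classify_codon_change("CAG", "UAG"))  # glutamine to stop
print(classify_codon_change("CAG", "CAA"))  # glutamine to glutamine
```

The first two calls reuse the codon examples from the slides; the third pair (CAG and CAA, both glutamine) is added here only to illustrate the synonymous case.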
Learning and Implementing R
• My project required learning R and writing code to develop analytical metrics
Using RStudio for Data Analysis
• RStudio was used to compile distribution graphs of missense, nonsense, and synonymous mutations in stacked histogram form
[Figure: stacked histogram distributions, Firnberg study]
Visualizing DMS Data with Fitness
Fitness distribution graphs are based on
• Y-axis: frequency of protein mutations
• X-axis: fitness level of the resulting cell
Frequency – number of mutations
Fitness – how fast the cell grows
[Figure: fitness distribution with nonsense, missense, and synonymous mutations labeled]
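The counts behind a stacked histogram like this one can be sketched in a few lines. The example is Python rather than the R used in the project; the function name stacked_histogram, the bin settings, and the sample data are all hypothetical and only illustrate the idea of counting mutations per fitness bin, split by effect type.

```python
from collections import Counter


def stacked_histogram(mutations, n_bins=10, lo=0.0, hi=1.0):
    """Count mutations per fitness bin, split by effect type.

    mutations: iterable of (effect_type, fitness) pairs.
    Returns a list of Counters, one per bin, keyed by effect type.
    """
    counts = [Counter() for _ in range(n_bins)]
    for effect, fitness in mutations:
        if not lo <= fitness <= hi:
            continue  # ignore values outside the plotted range
        # The top edge (fitness == hi) is folded into the last bin.
        index = min(int((fitness - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[index][effect] += 1
    return counts


# Made-up data for illustration only, not taken from any study.
sample = [
    ("nonsense", 0.05), ("nonsense", 0.12),
    ("missense", 0.15), ("missense", 0.55), ("missense", 0.95),
    ("synonymous", 0.93), ("synonymous", 0.98),
]

for i, bin_counts in enumerate(stacked_histogram(sample)):
    if bin_counts:
        print(f"bin {i}: {dict(bin_counts)}")
```

Each Counter holds the heights of the stacked segments for one bar, so the frequency on the Y-axis is the sum of the counts in that bin.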
Visualizing DMS Data with Quantiles
Quantile distribution graphs are based on
• Y-axis: frequency of mutations relative to the total number of protein mutations
• X-axis: fitness level of effect types relative to the overall fitness of the cell
Frequency – number of mutations
Fitness – how fast the cell grows
[Figure: quantile distribution with nonsense, synonymous, and missense mutations labeled]
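The slides do not give the formula used to put fitness on a quantile scale. One common choice, assumed here purely for illustration, is a rank transform: each mutation's fitness is replaced by its rank among all mutations in the study, scaled to the range 0-1, with ties sharing their average rank. The Python helper to_quantiles below is hypothetical and is not taken from the project's R code.

```python
def to_quantiles(values):
    """Map each value to its rank-based quantile in [0, 1] (ties share a rank)."""
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    quantiles = [0.0] * n
    start = 0
    while start < n:
        # Extend the group while the next sorted value ties with the first.
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2  # zero-based rank shared by the group
        for k in range(start, end + 1):
            quantiles[order[k]] = average_rank / (n - 1)
        start = end + 1
    return quantiles


print(to_quantiles([0.2, 1.0, 0.6]))
print(to_quantiles([1.0, 1.0, 3.0]))
```

Under this assumption the lowest-fitness mutation always maps to 0 and the highest to 1, which makes distributions from different studies comparable on a shared axis.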
Developing Metrics for Quality Analysis
Five metrics were developed to assess the quality of fitness and quantile distributions:
• Standard deviation of synonymous mutations
• Difference of missense & nonsense averages
• Difference of synonymous & missense averages
• Difference of synonymous & nonsense averages
• Difference of nonsense average and min fitness
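The five metrics can be written down directly from their definitions. This is a Python sketch, not the project's R code; the function name study_metrics is hypothetical, the keys follow the column labels used in the results table (Stan.dev, Mis.non, Syn.mis, Syn.non, Non.fitness), and taking min(fitness) over all mutations in the study is an assumption. A metric is returned as None when the effect type it needs is missing.

```python
from statistics import mean, stdev


def study_metrics(syn, mis, non):
    """Compute the five metrics from per-effect-type fitness lists."""
    def avg(values):
        return mean(values) if values else None

    def diff(a, b):
        return None if a is None or b is None else a - b

    all_fitness = [*syn, *mis, *non]
    lowest = min(all_fitness) if all_fitness else None  # assumed: min over all types
    syn_avg, mis_avg, non_avg = avg(syn), avg(mis), avg(non)
    return {
        # Sample standard deviation needs at least two synonymous values.
        "Stan.dev": stdev(syn) if len(syn) >= 2 else None,
        "Mis.non": diff(mis_avg, non_avg),
        "Syn.mis": diff(syn_avg, mis_avg),
        "Syn.non": diff(syn_avg, non_avg),
        "Non.fitness": diff(non_avg, lowest),
    }


# Made-up fitness values for illustration only.
print(study_metrics(syn=[0.9, 1.0, 1.1], mis=[0.5, 0.7, 0.9], non=[0.1, 0.3]))
# A study with no synonymous data loses the three metrics that depend on it.
print(study_metrics(syn=[], mis=[0.5, 0.7], non=[0.1, 0.3]))
```

A study without synonymous mutations ends up with three missing values, the same pattern as the NA entries in the Dc_phi row of the table.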
Completing Metric Analysis with Five Steps
Metric Analysis – determining the ability of each metric to evaluate apparent study power
Apparent Study Power – how well a distribution graph displays data
Our Five Steps
#1 • Compute metric values
#2 • Get R^2 values from cross validation
#3 • Plot metrics vs. R^2 values
#4 • Graph linear regressions (best fit lines)
#5 • Calculate P-values for each plot
Computing Metrics for Nine Mutation Studies
• We had 25 studies, but only 9 contained the effect types needed to calculate the metrics
Study     Stan.dev  Mis.non   Syn.mis   Syn.non   Non.fitness
Acevedo   0.250016  0.229262  0.234982  0.464244  0.080622
Carrasco  0.235649  0.275932  0.192387  0.46832   0
Dc_phi    NA        0.294643  NA        NA        0.101563
Firnberg  0.147203  0.159689  0.252223  0.411911  0.317862
Hietpas   0.051814  0.223672  0.419424  0.643095  0.268372
Peris     0.146945  0.306863  0.321471  0.628333  0
Roscoe    0.078261  0.379291  0.203929  0.58322   0.114418
Sanjuan   0.245534  0.263043  0.273641  0.536685  0
Wu_v1     0.199661  0.243483  0.279801  0.523284  0.17869
Getting R^2 Values from the Cross Validation
• R^2 values – mean quantile (exchangeability) values from each study that measure power on a 0-1 scale
R^2.values
0.14926564
0.15930005
0.58369074
0.18015482
0.22684849
0.17337046
0.21835122
0.18749149
0.14267328
VS the metric values for the nine studies in the table above (Stan.dev, Mis.non, Syn.mis, Syn.non, Non.fitness)
Plot - Standard Deviation of Synonymous Mutations
• X-Axis Values: R^2 values
• Y-Axis Values: sd(synonymous)
• Linear Regression Slope: negative
Plot - Difference of Missense & Nonsense Averages
• X-Axis Values: R^2 values
• Y-Axis Values: avg(mis) – avg(non)
• Linear Regression Slope: positive
Plot - Difference of Synonymous & Missense Averages
• X-Axis Values: R^2 values
• Y-Axis Values: avg(syn) – avg(mis)
• Linear Regression Slope: positive
Plot - Difference of Synonymous & Nonsense Averages
• X-Axis Values: R^2 values
• Y-Axis Values: avg(syn) – avg(non)
• Linear Regression Slope: positive
Plot - Difference of Average Nonsense and Minimum Fitness
• X-Axis Values: R^2 values
• Y-Axis Values: avg(non) – min(fitness)
• Linear Regression Slope: zero
Correlating Metrics with Apparent Study Power
• P-Values: values calculated from the linear regressions that measure the significance of each correlation; lower values are better!
Metric                                          P-Value
Standard Deviation of Synonymous Mutations      0.014051
Difference of Missense & Nonsense Averages      0.53634
Difference of Synonymous & Missense Averages    0.304701
Difference of Synonymous & Nonsense Averages    0.128621
Difference of Nonsense Average and Min Fitness  0.975549
[Bar chart: P-Value of Metrics. Bars: Column1 0.01, Mis.non 0.54, Syn.mis 0.3, Syn.non 0.13, Non.fitness 0.98]
Our Conclusions Based on the Metric Analysis
1. We developed one metric ideal for the quality analysis of mutation distributions: Standard Deviation of Synonymous Mutations
2. There were not enough studies with available data to create linear regressions that accurately evaluated the usability of each metric
3. We tested five metrics, so there was already a 15%-20% chance that at least one P-value would fall below 0.05 by chance alone
Future Work: develop MORE METRICS from the mutational data of MORE STUDIES, to help researchers accurately assess the quality of their studies and thus better discern the effects of mutations in proteins
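The multiple-testing point in conclusion 3 can be checked with a short calculation. The Python sketch below assumes the five tests are independent, which is not strictly true here because all five metrics are computed from the same studies, so the figure is only a rough estimate. The Bonferroni threshold is included as one standard, conservative way to account for testing five metrics; the slides do not apply it.

```python
ALPHA = 0.05     # per-test significance level
N_METRICS = 5    # number of metrics tested

# Chance that at least one of N independent tests gives p < ALPHA by chance.
family_wise_error = 1 - (1 - ALPHA) ** N_METRICS

# Bonferroni correction: divide the significance level by the number of tests.
bonferroni_threshold = ALPHA / N_METRICS

best_p_value = 0.014051  # Standard Deviation of Synonymous Mutations

print(f"chance of at least one false positive: {family_wise_error:.1%}")
print(f"Bonferroni threshold: {bonferroni_threshold:.2f}")
print(f"best metric passes Bonferroni: {best_p_value < bonferroni_threshold}")
```

Under the independence assumption the chance of at least one false positive is about 22.6%, slightly above the 15%-20% quoted. The best P-value of 0.014051 does not fall below the Bonferroni threshold of 0.01, which is consistent with the call for more studies and more metrics.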
Thank You
Dr. Arlin Stoltzfus, Mentor
Dr. Mary Satterfield, MML Chief of Staff
Dr. Brandi Tolliver, NIST SURF Director
The SURF Program

