This document discusses several common problems with data handling and quality, including building and testing models with the same data, confusing biological with technical replicates, and identifying and handling outliers. It provides examples and explanations of key concepts such as experimental and sampling units, pseudo-replication, outliers versus high-influence points, and leverage plots. It emphasizes proper data handling techniques, such as dividing data into training, test, and confirmation sets and using cross-validation, to avoid overfitting models and generating spurious findings.
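As a concrete sketch of these ideas, assuming scikit-learn (the feature matrix X and labels y below are simulated placeholders, not data from the document), held-out evaluation and cross-validation look like this:

# Evaluate a model on held-out data instead of the data used to fit it.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # hypothetical predictors
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # hypothetical outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))  # optimistic
print("test accuracy:", model.score(X_test, y_test))        # honest estimate

# 5-fold cross-validation reuses all the data without ever scoring on a training fold.
print("CV accuracy:", cross_val_score(LogisticRegression(), X, y, cv=5).mean())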
The document provides an overview of statistical testing, including:
- When to use parametric vs. nonparametric tests
- When large sample tests or exact tests are needed
- When adjustments for multiple testing are required
It discusses key concepts like null and alternative hypotheses, test statistics, p-values, and type I and II errors. Examples of the Student's t-test and Wilcoxon rank sum test are provided.
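A minimal illustration of the two named tests, assuming SciPy (the samples are simulated placeholders):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.8, scale=1.0, size=30)

t_stat, t_p = stats.ttest_ind(group_a, group_b)   # Student's t-test (assumes normality)
w_stat, w_p = stats.ranksums(group_a, group_b)    # Wilcoxon rank-sum (distribution-free)
print(f"t-test: t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Wilcoxon rank-sum: W = {w_stat:.2f}, p = {w_p:.4f}")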
This document discusses methods for analyzing categorical data and response variables, including contingency tables, chi-square tests, Fisher's exact test, odds ratios, logistic regression, and generalized linear models. Contingency tables are used to display relationships between categorical variables and tests of independence. Fisher's exact test and chi-square tests determine if a relationship is statistically significant. Odds ratios and relative risk indicate the magnitude of relationships. Logistic regression models relationships between continuous predictors and categorical responses. Generalized linear models extend these methods.
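A minimal sketch of a 2x2 contingency-table analysis, assuming SciPy (the exposure-by-outcome counts are hypothetical):

import numpy as np
from scipy import stats

#                 outcome+  outcome-
table = np.array([[20,       80],    # exposed
                  [10,      140]])   # unexposed

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)  # test of independence
odds_ratio, p_fisher = stats.fisher_exact(table)             # exact test plus odds ratio
print(f"chi-square: chi2 = {chi2:.2f} (df = {dof}), p = {p_chi2:.4f}")
print(f"Fisher's exact: OR = {odds_ratio:.2f}, p = {p_fisher:.4f}")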
Choosing appropriate statistical test RSS6 2104 - RSS6
This document discusses choosing appropriate statistical tests based on study design and data type. It covers descriptive studies that measure prevalence and incidence, as well as analytic studies like randomized controlled trials, cohort studies, and case-control studies. For data type, it discusses approaches for continuous and categorical variables, including t-tests, ANOVA, chi-square tests, and regression. It also discusses measures of disease frequency, effect, and impact like risk difference, risk ratio, and odds ratio.
Vergoulas: Choosing the appropriate statistical test (2019, Hippokratia journal) - Vaggelis Vergoulas
This document provides a step-by-step guide for choosing the appropriate statistical test for data analysis. It outlines 7 key steps: 1) determining if the analysis is univariate or multivariable, 2) identifying if the study examines differences or correlations, 3) determining if the data is paired or independent, 4) characterizing the type of outcome variable, 5) assessing the normality of distribution for continuous variables, 6) identifying the number of groups for independent variables, and 7) selecting valid statistical tests that match the characteristics identified in the previous steps, such as t-tests, ANOVA, regression analyses. Examples of applying this process are provided.
Assumptions of parametric and non-parametric tests
Testing the assumption of normality (a sketch follows this list)
Commonly used non-parametric tests
Applying tests in SPSS
Advantages of non-parametric tests
Limitations
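For the normality item above, a minimal sketch assuming SciPy (the sample is a simulated placeholder):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.exponential(scale=2.0, size=40)   # deliberately non-normal data

w, p = stats.shapiro(sample)                   # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.4f}")
if p < 0.05:
    print("Normality rejected: prefer a non-parametric test (e.g. Wilcoxon).")
else:
    print("No evidence against normality: a t-test is defensible.")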
1) The document discusses commonly used statistical tests in research such as descriptive statistics, inferential statistics, hypothesis testing, and tests like t-tests, ANOVA, chi-square tests, and normal distributions.
2) It provides examples of how to determine sample sizes needed for adequate power in hypothesis testing (see the sketch after this list) and how to perform t-tests to analyze sample means.
3) Key statistical concepts covered include parameters, statistics, measurement scales, type I and II errors, and interpreting results of hypothesis tests.
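A minimal sample-size sketch for a two-sample t-test, assuming statsmodels (the effect size, alpha, and power are hypothetical choices, not values from the document):

from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,            # Cohen's d (hypothetical)
    alpha=0.05,                 # type I error rate
    power=0.80,                 # 1 - type II error rate
    alternative='two-sided',
)
print(f"about {n_per_group:.0f} subjects per group")   # roughly 64 for d = 0.5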
Statistical methods for the life sciences - lbpriyaupm
This document discusses nonparametric statistical methods and rank tests. It begins with an introductory example comparing blood pressure readings before and after surgery using a paired t-test approach and discussing its assumptions. It then introduces nonparametric hypothesis testing as an alternative that does not rely on distributional assumptions. The document outlines what test to use based on factors like the number of groups and whether samples are independent or dependent. It provides detailed examples of the Wilcoxon rank sum test to compare two independent samples and the Kruskal-Wallis and Friedman tests for comparing more than two groups.
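A minimal sketch of the rank tests for more than two groups, assuming SciPy (the three groups are simulated placeholders):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
g1, g2, g3 = (rng.normal(loc=m, size=15) for m in (0.0, 0.4, 0.9))

h, p_kw = stats.kruskal(g1, g2, g3)              # independent groups
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p_kw:.4f}")

# Friedman test: here the three arrays stand in for repeated measures on the same 15 subjects.
chi2, p_fr = stats.friedmanchisquare(g1, g2, g3)
print(f"Friedman: chi2 = {chi2:.2f}, p = {p_fr:.4f}")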
This document discusses parametric and non-parametric statistical tests. It begins by defining different types of data and the standard normal distribution curve. It then covers hypothesis testing, including the different types of errors. Both parametric and non-parametric tests are examined. Parametric tests discussed include z-tests, t-tests, and ANOVA, while non-parametric tests include chi-square, sign tests, McNemar's test, and Fisher's exact test. Examples are provided to illustrate several of the tests.
This document provides guidance on choosing appropriate statistical tests based on characteristics of the data and research questions. It outlines initial questions to consider, such as the number of samples, whether the data is parametric or non-parametric, and the number and independence of groups. Tables show which tests are suited for different data scales (nominal, ordinal, interval/ratio) and sample configurations (one sample, two independent samples, two related samples, more than two samples). The assumptions of various statistical tests like t-tests, ANOVA, chi-square, and correlation analyses are also reviewed.
Parametric and non-parametric statistical tests in clinical trials - Vinod Pagidipalli
The document discusses parametric and non-parametric statistical tests used in clinical trials. Parametric tests like the z-test, t-test, ANOVA, and correlation tests are used when data follows a normal distribution. Non-parametric tests like the chi-square test, Fisher's exact test, and binomial test are used when data cannot be assumed to be normally distributed. Several statistical tests are described, including how to apply them in clinical trials to compare treatment groups, analyze associations between variables, and test hypotheses about population proportions.
This document provides an outline and examples for teaching nonparametric tests. It discusses the differences between parametric and nonparametric tests, outlines four key nonparametric tests (Mann-Whitney U, Wilcoxon Signed-Rank, Kruskal-Wallis, Wilcoxon Rank Sum), and provides three examples applying these tests to analyze medical data comparing treatment groups or measuring values before and after an intervention. Exercises are also provided for students to practice applying these nonparametric tests.
This document provides an overview of non-parametric statistics. It defines non-parametric tests as those that make fewer assumptions than parametric tests, such as not assuming a normal distribution. The document compares and contrasts parametric and non-parametric tests. It then explains several common non-parametric tests - the Mann-Whitney U test, Wilcoxon signed-rank test, sign test, and Kruskal-Wallis test - and provides examples of how to perform and interpret each test.
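A minimal sketch of the Mann-Whitney U and Wilcoxon signed-rank tests named above, assuming SciPy (the before/after values are simulated placeholders, shown to both tests purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
before = rng.normal(loc=120, scale=10, size=20)
after = before - rng.normal(loc=5, scale=4, size=20)   # a paired decrease

u, p_u = stats.mannwhitneyu(before, after)   # for two independent samples
w, p_w = stats.wilcoxon(before, after)       # for paired samples
print(f"Mann-Whitney U: U = {u:.1f}, p = {p_u:.4f}")
print(f"Wilcoxon signed-rank: W = {w:.1f}, p = {p_w:.4f}")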
Statistical Significance Testing in Information Retrieval: An Empirical Analy... - Julián Urbano
Statistical significance testing is widely accepted as a means to assess how well a difference in effectiveness reflects an actual difference between systems, as opposed to random noise because of the selection of topics. According to recent surveys on SIGIR, CIKM, ECIR and TOIS papers, the t-test is the most popular choice among IR researchers. However, previous work has suggested computer intensive tests like the bootstrap or the permutation test, based mainly on theoretical arguments. On empirical grounds, others have suggested non-parametric alternatives such as the Wilcoxon test. Indeed, the question of which tests we should use has accompanied IR and related fields for decades now. Previous theoretical studies on this matter were limited in that we know that test assumptions are not met in IR experiments, and empirical studies were limited in that we do not have the necessary control over the null hypotheses to compute actual Type I and Type II error rates under realistic conditions. Therefore, not only is it unclear which test to use, but also how much trust we should put in them. In contrast to past studies, in this paper we employ a recent simulation methodology from TREC data to work around these limitations. Our study comprises over 500 million p-values computed for a range of tests, systems, effectiveness measures, topic set sizes and effect sizes, and for both the 2-tail and 1-tail cases. Having such a large supply of IR evaluation data with full knowledge of the null hypotheses, we are finally in a position to evaluate how well statistical significance tests really behave with IR data, and make sound recommendations for practitioners.
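For readers unfamiliar with the computer-intensive tests mentioned, here is a minimal paired permutation test in plain NumPy (the per-topic scores of the two "systems" are simulated placeholders):

import numpy as np

rng = np.random.default_rng(5)
system_a = rng.uniform(0.2, 0.8, size=50)                # hypothetical per-topic scores
system_b = system_a + rng.normal(0.02, 0.05, size=50)

diffs = system_b - system_a
observed = abs(diffs.mean())

n_perm = 10_000
count = 0
for _ in range(n_perm):
    signs = rng.choice([-1.0, 1.0], size=diffs.size)     # randomly swap system labels per topic
    if abs((signs * diffs).mean()) >= observed:
        count += 1
print(f"two-tailed permutation p = {(count + 1) / (n_perm + 1):.4f}")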
This document provides an overview of statistical analyses that can be performed in PRISM. It discusses how to perform common statistical tests like t-tests, ANOVA, linear regression, and summarizes the appropriate tests to address different research questions. Examples are given of how to analyze pre-post treatment data using paired t-tests and compare groups using independent t-tests or ANOVA. Guidance is also provided on interpreting results and checking assumptions.
Explains how to select a statistical test suitable for your hypothesis. Suggests points to consider before deciding about a test. Gives a list of commonly used parametric and non-parametric tests with their purposes of use.
This document discusses various statistical concepts including descriptive statistics, inferential statistics, univariate analysis, bivariate analysis, and multivariate analysis. It provides examples of common statistical tests like t-tests, correlation analysis, ANOVA, and discusses how to identify independent and dependent variables and choose appropriate parametric or non-parametric tests based on the nature of the variables and data. Key topics covered include descriptive versus inferential analysis, different types of correlations, using ANOVA to compare multiple groups, and how to determine assumptions and select the correct statistical test for different research questions and study designs.
The document discusses various non-parametric statistical tests that can be used to analyze data when the assumptions of parametric tests are not met. It provides examples of how each test can be used, including chi-square test, binomial test, runs test, Mann-Whitney U test, Kruskal-Wallis test, median test, Wilcoxon test, McNemar test, Friedman test, and Cochran's Q test. For each test, it describes a scenario and states which test should be used to analyze the corresponding data.
The document discusses choosing appropriate statistical tests for analyzing medical research studies. It provides an overview of commonly used statistical tests such as the t-test, chi-square test, Fisher's exact test, analysis of variance, and Wilcoxon rank sum test. The document outlines the key factors to consider when selecting a statistical test, such as the scale of measurement (continuous, categorical, binary) and study design (paired or unpaired). Algorithms and tables are provided to help readers identify the proper statistical test based on these characteristics.
The document discusses non-parametric statistical tests and provides examples of their use in SPSS. It introduces key non-parametric tests including the chi-square test, binomial test, runs test, Kolmogorov-Smirnov test, Mann-Whitney U test, and Kruskal-Wallis H test. Each test is explained and an example is provided demonstrating how to conduct the test in SPSS, interpret the output, and determine if results are statistically significant. The document serves as a hands-on guide for using various non-parametric tests to analyze data when parametric assumptions are not met.
The document discusses non-parametric tests and provides information about when to use them. Non-parametric tests make fewer assumptions about the distribution of population values and can be used when sample sizes are small or the data is ordinal. Examples of non-parametric tests provided include the sign test, chi-square test, Mann-Whitney U test, and Kruskal-Wallis test. The general steps to perform a non-parametric test are also outlined.
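The sign test mentioned above reduces to a binomial test on the signs of the paired differences; a minimal sketch assuming SciPy (simulated placeholder data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
before = rng.normal(size=25)
after = before + rng.normal(loc=0.3, scale=0.5, size=25)

n_pos = int(np.sum(after > before))        # ties would be dropped first
n = int(np.sum(after != before))
result = stats.binomtest(n_pos, n, p=0.5)  # H0: increases and decreases equally likely
print(f"sign test: {n_pos}/{n} positive, p = {result.pvalue:.4f}")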
Commonly Used Statistics in Medical Research Part I - Pat Barlow
This presentation covers a brief introduction to some of the more common statistical analyses we run into while working with medical residents. The point is to make the audience familiar with these statistics rather than calculate them, so it is well-suited for journal clubs or other EBM-related sessions. By the end of this presentation the students should be able to:
• Define parametric and descriptive statistics
• Compare and contrast three primary classes of parametric statistics: relationships, group differences, and repeated measures with regards to when and why to use each
• Link parametric statistics with their non-parametric equivalents
• Identify the benefits and risks associated with using multivariate statistics
• Match research scenarios with the appropriate parametric statistics
The presentation is accompanied with the following handout: http://slidesha.re/1178weg
This document discusses various statistical tests that can be used to analyze data, including parametric and nonparametric tests. It provides examples of tests for comparing means from one group, two independent groups, two related groups, and multiple groups. These include t-tests, ANOVA, correlation, regression, chi-square tests, and nonparametric alternatives like the Wilcoxon and Kruskal-Wallis tests. It also discusses assumptions of parametric versus nonparametric tests and selecting the appropriate test based on the scale of the data.
This document provides an overview of statistics concepts for a pharmacy course. It discusses topics like variables, populations and samples, levels of measurement for data, types of studies like randomized controlled trials, and key steps to planning a study. The document is intended to cover fundamental statistical concepts and their applications in pharmaceutical research and clinical trials.
Bio statistics 2 / certified fixed orthodontic courses by Indian Dental Academy - Indian Dental Academy
The Indian Dental Academy is the leader in continuing dental education, training dentists in all aspects of dentistry and offering a wide range of certified dental courses in different formats. The academy provides courses in dental crown & bridge, rotary endodontics, fixed orthodontics, and dental implants. For details, please visit www.indiandentalacademy.com or call 0091-9248678078.
The document provides definitions and information about biostatistics including:
1. Biostatistics is the branch of statistics dealing with the application of statistical methods to health sciences data. It is used for collecting, presenting, analyzing, and interpreting data to make decisions.
2. The goals of studying biostatistics include conducting investigations, research management, making inferences from samples, understanding valid statistical claims, and evaluating health programs.
3. There are two main branches of statistics - descriptive statistics which summarizes data, and inferential statistics which makes generalizations about populations from samples through estimation and hypothesis testing.
This document discusses the 6BIO8 biology exam paper and provides guidance on statistical tests that may be covered. It outlines the three questions on the exam: 1) based on practical experiments, 2) statistical tests like correlation, t-test, chi-square, Mann-Whitney U, and Wilcoxon, 3) planning an investigation. It then explains how to perform and interpret the statistical tests of correlation, t-test, chi-square, Mann-Whitney U, and Wilcoxon paired test. Guidance is provided on null hypotheses, critical values, and interpreting results for each test.
Parametric vs Nonparametric Tests: When to use which - Gönenç Dalgıç
There are several statistical tests which can be categorized as parametric and nonparametric. This presentation will help readers identify which type of test is appropriate for particular data features.
This document provides an overview of non-parametric tests presented by Ms. Prajakta Sawant. It discusses non-parametric tests as distribution-free statistical tests that do not require assumptions about the underlying population distribution. Common non-parametric tests described include the Wilcoxon rank-sum test, Kruskal-Wallis test, Spearman's rank correlation coefficient, and the chi-square test. Examples are provided for each test to illustrate their application and interpretation.
Statistical Learning and Model Selection (1).pptx - rajalakshmi5921
This document discusses statistical learning and model selection. It introduces statistical learning problems, statistical models, the need for statistical modeling, and issues around evaluating models. Key points include: statistical learning involves using data to build a predictive model; a good model balances bias and variance to minimize prediction error; cross-validation is described as the ideal procedure for evaluating models without overfitting to the test data.
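A minimal sketch of the bias-variance idea via cross-validated model selection, assuming scikit-learn (the noisy sine data are simulated placeholders):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=80).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=80)

for degree in (1, 3, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: CV MSE = {mse:.3f}")   # degree 1 underfits, degree 9 overfits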
This document discusses key concepts related to sampling theory and measurement in research studies. It defines important sampling terms like population, sampling criteria, sampling methods, sampling error and bias. It also covers levels of measurement, reliability, validity and various measurement strategies like physiological measures, observations, interviews, questionnaires and scales. Finally, it provides an overview of statistical analysis techniques including descriptive statistics, inferential statistics, the normal curve and common tests like t-tests, ANOVA, and regression analysis.
This document discusses key concepts related to sampling and measurement in research. It covers topics such as population and sampling criteria when selecting a sample. It also discusses levels of measurement, reliability, validity, and different measurement strategies like interviews, questionnaires, and scales. Finally, it provides an overview of statistical analysis, including descriptive statistics, levels of measurement, and common statistical tests. The overall purpose is to introduce fundamental concepts for designing research studies and analyzing quantitative data.
This document provides an overview of key concepts in statistics and experimental design. It discusses what data is and how it is collected. It introduces variables, relationships between variables, and data matrices. It describes the differences between observational studies and experiments, and discusses sources of bias and different sampling methods. It covers principles of experimental design such as controlling variables, randomization, replication, and blocking. Finally, it distinguishes between random sampling and random assignment in experiments.
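A minimal sketch distinguishing random sampling from random assignment, assuming NumPy (the population of subject IDs is hypothetical):

import numpy as np

rng = np.random.default_rng(8)
population = np.arange(1000)                              # hypothetical subject IDs

sample = rng.choice(population, size=40, replace=False)   # random sampling: who is measured
shuffled = rng.permutation(sample)
treatment, control = shuffled[:20], shuffled[20:]         # random assignment: who is treated
print(f"treatment n = {len(treatment)}, control n = {len(control)}")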
Non-parametric study: Statistical approach for medical students - Dr. Rupendra Bharti
Non-parametric statistics are statistical methods that do not rely on assumptions about the probability distributions of the variables being assessed. They make fewer assumptions than parametric tests and can be used with ordinal or nominal data. Some common non-parametric tests include the chi-square test, McNemar's test, sign test, Wilcoxon signed-rank test, Mann-Whitney U test, Kruskal-Wallis test, and Spearman's rank correlation test. Non-parametric tests are useful when the data is ranked or does not meet the assumptions of parametric tests, as they provide a distribution-free way to perform statistical hypothesis testing.
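A minimal sketch of Spearman's rank correlation, one of the tests listed, assuming SciPy (the monotone-but-nonlinear data are simulated placeholders):

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, size=30)
y = np.exp(x / 4) + rng.normal(scale=0.5, size=30)   # monotone but not linear

rho, p = stats.spearmanr(x, y)
print(f"Spearman rho = {rho:.2f}, p = {p:.4g}")      # high rho despite the nonlinearity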
Introduction to Data Management in Human Ecology - Kern Rocke
This document provides an introduction to data management concepts in human ecology. It defines data and describes common data types like qualitative and quantitative data. It also discusses topics like sources of data, types of statistical analyses, strategies for computer-aided analysis, principles of statistical analysis, and interpreting p-values. Examples of statistical programs and various statistical analysis methods for comparing groups and exploring relationships between variables are also outlined.
Worked examples of sampling uncertainty evaluation - GH Yeoh
The ISO/IEC 17025:2017 laboratory accreditation standard has expanded its requirement for measurement uncertainty to include both sampling and analytical uncertainties.
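The standard way to combine the two components is in quadrature, as in this sketch with hypothetical numbers (the quadrature rule is the usual GUM approach, not necessarily the slides' own worked examples):

import math

u_sampling = 0.12   # relative standard uncertainty from sampling (hypothetical)
u_analysis = 0.05   # relative standard uncertainty from the analytical method (hypothetical)

u_combined = math.sqrt(u_sampling**2 + u_analysis**2)
U_expanded = 2 * u_combined                  # coverage factor k = 2 for ~95% confidence
print(f"combined u = {u_combined:.3f}, expanded U (k=2) = {U_expanded:.3f}")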
The document provides an overview of key concepts related to estimation in statistics, including:
- Estimation involves using sample data to estimate unknown population parameters. Common estimators include the sample mean, proportion, and standard deviation.
- There are two main types of estimates - point estimates and interval estimates. Point estimates are single values while interval estimates specify a range.
- The process of estimation involves identifying the parameter, selecting a random sample, choosing an estimator, and calculating the estimate.
- Estimates can differ from the true population value due to sampling error and non-sampling error. Bias occurs when the expected value of the estimate differs from the true parameter value.
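A minimal sketch of a point estimate versus an interval estimate, assuming SciPy (the sample is a simulated placeholder):

import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
sample = rng.normal(loc=50, scale=8, size=35)

mean = sample.mean()                         # point estimate
sem = stats.sem(sample)                      # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"point estimate = {mean:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")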
How predictive models help Medicinal Chemists design better drugs_webinar - Ann-Marie Roche
All scientific disciplines, including medicinal chemistry, are experiencing a revolution as data is generated at unprecedented rates, and the subsequent analysis and exploitation of this data is increasingly fundamental to innovation. Using data to design better compounds is a challenge for medicinal and computational chemists.
The design of small-molecule drug candidates, encompassing characteristics such as potency, selectivity and ADMET (absorption, distribution, metabolism, excretion and toxicity) is a key factor in the success of clinical trials and computer-aided drug discovery/design methods have played a major role in the development of therapeutically important small molecules for over three decades. These methods are broadly classified as either structure-based or ligand-based.
In this webinar our expert Dr. Olivier Barberan will discuss ligand-based methods and he will cover the following:
- How to use only ligand information to predict activity depending on its similarity/dissimilarity to previously known active ligands.
- Discuss ligand-based pharmacophores, molecular descriptors, and quantitative structure-activity relationships and important tools such as target/ligand databases necessary for successful implementation of various computer-aided drug discovery/design methods in a drug discovery campaign.
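As a minimal illustration of similarity-based ligand scoring (not the webinar's actual tooling), the Tanimoto coefficient on hypothetical fingerprint bit sets:

def tanimoto(fp_a: set, fp_b: set) -> float:
    # Tanimoto = size of intersection / size of union of the "on" fingerprint bits.
    return len(fp_a & fp_b) / len(fp_a | fp_b)

known_active = {1, 4, 7, 9, 12, 15}   # bits set for a known active ligand (hypothetical)
candidate = {1, 4, 7, 10, 12}         # bits set for a candidate molecule (hypothetical)
print(f"Tanimoto similarity = {tanimoto(known_active, candidate):.2f}")   # 4/7, about 0.57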
Clinical prediction models: development, validation and beyond - Maarten van Smeden
This document appears to be a slide deck on the topic of clinical prediction models. It discusses:
- The differences between explanatory, predictive, and descriptive models.
- Challenges with predictive models like overfitting and the need for shrinkage methods.
- Sample size criteria like events per variable (EPV) and challenges validating models with low EPV.
- Methods for validating predictive performance like apparent, internal, and external validation and quantifying optimism.
- Additional validation strategies like bootstrapping and the importance of assessing calibration.
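A simplified sketch of bootstrap optimism correction, assuming scikit-learn (data, model, and replicate count are hypothetical simplifications of the full procedure):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(11)
X = rng.normal(size=(150, 8))
y = (X[:, 0] - X[:, 1] + rng.normal(size=150) > 0).astype(int)

model = LogisticRegression().fit(X, y)
apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])   # apparent (optimistic) AUC

optimism = []
for _ in range(200):
    idx = rng.integers(0, len(y), size=len(y))              # bootstrap resample
    m = LogisticRegression().fit(X[idx], y[idx])
    boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    orig = roc_auc_score(y, m.predict_proba(X)[:, 1])       # same model on original data
    optimism.append(boot - orig)

print(f"apparent AUC = {apparent:.3f}, corrected = {apparent - np.mean(optimism):.3f}")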
This document provides an introduction to choosing regression models. It discusses basic considerations like determining the purpose of the model, choosing appropriate predictors, and whether predictors or the outcome need transformation. Temporal sequence and prior knowledge are important factors in choosing predictors. The type of data, case ascertainment, and results of model fitting also influence predictor choice. Transforming predictors or the outcome can improve the model fit in some cases. The key is using statistical tools together with experience and understanding, not as a substitute for scientific insight.
Elashoff: Approach section in grant applications - UCLA CTSI
This document provides guidance on how to write the "Approach" section of an R grant application. It discusses including preliminary data to demonstrate expertise and support for hypotheses. The study design should describe the overall design, endpoints, study population, inclusion/exclusion criteria, and measures. Sample size calculations must provide sufficient power and account for dropouts and multiple comparisons. Overall, the Approach section must convince reviewers that the study hypotheses could be true and the research team is capable of carrying out the study.
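One standard step in such calculations is inflating the computed sample size for expected dropout; a sketch with hypothetical numbers:

n_required = 128        # from the power calculation (hypothetical)
dropout_rate = 0.15     # anticipated loss to follow-up (hypothetical)

n_enrolled = n_required / (1 - dropout_rate)
print(f"enroll about {n_enrolled:.0f} participants")   # about 151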
This document discusses collecting sample data from populations. It defines key terms like population, sample, census, and observational study vs experiment. It describes different levels of data measurement and types of data. Random sampling methods like simple random sampling are described as the gold standard. Other sampling techniques including systematic, stratified, cluster, and convenience are covered. The document discusses experimental design concepts like replication, blinding, and randomization. It also addresses observational study designs and controlling variables. Sources of error in sampling like sampling error and nonresponse are identified.
SAMPLE SIZE CALCULATION IN DIFFERENT STUDY DESIGNS AT.pptx - ssuserd509321
The document discusses factors that affect sample size calculation in different study designs. It provides examples of calculating sample sizes for descriptive cross-sectional studies, case-control studies, cohort studies, comparative studies, and randomized controlled trials. The key factors discussed are the level of confidence, power, expected proportions or means in groups, margin of error, and standard deviation. Sample size is affected by the type of study design, variables being qualitative or quantitative, and the goal of establishing equivalence, superiority or non-inferiority between groups. Electronic resources are provided for calculating sample sizes.
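As a worked example of one such calculation (estimating a prevalence in a descriptive cross-sectional study; the inputs are hypothetical):

import math

z = 1.96    # z-value for 95% confidence
p = 0.30    # expected prevalence (hypothetical)
d = 0.05    # absolute margin of error (hypothetical)

n = z**2 * p * (1 - p) / d**2
print(f"n = {math.ceil(n)}")   # 323 subjects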
The document provides an overview of quantitative data analysis methods for medical research. It discusses statistical thinking, analytic methods including determining appropriate statistical tests based on the type of data and research question. It also covers designing a data collection plan, including determining data sources, capture and storage, as well as sample size determination. The key topics covered are choosing the right statistical tests, developing a robust data collection process, and ensuring adequate sample sizes.
This document summarizes quantitative data analysis techniques for summarizing data from samples and generalizing to populations. It discusses variables, simple and effect statistics, statistical models, and precision of estimates. Key points covered include describing data distribution through plots and statistics, common effect statistics for different variable types and models, ensuring model fit, and interpreting precision, significance, and probability to generalize from samples.
This document discusses sample design and the t-test. It covers the sample design process which includes defining the population, sample frame, sample size, and sampling procedure. It also discusses probability and non-probability sampling techniques. The document then explains what a t-test is and how it can be used to test for differences between two group means. It covers the assumptions, procedures, hypotheses, and interpretation of t-test results.
Similar to Overview of statistical tests: Data handling and data quality (Part II)
Here are the steps to visualize a potential indel region after realignment:
1. Run GATK IndelRealigner on the target list:
java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T IndelRealigner -R ../human_g1k_v37.fasta -I sample.dedup.bam -targetIntervals sample.intervals -o sample.realigned.bam
2. Index the realigned BAM:
samtools index sample.realigned.bam
3. Load the realigned BAM into IGV and navigate to a region of interest from the target list (sample.intervals).
4. In I
This document discusses phylogenetic analysis and tree building. It introduces the Bioinformatics and Computational Biosciences Branch (BCBB) and its work analyzing biological sequences and constructing phylogenetic trees. The document explains why it is important to analyze biological sequences and how comparing sequences reveals relatedness and evolution. It also covers multiple sequence alignment, substitution models, and tree-building algorithms, including neighbor-joining.
The webinar covered new features and updates to the Nephele 2.0 bioinformatics analysis platform. Key updates included a new website interface, improved performance through a new infrastructure framework, the ability to resubmit jobs by ID, and interactive mapping file submission. New pipelines for 16S analysis using DADA2 and quality control preprocessing were introduced, and the existing 16S mothur pipeline was updated. The quality control pipeline provides tools to assess data quality before running microbiome analyses through FastQC, primer/adapter trimming with cutadapt, and additional quality filtering options. The webinar emphasized the importance of data quality checks and highlighted troubleshooting tips such as examining the log file for error messages when jobs fail.
1) METAGENOTE is a new web-based tool for annotating genomic samples and submitting metadata and sequencing files to the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI).
2) It provides templates and controlled vocabularies to streamline sample metadata annotation using existing ontologies and standards. This allows for easier cross-study comparisons.
3) The demonstration showed how to use METAGENOTE's interface to annotate a mouse ear skin sample with terms from relevant ontologies, import additional annotations in batch, and submit the metadata and files to NCBI SRA through a 5-step wizard.
This document provides an introduction to homology modeling using computational tools like I-TASSER and Phyre2. It discusses how homology modeling can be used to generate 3D structural models of proteins when an experimental structure is not available. The document addresses common questions from users and outlines the I-TASSER modeling pipeline. Hands-on exercises are provided to allow users to run homology modeling tools and examine the resulting models.
This document summarizes different computational methods for protein structure prediction, including homology modeling, fold recognition, threading, and ab initio modeling. Homology modeling relies on identifying proteins with similar sequences and known structures. Fold recognition and threading can be used when there are no homologs, to identify proteins with the same overall fold but different sequences. Ab initio modeling uses physics-based modeling and protein fragments to predict structure from sequence alone, and has challenges due to the vast number of possible conformations.
Homology modeling is a computational technique for predicting the structure of a protein target based on its sequence similarity to proteins with known structures, and it involves finding a suitable template, aligning the target and template sequences, building a 3D model of the target, and evaluating the model quality. While experimental methods like X-ray crystallography and NMR can determine protein structures, they have limitations in terms of which proteins can be studied, so computational methods like homology modeling are needed to predict structures for the many proteins whose structures remain unknown.
The document discusses function prediction for unknown proteins. It begins with an overview of common methods for function prediction, including sequence and structure similarity, domains and motifs, gene expression, and interactions. It then uses a protein called Msa as a case study, analyzing it with various tools and finding evidence it may function as a signal transducer in bacterial response to environment. Finally, it briefly discusses another protein M46 and challenges in evaluating prediction accuracy.
This presentation discusses protein structure prediction using Rosetta. It begins with an overview of the Critical Assessment of Protein Structure Prediction (CASP) experiments and notes that Rosetta is one of the top performing free-modeling servers. The presentation then describes the basic ab initio protocol used by Rosetta, which involves fragment insertion, scoring, and refinement. It also discusses limitations and success rates. Key aspects of the Rosetta energy functions and sampling algorithms are presented. Examples of specific Rosetta applications including low-resolution modeling and refinement are provided.
This document provides an outline for a presentation on biological networks, including introducing biological networks, describing their basic components and types, methods for predicting and building networks, sources of interaction data, tools for network visualization and analysis, and a demonstration of building, visualizing and analyzing biological networks using Cytoscape. The presentation covers topics like nodes and edges in networks, features used to analyze networks, methods for predicting networks from sequences and omics data, integrated databases for interaction data, and popular tools for searching, visualizing and performing network analysis.
This document provides an overview and introduction to using the command line interface and submitting jobs to the NIAID High Performance Computing (HPC) Cluster. The objectives are to learn basic Unix commands, practice file manipulation from the command line, and submit a job to the HPC cluster. The document covers topics such as the anatomy of the terminal, navigating directories, common commands, tips for using the command line more efficiently, accessing and mounting drives on the HPC cluster, and an overview of the cluster queue system.
1) JMP is statistical software that allows for easy import, organization, and analysis of data. It features spreadsheet-like data tables, powerful statistical modeling capabilities, and customizable graphics.
2) The document reviews various features of JMP including importing data, organizing data tables, performing statistical analyses through platforms like distribution and fit model, and creating graphs and reports.
3) Assistance is available for using JMP through free training, support contacts, and detailed help menus within the software. JMP allows for both simple and advanced statistical analysis of data.
This document provides a training manual on better graphics in R. It begins with an overview of R and BioConductor and reviews basic R functions. It then covers creating simple and customized graphics, multi-step graphics with legends, and multi-panel layouts. The manual aims to help researchers learn visualization techniques to improve the communication of their data and results.
This document describes two web tools that were created using R to automate biostatistics workflows: HDX NAME and DRAP. HDX NAME analyzes hydrogen-deuterium exchange mass spectrometry data to estimate protein flexibility. It computes protection factors, compares groups, and maps results to protein structures. DRAP fits logistic dose-response curves to drug screening data from multiple plates. It automates curve fitting, compares results, and exports summaries. Both tools were created with R on the backend for analysis and web interfaces for usability. This allows researchers to perform complex analyses without programming expertise.
This document summarizes a presentation on curve fitting using GraphPad Prism. It discusses nonlinear regression techniques for analyzing dose-response and binding curve data commonly used by biologists. Specific nonlinear regression models like sigmoidal dose-response curves are described. The document provides guidance on choosing and fitting appropriate models, evaluating model fit, and improving model fit if needed.
This document provides solutions to sample problems using various datasets. It demonstrates how to use R functions like bargraph.CI(), boxplot(), hist(), and table() to analyze and visualize data. For example, it shows how to create bar charts comparing mean BMI by gender and mean AFP difference by drug concentration using the bargraph.CI() function from the sciplot package. It also provides solutions for manipulating datasets, such as recoding a variable or sorting and subsetting data.
This document provides a training manual for using R and BioConductor. It introduces R as a powerful open source software for statistical analysis and data visualization that also includes a scripting language. BioConductor is a related open source project that provides tools for analyzing genomic data using R packages. The manual then covers downloading and installing R and BioConductor, describes different interfaces for using R, and provides tutorials on basic R functions for data manipulation, graphics, statistics, and scripting.
This document summarizes features of the Prism graphing software for customizing graphs. It discusses how to customize various graph types like scatter plots, bar charts, and box plots by changing axes, colors, labels, and more. It also covers exporting graphs, cloning graphs to create new figures easily, and using templates to store common graph types and analyses. Upcoming seminars on curve fitting and statistical applications in Prism are also listed.
This document provides an overview of design of experiments (DOE). It discusses key concepts like controlled and uncontrolled inputs, response variables, experimental and sampling units, and different types of statistical designs. Specifically, it explains that a designed experiment involves planned statistical considerations to increase efficiency. It also describes completely randomized designs and randomized complete block designs as two basic statistical designs. The goal of DOE is to obtain unbiased and efficient experimental results.
More from Bioinformatics and Computational Biosciences Branch
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf - GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Overview of statistical tests: Data handling and data quality (Part II)
1. Overview of Statistical Tests II:
Data Handling and Data Quality
Presented by: Jeff Skinner, M.S.
Biostatistics Specialist
Bioinformatics and Computational Biosciences Branch
National Institute of Allergy and Infectious Diseases
Office of Cyber Infrastructure and Computational Biology
2. How Should I Handle My Data?
Three common problems:
• Building and testing a model with the same data
Stepwise model building procedures and similar methods
Not using cross-validation or similar methods
• Confusion between biological and technical replicates
Pseudo-replication
• Identification and handling of outliers
Outliers vs. high influence points
Outlier removal vs. robust statistical methods
3. Building and Testing a Model with the Same Data
• When do we encounter the problem?
Using simple tests to inform complicated tests
Using model selection techniques
• What are the negative effects?
Choosing poor models or “overfitting”
• How do we avoid these problems?
Using designed experiments
Training, Testing and Confirmation data sets
Cross-validation techniques
4. Simple Tests Inform Complex Tests
• Suppose you want to model the factors influencing the severity of some disease
• It seems sensible to test all the variables individually, then test a larger model of only the significant effects
• What are the potential problems with this method?

Variable            Test                  P-value
Region (hospital)   Chi-Square Test       0.0001
Gender              Chi-Square Test       0.073
Age                 Logistic Regression   0.0043
Weight              Logistic Regression   0.1674
Percent Body Fat    Logistic Regression   0.0623
Sodium levels       Logistic Regression   0.1049
Cholesterol         Logistic Regression   0.000495
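For concreteness, this screening step might look like the following in R; the data frame `dat` and its column names are hypothetical stand-ins for the variables in the table.

```r
# One test per variable against the binary outcome `severe`
# (hypothetical data frame `dat`); later slides explain why this
# screening step is risky.
chisq.test(table(dat$region, dat$severe))                  # Chi-square test
summary(glm(severe ~ age, data = dat, family = binomial))  # logistic regression
summary(glm(severe ~ cholesterol, data = dat, family = binomial))
```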
5. Over-fitting from Simple Tests

Individual Tests
Variable            P-value
Region (hospital)   0.4281
Gender              0.0367
Age                 0.0043
Weight              0.1674
Percent Body Fat    0.2623
Sodium levels       0.1049
Cholesterol         0.0004

Multivariate Model
Variable                     P-value
Gender                       0.0447
Age                          0.0106
Cholesterol                  0.0032
Gender * Age                 0.1872
Gender * Cholesterol         0.3388
Age * Cholesterol            0.6763
Gender * Age * Cholesterol   0.8961

• Because the variables are significant in the individual tests, they should be significant in the multivariate model
• Some results from individual tests may be false positives
• Because we use the same data to test the multivariate model, the same false positives will be found in its results
6. Simpson’s Paradox

Individual Tests
Variable            P-value
Region (hospital)   0.4281
Gender              0.5367
Age                 0.0043
Weight              0.1674
Percent Body Fat    0.2623
Sodium levels       0.1049
Cholesterol         0.0004

Multivariate Model
Variable                     P-value
Gender                       0.0447
Age                          0.0106
Cholesterol                  0.0032
Gender * Age                 0.0229
Gender * Cholesterol         0.3388
Age * Cholesterol            0.6763
Gender * Age * Cholesterol   0.8961

• Sometimes the relationship between two variables changes in the presence of a third variable. This is Simpson’s paradox
• If individual tests are used to build a multivariate model, then sometimes important variables will be omitted because their significance was obscured by an interaction effect
7. Model Selection Methods
• Goal is to identify the optimal number of variables and the best choice of variables for a multivariable model using a data set with dozens of possible variables
• Step-wise selection methods
Backwards selection: start with all variables, then remove any unneeded
Forwards selection: start with no variables, then add the best variables
Mixed selection: variables can be added or removed from model
• Best subsets or all subsets methods
Fit all possible models, then identify the best models by some criteria
8. Model Selection Criteria
• P-values of each potential X-variable
Individual p-values are highly sensitive to other variables
Individual p-values don’t really test the hypothesis of interest
• R² and adjusted R²
Represent the percent of variation explained by the model
Meaningless or misleading if model assumptions are not met
• Akaike’s Information Criterion (AIC)
Computed as AIC = 2k − 2 ln(L), where k is the number of parameters and L is the maximized likelihood
• Mallows’ Cp
Computed as Cp = SSEp / MSEk − n + 2p, where SSEp comes from the subset model with p parameters and MSEk from the full model
Intended to address the issue of model over-fitting (see the R sketch below)
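As a minimal sketch, both AIC computation and AIC-based selection are available in base R; the model below reuses the hypothetical `dat` from the earlier example.

```r
# AIC = 2k - 2*ln(L); step() performs backwards selection by AIC.
full <- glm(severe ~ region + gender + age + weight + cholesterol,
            data = dat, family = binomial)
AIC(full)                           # smaller is better
step(full, direction = "backward")  # drop variables while AIC improves
```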
9. Model Selection Methods
• Model selection methods find the optimal variables for a multivariate model
Optimal number of variables
Identity of the variables
• Model selection methods sometimes use p-values as selection criteria, but these p-values should not be used for hypothesis tests
10. Problems With Model Selection
• P-values do not test the real hypothesis of interest
Model selection seeks to identify the optimal number of variables
H0: k = 0 vs. Ha: k > 0, where k = # variables
Individual p-values are computed for all possible combinations of variables, most of which are not in the final model
• Individual p-values are computed from multiple tests
Individual p-values would need a strict adjustment for multiple testing (sketched below)
Final p-values unlikely to be statistically significant
• Data driven hypotheses
It is unfair to peek at the data, then only test the largest differences
More likely to generate false positives
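For illustration, here is what a strict adjustment of the seven screening p-values from the slide 4 table looks like in R:

```r
# Family-wise (Bonferroni) and false-discovery-rate (BH) adjustments
# of the univariate screening p-values shown earlier.
p <- c(0.0001, 0.073, 0.0043, 0.1674, 0.0623, 0.1049, 0.000495)
p.adjust(p, method = "bonferroni")
p.adjust(p, method = "BH")
```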
11. Data Mining Analyses
• Make predictions from VERY LARGE data sets
Microarray and next generation sequencing (NGS) data
Large databases of clinical or medical records
Credit, banking and financial data
• Special classification models used to accommodate large sample sizes or large numbers of variables
Classification and regression trees (CART)
K-nearest neighbors (KNN) methods
Neural nets, support vector machines (SVM), …
12. Training a Data Mining Model
• Researchers often want to compare several data mining methods to find the best classifier
CART methods versus KNN methods
SVM versus neural nets
• Many data mining models have parameters that must be optimized for each problem
How many branches or splits for a CART?
How many neighbors for KNN?
13. An Example from Data Mining
[Figure: a classifier fit to the data. Training data: misclassifies 2 data points. Test data: misclassifies 6 data points.]
14. An Example from Data Mining
[Figure: another classifier fit to the same data. Training data: misclassifies 0 data points. Test data: misclassifies 5 data points.]
15. How Do We Avoid Problems?
Divide our data into two or three groups:
• Training data
Build a model using individual tests or model selection
Train a data mining model to identify optimal parameters
• Test data
Evaluate the model built with the training data
Perform hypothesis tests
• Confirmation data
Evaluate the model built with the training data
Confirm findings from Test data set
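A minimal sketch of such a split in base R, assuming a 60/20/20 division; `dat` and the proportions are illustrative choices, not part of the slide.

```r
set.seed(42)                # make the random split reproducible
n   <- nrow(dat)
idx <- sample(seq_len(n))   # random permutation of the row indices
train   <- dat[idx[1:floor(0.6 * n)], ]
test    <- dat[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
confirm <- dat[idx[(floor(0.8 * n) + 1):n], ]
```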
16. Cross-validation Methods
• Divide data into slices, then train and test models
Train model with slice #1, test with slices 2, …, 8
Train model with slice #2, test with slices 1, 3, …, 8
…
Train model with slice #8, test with slices 1, …, 7
• Compile results to evaluate the fit of all 8 models
[Figure: a data set divided into 8 numbered slices, each labeled Train or Test.]
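The sketch below implements the more common variant of 8-fold cross-validation, which trains on seven slices and tests on the single held-out slice (the reverse of the slide's wording); `dat` and the formula `y ~ x` are hypothetical.

```r
set.seed(1)
k    <- 8
fold <- sample(rep(1:k, length.out = nrow(dat)))    # assign each row to a slice
cv_mse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = dat[fold != i, ])        # train on the other slices
  pred <- predict(fit, newdata = dat[fold == i, ])  # predict the held-out slice
  mean((dat$y[fold == i] - pred)^2)                 # test error for this fold
})
mean(cv_mse)  # compiled estimate of out-of-sample error
```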
17. Biological or Technical Replicates?
• How do I analyze data if I pool samples?
• How do I analyze data if I use replicate samples?
• What if I take multiple measurements from the same patient or subject?
• What if I run experiments on a cell line?
18. Experimental Units vs. Sampling Units
• A treatment is a unique combination of all the factors and covariates studied in your experiment
• The experimental unit (EU) is the smallest entity that can receive or accept one treatment combination
• The sampling unit (SU) is the smallest entity that will be measured or observed in the experiment
• Experimental and sampling units are not always the same
19. Example: EU and SU are the Same
• Suppose 20 patients have the common cold
10 patients are randomly chosen to take a new drug
10 patients are randomly chosen for the placebo
Duration of their symptoms (hours) is the response variable
• EU and SU are the same in this experiment
Drug and placebo treatments are applied to each patient
Each patient is sampled to record their duration of symptoms
Therefore EU = patient and SU = patient
20. Example: EU and SU are different
• 20 flowers are planted in individual pots
10 flowers are randomly chosen to receive dry fertilizer pellets
10 flowers are randomly chosen to receive liquid fertilizer
All six petals are harvested from each flower and petal length is measured as the response variable
• EU and SU are different in this experiment
Fertilizer treatment is applied to the individual plant or pot
Measurements are taken from individual flower petals
Therefore EU = plant and SU = petal (pseudo-replication)
21. Pseudo-Replication
• Confusion between EU’s and SU’s can artificially inflate sample sizes and artificially decrease p-values
E.g. It is tempting to treat each flower petal as a unique sample (n = 6 x 20 = 120), but the petals are pseudo-replicates (see the R sketch below)
“Pseudoreplication and the Design of Ecological Field Experiments” (Hurlbert 1984, Ecological Monographs)
• Pooling samples can create pseudo-replication problems
E.g. 12 fruit flies are available for a microarray experiment, but must pool flies into 4 groups of 3 flies each to get enough RNA
Once data are pooled, it is not appropriate to analyze each individual separately in the statistical model
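In R, the flower example can be handled by collapsing the six petals (SUs) to one mean per plant (EU) before testing; the data frame `petals` and its columns are hypothetical.

```r
# Average SUs within each EU, then test on n = 20 plants, not 120 petals.
plant_means <- aggregate(length ~ plant + fertilizer, data = petals, FUN = mean)
t.test(length ~ fertilizer, data = plant_means)
```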
22. Biological vs. Technical Replication
• Sometimes, experiments use multiple EU’s to investigate multiple sources of error with a statistical model
E.g. When measurements are inaccurate, you want to estimate variation between subjects and multiple measurements
E.g. To evaluate the precision of 2 lie detector machines, you could test 6 subjects measured by 4 technicians each in repeated measurements
Subject and machine effects have EU = subject (biological replicates), but the technician effect has EU = measurement (technical replicates)
• These kinds of experiments must be analyzed with appropriate statistical methods
Split-plot methods evaluate multiple EU’s in one model
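One possible implementation (a mixed-model alternative to the classical split-plot analysis) uses the lme4 package; the data frame `lie` and its columns are hypothetical.

```r
library(lme4)
# Machine is the fixed effect of interest; the random intercepts separate
# subject-level (biological) from technician-level (technical) variation.
fit <- lmer(score ~ machine + (1 | subject) + (1 | technician), data = lie)
summary(fit)
```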
23. No Biological Replication?
• Sometimes experiments have no biological replicates
Experiments with cell lines (e.g. cancer cell lines)
Experiments with purified proteins, DNA, macromolecules
Experiments with bacteria, viruses or pathogens???
• Be very careful when you interpret results
Technical replicates represent the precision of your methods
Significant results apply to your specific sample
Results may not extend to larger populations
24. An Illustrative Example
[Diagram: 4 batches of vaccine dumped into one “pool”; a single sample from the pool is tested in ten egg assays.]
• Does the experiment have any replication?
Biological replication? No. Four batches dumped into one pool.
Technical replication? Yes. Ten assays used to detect contamination.
• What can we make inferences about?
Population of all vaccine batches? No. No biological replication.
Contamination of the single sample? Yes. Ten technical replicates used.
Contamination of this specific pool? Maybe.
Contamination of these specific batches? Maybe.
25. What Is An Outlier?
• An outlier is an observation (i.e. sampling unit) that does not belong to the population of interest
Outliers can and should be legitimately removed from the analysis
Identifying outliers is a biological question, not a statistical question
• A high influence point is an observation that has a large impact on the fit of your statistical model
High influence points might be outliers or legitimate data
Several methods to identify and handle high influence points
26. Examples of Outliers
• Errors, glitches, typos and “non-data”
Bubbles or bright spots on a microarray
Typos from medical chart (e.g. age = 334)
• Legitimate samples, but out of scope
Patients with comorbidities or other conditions (e.g. diabetes patient in an AIDS study)
27. Examples of High Influence
• High leverage points
Observations with extreme combinations of predictor and response variables (i.e. outskirts of the design space)
Identified using leverage plots
• Large residuals
Represent a large difference between the predicted value from the model and the observed value from the sample
Large residual = poor model fit for that value
• Large influence on model fit
Remove the value and the model changes dramatically
28. High Leverage Points
Leverage: hii = xi′(X′X)⁻¹xi
• We expect no relationship between hat size and IQ
• A single observation can change the slope of the line
Hat size = 38, IQ = 190
• Extreme combinations of X and Y variables produce high influence over the analysis
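A sketch of the hat-size example in R; every data value except the extreme (38, 190) point is invented for illustration.

```r
hat_size <- c(21, 22, 22, 23, 23, 24, 24, 25, 38)
iq       <- c(98, 105, 101, 110, 95, 102, 99, 104, 190)
fit_iq <- lm(iq ~ hat_size)
hatvalues(fit_iq)             # h_ii for each observation
which.max(hatvalues(fit_iq))  # the (38, 190) point has by far the highest leverage
```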
29. Leverage Plots
• Red “confidence curves” identify significant leverage
Curves that completely overlap the blue line are not significant
Curves that largely do not overlap the blue line have significant leverage
• If leverage is problematic, respond carefully
Identify and remove any outliers, if they exist
Consider alternative models, variable transformations, weighting, etc.
31. Residuals
• Residuals = Observed − Predicted
Also called “errors”
ei = Yi − Ŷi
• Represent the unexplained variation
Should be independent, identically distributed and random
Overall trend in residuals represents model fit
Large individual residuals may represent high influence
Several different computations for residuals exist
32. Residuals Plot
• Residuals vs. X variable
Evaluate model fit relative to one predictor variable
Suspect one variable fits poorly in multivariable model
• Residuals vs. Predicted values
Evaluate model fit with respect to the entire model
Good if you want a single plot for multivariable model
• Residuals vs. omitted X variable
Interesting trends if important variable was omitted
33. Good Model Fit
• Expect a rectangular or oval shaped cloud of residuals
• Residuals vs. X variable used to evaluate independence
E.g. Do we need to model a curved relationship with Age?
• Residuals vs. predicted used to evaluate assumption of identically distributed errors
E.g. Non-constant variance
E.g. Larger errors with higher response values
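Both plots are one-liners in base R; this sketch reuses the hypothetical `fit_iq` model from the leverage example.

```r
# Residuals vs. predicted values: look for a rectangular or oval cloud.
plot(fitted(fit_iq), resid(fit_iq),
     xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Residuals vs. one X variable: check for curvature or changing spread.
plot(hat_size, resid(fit_iq), xlab = "Hat size", ylab = "Residuals")
```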
37. Alternative Residual Computations
• Studentized residuals
Divide each residual by the estimate of the standard deviation
Easier to identify high influence points (e.g. > 3 s.d. away from mean)
• Deleted residuals
Compute residual after deleting one observation
Evaluate the effect of one observation on model fit
• Deviance or Pearson residuals
Computed for categorical response models (e.g. logistic regression)
Often do not follow typical trends of residuals from linear models
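All of these are built into R; `fit_iq` is the linear model from above, and `fit_logistic` stands in for a hypothetical logistic regression.

```r
rstandard(fit_iq)  # internally studentized residuals
rstudent(fit_iq)   # externally studentized (deleted) residuals
residuals(fit_logistic, type = "deviance")  # deviance residuals for a glm
residuals(fit_logistic, type = "pearson")   # Pearson residuals for a glm
```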
40. Other Indicators of High Influence
• DFFITS
Influence of single point on single fitted value
Look for DFFITS > 1 for small n or DFFITS > 2·sqrt(p/n) for large n
• DFBETAS
Influence of single point on regression coefficients
Look for DFBETAS > 1 for small n or DFBETAS > 2 / sqrt(n) for large n
• Cook’s Distance
Influence of single point on all fitted values
Compare against F(p, n − p) distribution
See Kutner, Nachtsheim, Neter and Li. 2005. Applied Linear Statistical Models for more details.
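R reports all three measures for any linear model; this sketch again uses the hypothetical `fit_iq`.

```r
dffits(fit_iq)          # influence of each point on its own fitted value
dfbetas(fit_iq)         # influence of each point on each coefficient
cooks.distance(fit_iq)  # influence of each point on all fitted values
summary(influence.measures(fit_iq))  # flags the points R marks as influential
```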
41. Solutions
• Remove high influence points if they may be outliers
• Fit a completely new model to the data
• Transform variables
Transform X to change relationship between X and Y
Transform Y to change distribution of model errors
• Use a weighting scheme to reduce their influence
Use wi = 1 / sdi for non-constant variance
Use wi = 1 / Yi² or wi = 1 / Xi² to weight regions of plot
42. Log-transform X
• Relationship between X and Y changes
• May reduce impact of some high influence points
44. Weighting Schemes
• Use wi = 1 / sdi for non-constant variance
• Use wi = 1 / Yi² or wi = 1 / Xi² to weight regions of plot
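A sketch of both remedies (transformation and weighting) in base R, assuming a data frame `dat` with columns x and y:

```r
fit_log <- lm(y ~ log(x), data = dat)  # transform X to change the X-Y relationship
fit_w   <- lm(y ~ x, data = dat,
              weights = 1 / dat$x^2)   # w_i = 1/X_i^2 down-weights high-X regions
```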
45. Thank You
For questions or comments please contact:
ScienceApps@niaid.nih.gov
301.496.4455