This document defines key concepts in biological data analysis including:
- Types of data like primary, secondary, continuous, and discrete data
- Data measurement scales like nominal, ordinal, and ratio scales
- Graphical representations of data like histograms, bar graphs, box plots, and frequency polygons
- Concepts of population, sampling methods like random and non-random sampling, and measures of central tendency like mean, median, and mode.
This document discusses confidence intervals for population means and proportions. It explains how to construct confidence intervals using the normal distribution for large sample sizes (n ≥ 30) and the t-distribution for small sample sizes. Formulas are provided for calculating margin of error and determining necessary sample size. Guidelines are given for determining whether to use the normal or t-distribution based on sample size and characteristics. Confidence intervals can be constructed for variance and standard deviation using the chi-square distribution.
Here are the steps to find the quartiles for this data set:
1. Order the data from lowest to highest: 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7
2. The number of observations is n = 14. To find the quartiles, we split the ordered data into four equal parts.
3. Q2 is the median of all the data, the average of the 7th and 8th observations: (3 + 3)/2 = 3
4. Q1 is the median of the lower half (the first 7 observations), which is the 4th observation: 2
5. Q3 is the median of the upper half (the last 7 observations), which is the 11th observation overall: 5
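The steps above can be sketched in Python. This is a minimal illustration of the median-of-halves method; note that quartile conventions differ slightly between textbooks and software packages.

```python
def median(values):
    """Median of an already-sorted list."""
    n = len(values)
    mid = n // 2
    if n % 2:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2

def quartiles(data):
    """Q1, Q2, Q3 via the median-of-halves method."""
    s = sorted(data)
    n = len(s)
    lower = s[: n // 2]         # observations below the median position
    upper = s[(n + 1) // 2 :]   # observations above the median position
    return median(lower), median(s), median(upper)

data = [1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7]
print(quartiles(data))  # (2, 3.0, 5)
```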
Statistics is the science of collecting, organizing, analyzing, and drawing conclusions from data. Biostatistics applies statistical methods to biological topics like public health, clinical trials, genetics, and ecology. Descriptive statistics summarizes and presents data, while inferential statistics allows generalizing from samples to populations through hypothesis testing, determining relationships among variables, and making predictions. Key concepts include data, variables, populations, samples, measurement scales, and sampling methods. Common graphs for presenting data include histograms, bar charts, line graphs, and pie charts.
This document discusses confidence intervals, which provide a range of values that is likely to include an unknown population parameter based on a sample statistic. It defines key concepts like confidence level, confidence limits, and factors that determine how to set the confidence interval like sample size, population variability, and precision of values. It explains how larger sample sizes and more precise measurements result in narrower confidence intervals. Applications to clinical trials are discussed, showing how sample size impacts the ability to make definitive recommendations based on trial results.
This document provides an introduction to biostatistics. It defines key concepts such as statistics, data, variables, populations, and samples. It discusses different types of variables including quantitative and qualitative variables. It also describes different measurement scales including nominal, ordinal, interval and ratio scales. Sources of data and descriptive statistics are introduced. Descriptive statistics help summarize and organize data using tables, graphs, and numerical measures.
The document contains multiple choice questions about biostatistics and public health dentistry. It includes questions about measures of central tendency such as mean, median, and mode. Other questions address correlation, scales of measurement, random sampling, statistical significance, and skewed distributions. The questions are intended as a learning tool for dental students and professionals.
This document provides an introduction and overview of biostatistics. It defines key biostatistics terms like population, sample, parameter, statistic, quantitative vs. qualitative data, levels of measurement, descriptive vs. inferential biostatistics, and common statistical notations. It also discusses sources of health information and how computerized health management information systems are used to collect, analyze and report data.
This document discusses various measures used to quantify disease frequency in epidemiology. It describes measures of morbidity including incidence, prevalence, and disability rates. Incidence measures new cases over time while prevalence measures total current cases. Disability rates quantify limitations in activities. Measures of mortality are also presented, such as crude death rate, case fatality rate, and standardized mortality ratio. Standardization adjusts for differences in population characteristics to allow valid comparisons. Overall, the document provides an overview of key epidemiological metrics for quantifying disease burden and guiding public health efforts.
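As a sketch, the basic frequency measures named above reduce to simple ratios. The function names, hypothetical counts, and the per-1,000 scaling below are illustrative, not taken from the source.

```python
def prevalence(existing_cases, population):
    """Proportion of a population with the disease at a point in time."""
    return existing_cases / population

def incidence_rate(new_cases, person_time):
    """New cases per unit of person-time at risk."""
    return new_cases / person_time

def case_fatality_rate(deaths, cases):
    """Proportion of diagnosed cases that die of the disease."""
    return deaths / cases

def crude_death_rate(deaths, population, per=1000):
    """All-cause deaths per `per` persons per year."""
    return deaths / population * per

# Hypothetical figures for a town of 10,000
print(prevalence(50, 10_000))        # 0.005
print(crude_death_rate(80, 10_000))  # 8.0
```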
This document provides an overview of basic measurements used in epidemiology. It discusses tools like proportion, rate, and ratio. It also covers various measures of mortality like crude death rate, specific death rate, and proportional mortality rate. Measures of morbidity like incidence and prevalence are explained. The relationship between incidence and prevalence is described. Standardization techniques are introduced to make rates comparable between populations.
- Confidence intervals provide an estimated range of values that is likely to include an unknown population parameter, such as a mean, with a specified degree of confidence.
- The margin of error depends on the sample size, standard deviation, and confidence level, with a larger sample size and smaller standard deviation yielding a smaller margin of error.
- When the sample size is small, a t-distribution rather than normal distribution is used to construct the confidence interval due to the unknown population standard deviation. The t-distribution is wider than the normal and accounts for additional uncertainty from an unknown standard deviation.
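A large-sample (normal-based) confidence interval can be sketched with the standard library; for a small sample one would substitute a t critical value (e.g. from scipy.stats.t.ppf), which widens the interval as described above. The sample data here are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def normal_ci(sample, confidence=0.95):
    """Large-sample CI for the mean: x-bar +/- z * s / sqrt(n)."""
    n = len(sample)
    xbar = mean(sample)
    s = stdev(sample)                               # sample standard deviation
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * s / sqrt(n)
    return xbar - margin, xbar + margin

sample = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.0] * 3  # n = 30
lo, hi = normal_ci(sample)
print(lo, hi)  # a narrow interval around 5.0
```

Because the margin of error shrinks with sqrt(n), quadrupling the sample size roughly halves the interval width.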
This document provides an introduction to biostatistics. It defines biostatistics as applying statistics to biology, medicine, and public health. Some key points covered include:
- Francis Galton is considered the father of biostatistics.
- There are two main types of data: primary data collected directly and secondary data collected previously.
- Variables can be qualitative (categorical) or quantitative (numeric).
- Biostatistics is applied in areas like medicine, public health, and research to analyze data and draw conclusions.
- Common sources of health data include censuses, vital records, surveys, and hospital/disease records.
The document discusses key concepts in public health methodologies and biostatistics. It defines data as facts that can be processed by computers. Statistics is described as the study of collecting, summarizing, analyzing and interpreting data. Biostatistics applies statistical techniques to health-related fields like medicine. Descriptive statistics refers to methods used to describe data, while inferential statistics are used to draw conclusions from numeric data. Variables, grouped vs. ungrouped data, and types of variables are also outlined.
Measures of association like the relative risk (RR) and odds ratio (OR) quantify the strength of association between an exposure and a disease. An RR or OR of 1 means no association, above 1 a positive association, and below 1 a negative (protective) association. The RR compares outcomes between exposed and unexposed groups in cohort studies, while the OR provides an estimate of the RR in case-control studies. Confidence intervals describe the precision of a point estimate, with a narrower interval indicating a more precise estimate. Whether a 95% CI includes 1 determines whether the association is statistically significant: an interval that excludes 1 indicates a significant association at the 5% level.
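As a sketch, both measures come from a 2x2 table with hypothetical counts a = exposed cases, b = exposed non-cases, c = unexposed cases, d = unexposed non-cases:

```python
def relative_risk(a, b, c, d):
    """RR: risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

def odds_ratio(a, b, c, d):
    """OR: the cross-product ratio (a*d) / (b*c)."""
    return (a * d) / (b * c)

# Hypothetical cohort: 20/100 exposed and 10/100 unexposed develop disease
print(relative_risk(20, 80, 10, 90))  # 2.0
print(odds_ratio(20, 80, 10, 90))     # 2.25
```

When the disease is rare, b and d dominate the margins and the OR approximates the RR, which is why case-control studies can use it as a stand-in.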
This document defines key concepts related to hypothesis testing. It explains that a hypothesis is a temporary assumption made to conduct research. The two main types of hypotheses are the null hypothesis, which assumes no difference or relationship, and the alternative hypothesis, which assumes a significant difference or relationship. The document outlines the steps to test a hypothesis, which include setting the hypotheses, significance level, test statistic, critical region, calculating and comparing test statistics, and making a decision to accept or reject the null hypothesis. Common test statistics and significance levels are also defined.
This document provides an overview of biostatistics and research methodology. It defines key statistical terms and concepts, describes methods of data collection and presentation, discusses sampling and different sampling methods, and outlines the steps in research including defining a problem, developing objectives and hypotheses, collecting and analyzing data, and interpreting results. Common statistical analyses covered include measures of central tendency, dispersion, significance testing, correlation, and regression.
Hypothesis testing and estimation are used to reach conclusions about a population by examining a sample of that population. Hypothesis testing is widely used in medicine, dentistry, health care, biology, and other fields as a means of drawing conclusions about the nature of populations.
This document provides an introduction to biostatistics. It defines biostatistics as the development and application of statistical techniques to scientific research relating to human, plant, and animal life, with a focus on human life and health. It discusses the collection, organization, presentation, analysis, and interpretation of numerical data, which are the key components of statistics. Finally, it describes different types and measurement scales of data.
The document provides an overview of hypothesis testing. It begins by defining a hypothesis test and its purpose of ruling out chance as an explanation for research study results. It then outlines the logic and steps of a hypothesis test: 1) stating hypotheses, 2) setting decision criteria, 3) collecting data, 4) making a decision. Key concepts discussed include type I and type II errors, statistical significance, test statistics like the z-score, and assumptions of hypothesis testing. Factors that can influence a hypothesis test like effect size, sample size, and alpha level are also covered.
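A one-sample z test illustrates the steps above in miniature; the numbers are hypothetical and the test assumes a known population standard deviation.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """z = (x-bar - mu0) / (sigma / sqrt(n)); two-sided decision at level alpha."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p, p < alpha                  # True means: reject H0

# H0: mu = 100 vs H1: mu != 100, with x-bar = 103, sigma = 15, n = 36
z, p, reject = one_sample_z_test(103, 100, 15, 36)
print(z, round(p, 3), reject)  # 1.2 0.23 False
```

Here the observed difference is plausible under chance alone (p > 0.05), so the null hypothesis is not rejected; a larger sample or larger effect would shrink the p-value.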
This document discusses key concepts in statistics. It defines statistics as the science of making decisions and drawing conclusions from data with uncertainty. It also defines key terms like population, sample, parameter, variable, observation, and data. The document outlines different types of variables, including quantitative and qualitative variables. It also describes different scales of measurement used in statistics, from nominal to ratio scales.
This document discusses a one-way analysis of variance (ANOVA) used to compare the effects of different oil types (A, B, C) on car mileage. It tests the null hypothesis that the mean mileages are equal against the alternative that at least two means differ. The ANOVA calculates sums of squares and F statistics to determine if there are significant differences between the treatment means, rejecting the null hypothesis if F exceeds the critical value. If differences exist, pairwise comparisons estimate the size of differences between each pair of means using confidence intervals.
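The F statistic described can be computed by hand; a minimal sketch with hypothetical mileage data for oils A, B, and C:

```python
def one_way_anova_f(groups):
    """F = MS_between / MS_within for a list of samples (one per treatment)."""
    k = len(groups)                          # number of treatments
    n = sum(len(g) for g in groups)          # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical mileages (km/l) under oils A, B, C
f = one_way_anova_f([[25, 26, 27], [28, 29, 30], [31, 32, 33]])
print(f)  # 27.0
```

With k = 3 treatments and n = 9 observations, this F value would be compared against the critical value of the F(2, 6) distribution at the chosen significance level.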
This document discusses epidemiological modeling of infectious diseases. It describes several common deterministic compartmental models, including SIR, SIS, and SEIR models. These models divide the population into compartments based on disease status, such as susceptible, infected, and recovered. The models are formulated using systems of differential equations to capture the flows between compartments over time. The basic reproduction number is used to determine when herd immunity is achieved in a population through vaccination. More complex models incorporate additional factors like latent periods, temporary immunity, and age structure.
This document discusses key concepts in research methods and biostatistics, including hypothesis testing, random error, p-values, and confidence intervals. It explains that hypothesis testing involves determining if study findings reflect chance or a true effect. The p-value represents the probability of observing results as extreme or more extreme than what was observed by chance alone. A p-value less than 0.05 indicates statistical significance. Confidence intervals provide a range of values that are likely to contain the true population parameter.
Biostatistics is the application of statistics to biological and medical data. It plays an integral role in modern medicine by analyzing data to determine treatment efficacy and develop clinical trials. A landmark study in biostatistics was the Framingham Heart Study, which through longitudinal data collection and analysis identified major risk factors for cardiovascular disease and influenced our current understanding of heart disease as a leading cause of death. Biostatistics obtains, analyzes, and interprets quantitative medical data to further human health.
A histogram is a graph that uses vertical rectangles to show the frequency distribution of data across ranges of values or intervals. It displays frequencies on the y-axis and class intervals on the x-axis. Each rectangle's width represents a class interval, and its height shows the frequency of observations within that interval. Histograms simplify complex data and provide an easy way to visualize patterns and compare multiple data sets.
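The frequency-per-interval idea behind a histogram can be sketched with a counter; the data set and class width below are illustrative.

```python
from collections import Counter

def interval_frequencies(values, width):
    """Frequency of observations in each class interval [k*w, (k+1)*w)."""
    return Counter(int(v // width) * width for v in values)

data = [1, 3, 4, 7, 8, 8, 9, 12, 14]
for start, freq in sorted(interval_frequencies(data, 5).items()):
    # each bar's height is the frequency of its class interval
    print(f"{start:2d}-{start + 4:2d}: {'#' * freq}")
```

This prints a crude sideways histogram with intervals 0-4, 5-9, and 10-14; a plotting library would draw the same counts as vertical rectangles.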
This document discusses various study designs used in epidemiology including experimental, observational, and survey designs. Experimental designs include trials that systematically study disease treatment or prevention effects under controlled conditions. Observational designs include cohort studies, case-control studies, cross-sectional studies, and case-crossover studies that observe groups without experimental manipulation. Survey designs examine disease aggregates through cross-sectional or longitudinal population surveys, screening programs, and disease monitoring and surveillance systems.
This document provides an introduction to biostatistics. It discusses how biostatisticians are sometimes portrayed inaccurately as dull or boring. It also outlines some of the challenges in interpreting statistical results and medical data. The document introduces key concepts in biostatistics including populations and samples, descriptive statistics, variables, and approaches to summarizing data visually and numerically.
This document provides an overview of the SIR model for modeling epidemics. The SIR model divides a population into three compartments: susceptibles (S), infecteds (I), and recovereds (R). Susceptibles become infected through contact with infecteds at transmission rate β, and infecteds move to the recovered compartment at recovery rate γ. The basic reproduction number R0 represents the average number of secondary infections caused by a single infective in a fully susceptible population and determines whether an epidemic can occur. The SIR model and its variations are useful tools for understanding disease transmission dynamics and evaluating prevention strategies.
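The compartment flows can be sketched with a simple forward-Euler step over the SIR equations; the population is expressed as fractions, and β, γ, and the step size are illustrative values, not from the source.

```python
def sir_step(s, i, r, beta, gamma, dt):
    """One forward-Euler step of dS/dt = -bSI, dI/dt = bSI - gI, dR/dt = gI."""
    new_infections = beta * s * i
    recoveries = gamma * i
    return (s - new_infections * dt,
            i + (new_infections - recoveries) * dt,
            r + recoveries * dt)

# Illustrative parameters: R0 = beta / gamma = 2, so the epidemic can grow
s, i, r = 0.99, 0.01, 0.0        # fractions of the population
beta, gamma, dt = 0.5, 0.25, 0.1
for _ in range(1000):            # simulate out to t = 100
    s, i, r = sir_step(s, i, r, beta, gamma, dt)
# by now the epidemic has burned out: i is near 0, and s + i + r is still 1
```

Because the three flows cancel, the total population fraction is conserved at every step, which is a quick sanity check on any compartmental implementation.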
Wynberg Girls High (Jade Gibson): Maths Data Analysis and Statistics
The document discusses different types of data and methods for analyzing and displaying data. It describes quantitative and qualitative data, discrete and continuous data. It also explains various methods for interpreting data including pictorial methods like graphs and arithmetic methods like measures of central tendency and dispersion. Specific graphs and measures discussed include histograms, bar graphs, mean, median, mode, range, percentiles, quartiles, and interquartile range. The document also cautions about potential ways that graphs and statistics can be misleading.
This document provides an overview of measures of central tendency including the mean, median, and mode. It discusses how to calculate and interpret each measure using examples with data sets. The mean is calculated by adding all values and dividing by the total number. The median is the middle value when data is arranged in order. The mode is the value that occurs most frequently. Other measures discussed include the midrange and calculating the mean from a frequency distribution. Proper rounding of measures is also covered.
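The three measures, plus the midrange mentioned above, can be computed directly; a sketch with an illustrative data set:

```python
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7, 10]

print(mean(data))    # 5    (sum 30 divided by count 6)
print(median(data))  # 4.0  (average of the middle pair, 3 and 5)
print(mode(data))    # 3    (most frequent value)

midrange = (min(data) + max(data)) / 2
print(midrange)      # 6.0  ((2 + 10) / 2)
```

Note that for an even number of observations the median averages the two middle values, exactly as described above.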
These ppts are designed only for educational purposes only.
All the rights are reserved to Rj Prashant
These PPTs are giving a general idea about educational statistics.
This document discusses various statistical methods used to organize and interpret data. It describes descriptive statistics, which summarize and simplify data through measures of central tendency like mean, median, and mode, and measures of variability like range and standard deviation. Frequency distributions are presented through tables, graphs, and other visual displays to organize raw data into meaningful categories.
This document provides an overview of statistics, including definitions, types of data, methods of presenting data, and common statistical measures. It defines statistics as the science of collecting, analyzing, and interpreting numerical data. There are two types of data: primary data collected directly by researchers and secondary data obtained from other sources. Common ways to present raw data include frequency distributions using tables or graphs such as bar graphs, histograms, and frequency polygons. The document also defines important statistical measures such as the median, mode, and mean.
The document discusses various concepts in economic statistics including:
- The meaning and functions of economic statistics which involves collecting, organizing, analyzing, and interpreting economic data.
- Types of statistical data based on scale of measurement (nominal, ordinal, interval, ratio), time reference (time series, cross-sectional, pooled, panel), and sources (primary, secondary).
- Methods for presenting quantitative data like frequency distributions, histograms, frequency polygons, and ogives. Qualitative data can be presented using bar charts, categorical distributions, and pie charts.
This document provides an overview of key concepts in statistics including measures of central tendency (mean, median, mode), measures of dispersion (variance, standard deviation), and central moments (skewness, kurtosis). It discusses calculating and comparing the mean, median, mode, and how they each describe the central position of a data distribution. It also explains how variance and standard deviation measure how spread out the data is from the mean. The document is intended as a textbook for students and general readers to learn basic statistical concepts.
This document provides an introduction to statistics. It defines statistics and discusses different types of data, including qualitative and quantitative data. It also explains various measures used to analyze and describe data, such as measures of central tendency (mean, median, mode), measures of dispersion (range, quartile deviation, mean deviation, standard deviation), and how to calculate mean deviation and standard deviation for both grouped and ungrouped data. Frequency polygons are introduced as a graphical way to represent frequency distributions of data.
This document provides an overview of key concepts in statistics including:
- Statistics involves collecting, organizing, analyzing, and interpreting numerical data.
- There are two main types of statistics: descriptive and inferential.
- Data can be categorical or quantitative. Common measures of central tendency are the mean, median, and mode.
- There are different sampling methods like random, stratified, and cluster sampling.
- Data is often organized and displayed using tables, graphs like histograms, bar charts and pie charts.
This document provides an overview of descriptive statistics and statistical concepts. It discusses topics such as data collection, organization, analysis, interpretation and presentation. It also covers frequency distributions, measures of central tendency (mean, median, mode), measures of variability (range, variance, standard deviation), and hypothesis testing. Hypothesis testing involves forming a null hypothesis and alternative hypothesis, and using statistical tests to either reject or fail to reject the null hypothesis based on sample data. Common statistical tests include ones for comparing means, variances or proportions.
This document provides an overview of biostatistics in orthodontics. It discusses topics like introduction to biostatistics, application and uses of statistics in orthodontics, methods of collecting and presenting data, measures of central tendency and dispersion, sampling techniques, and types of statistical tests. The key applications of statistics in orthodontics are to evaluate literature and prepare residents for lifelong learning by enabling them to understand statistical methodology used in research publications. It also describes various methods of presenting collected quantitative and qualitative data through tables, graphs, diagrams, and charts.
This document provides an overview of chapter 2 from an elementary statistics textbook. It covers exploring and organizing data using frequency distributions, histograms, graphs, scatterplots, and other methods. The objectives are to organize data using frequency distributions and represent data graphically. It defines key terms like population, sample, parameter, and statistic. It also describes procedures for constructing frequency distributions and calculating cumulative frequencies. Examples are provided to demonstrate how to organize various data sets into frequency distributions.
MSC III_Research Methodology and Statistics_Descriptive statistics.pdfSuchita Rawat
This document discusses key concepts in research methodology and statistics. It defines statistics as dealing with the collection, analysis, and interpretation of quantitative and qualitative data. It then discusses various types of graphs used to visually represent data, such as bar graphs, pie charts, histograms, boxplots, and scatterplots. It also defines common measures of central tendency (mean, median, mode), dispersion (range, variance, standard deviation, IQR), and skewness.
This document defines key concepts in statistics such as different types of data, measures of central tendency, and measures of dispersion. It discusses ungrouped and grouped data, and defines discrete and continuous frequency distributions. Measures of central tendency explained include the mean, median, and mode. Measures of dispersion defined are range, mean deviation, standard deviation, and coefficient of variation. The coefficient of variation is presented as a relative measure used to compare the degree of variation between two data sets.
This document provides definitions and explanations of various geographic and statistical concepts. It discusses questionnaires, measures of central tendency, statistical series, measures of central location, dispersion, relative and absolute distance, location quotients, Lorenz curves, sampling methods, interviews, and observational methods.
This document discusses descriptive statistics and provides information on various descriptive statistics measures. It defines descriptive statistics as means of organizing and summarizing observations. It describes different types of descriptive statistics including measures of central tendency such as mean, median and mode, and measures of dispersion such as range, variance, standard deviation and interquartile range. Examples are provided to demonstrate how to calculate mean, median and mode from a data set. Additional measures like percentiles, quartiles, boxplots, skewness and kurtosis are also explained.
This document discusses statistical procedures and their applications. It defines key statistical terminology like population, sample, parameter, and variable. It describes the two main types of statistics - descriptive and inferential statistics. Descriptive statistics summarize and describe data through measures of central tendency (mean, median, mode), dispersion, frequency, and position. The mean is the average value, the median is the middle value, and the mode is the most frequent value in a data set. Descriptive statistics help understand the characteristics of a sample or small population.
This document provides an overview of quantitative data analysis. It discusses data preparation, descriptive statistics such as measures of central tendency and dispersion, inferential statistics, and interpretation of results. The key steps in quantitative analysis are described as data preparation, describing the data through descriptive statistics, drawing inferences through inferential statistics, and interpreting the findings. Common statistical techniques like mean, median, mode, standard deviation, and correlation are also summarized.
This document provides an introduction to statistics and data visualization. It discusses key topics including descriptive and inferential statistics, variables and types of data, sampling techniques, organizing and graphing data, measures of central tendency and variation, and random variables. Specifically, it defines statistics as collecting, organizing, summarizing, analyzing and making decisions from data. It also outlines the main differences between descriptive statistics, which describes data, and inferential statistics, which uses samples to make estimations about populations.
Unit I Introduction
1. Data: The values recorded in an experiment or observation are called data.
1.1. Types of Data:
1.1.1. Primary Data: The data collected directly by an investigator is called primary data. It is first-hand
information.
1.1.2. Secondary Data: The data collected from another source is called secondary data. E.g.
data collected from newspapers, journals etc.
2. Biological Data: Biological data are data or measurements collected from biological sources,
which are often stored or exchanged in a digital form.
E.g. DNA base-pair sequences, and population data used in
ecology.
2.1. Data Measurement Scale: There are four data measurement scales: nominal, ordinal,
interval, and ratio.
2.1.1. Nominal Scale: Nominal scales are used for labelling variables, without
any quantitative value. “Nominal” scales could simply be called “labels.”
Note: a sub-type of nominal scale with only two categories (e.g. male/female) is called
“dichotomous.”
2.1.2. Ordinal Scale: Ordinal scales are typically measures of non-numeric concepts like
satisfaction, happiness, discomfort, etc. The values can be ranked, but the intervals between
them are not defined.
2.2. Types of Biological Data:
2.2.1. Continuous Data: Continuous data can take any value between two specified values.
It is not countable but measurable.
Values can be integers or decimals.
It is quantitative data.
It has an infinite or unlimited number of possible values.
E.g. height of students, weight of students etc.
2.2.2. Discrete Data: The values are countable.
Data is in the form of integers (whole numbers), not decimals.
It is quantitative.
It has a finite number of values.
It is discontinuous.
E.g. number of absentees in a class.
3. Graphical Distribution: Presenting data in the form of a graph is called graphic presentation of
data.
3.1. Graph:
A graph is a geometric image of data.
A graph is a diagram representing statistical data by lines.
A graph has two intersecting lines called axes.
The horizontal line is called the X-axis. The vertical line is called the Y-axis.
3.2. Frequency Distribution Graphs: Graphs obtained by plotting grouped data are called
frequency distribution graphs.
3.2.1. Histogram: Histogram is a graph containing frequencies in the form of vertical rectangles.
It is an area diagram.
It is a graphical presentation of frequency distribution.
The X – axis is marked with class intervals.
The Y – axis is marked with frequencies.
Vertical rectangles are drawn as per the height of the frequency of each class. Rectangles
are drawn without any gap in between.
Histogram is a two dimensional diagram.
Fig: Example of a histogram
3.2.2. Bar Graph:
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent.
The bars can be plotted vertically or horizontally.
A vertical bar chart is sometimes called a column graph.
A bar graph shows comparisons among discrete categories.
Fig: Example of a Bar Graph
3.2.3. Box Plot:
In descriptive statistics, a box plot is a method for graphically depicting groups of
numerical data through their quartiles.
Box plots may also have lines extending vertically from the boxes (whiskers) indicating
variability outside the upper and lower quartiles, hence the terms box-and-whisker
plot and box-and-whisker diagram.
Outliers may be plotted as individual points.
Box plots are non-parametric: they display variation in samples of a statistical
population without making any assumptions about the underlying statistical distribution.
Fig: Example of a Box Plot
3.2.4. Frequency Polygon:
A frequency polygon is obtained from a histogram by joining the midpoints of the tops of the
rectangles with straight lines.
Polygon means a figure with many angles.
It is an area diagram. Polygon is a graph. It is the graphical representation of
frequency distribution.
The X – axis is marked with class intervals.
The Y – axis is marked with frequencies.
The mid points of the top of the rectangles are joined by straight lines.
3.2.4.1. Uses of Frequency Polygon:
It simplifies complex data.
It gives an idea of the pattern of distribution of variables in the population.
It facilitates comparison of two or more frequency distributions on the same graph.
It gives a clear picture of the data.
Fig: Example of a Frequency Polygon
3.3. Cumulative Frequency Distribution: The cumulative frequency distribution is a statistical
table where the frequencies of the preceding classes are added. As per the example:

Class       Frequency
0 – 9           3
10 – 19         9
20 – 29        11
30 – 39         7
Total          30

Table: Continuous Frequency Distribution
Class       Frequency   Cumulative Frequency
0 – 9           3             3
10 – 19         9            12
20 – 29        11            23
30 – 39         7            30

Table: Cumulative Frequency Distribution
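The cumulative column in the table above can be reproduced with a short Python sketch (illustrative only, not part of the original notes):

```python
from itertools import accumulate

# Class frequencies from the table above.
classes = ["0 - 9", "10 - 19", "20 - 29", "30 - 39"]
frequencies = [3, 9, 11, 7]

# Running totals of the frequencies give the cumulative frequencies.
cumulative = list(accumulate(frequencies))
for cls, f, cf in zip(classes, frequencies, cumulative):
    print(f"{cls:>7}  {f:>2}  {cf:>2}")
# The last cumulative frequency equals the total frequency, 30.
```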
4. Population:
In biology, a population is all the organisms of the same group or species, which live in
a particular geographical area, and have the capability of interbreeding.
The area of a sexual population is the area where inter-breeding is potentially possible
between any pair within the area, and where the probability of interbreeding is greater
than the probability of cross-breeding with individuals from other areas.
Fig: The distribution of human world population in 1994
5. Sampling:
Sampling is a method of collection of data.
Sample is a representative fraction of a population.
When the population is very large or infinite, sampling is the suitable method for data
collection.
Example: The Oxygen content of pond water can be found by titrating just 100 ml of
water.
There are two types of sampling, namely
1. Random Sampling.
2. Non-random Sampling.
5.1. Random Sampling:
In random sampling a small group is selected from a large population without any aim or
predetermination. The small group is called sample.
In this method each item of population has an equal and independent chance of being
included in the sample.
The random sample is selected by lottery method.
5.1.1. Simple Random Sampling:
In this method a sample is selected such that each item of the population has an equal and
independent chance of being included in the sample.
In this method, a certain number of items are chosen at random without any pre-determined
basis.
5.1.2. Stratified Random Sampling:
This sampling technique is generally recommended when the population is
heterogeneous.
In this method, whole of the population is divided into strata or sub groups possessing the
similar characteristics.
Samples are selected taking equal proportion of items from each group.
Example: We want to select 100 students from a population of 1000 students, consisting of
700 girls and 300 boys. The whole population is divided into two strata – 700 girls and
300 boys. Simple random sampling is then used to select 70 girls and 30 boys, giving a
sample of 100 students.
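The proportional allocation in this example can be computed directly; the following is an illustrative sketch (the helper name proportional_allocation is ours, not from the notes):

```python
# Proportional (stratified) allocation: each stratum contributes to the
# sample in proportion to its share of the population.
def proportional_allocation(strata_sizes, sample_size):
    population = sum(strata_sizes.values())
    return {name: round(sample_size * size / population)
            for name, size in strata_sizes.items()}

# The strata from the example: 700 girls and 300 boys, sample of 100.
allocation = proportional_allocation({"girls": 700, "boys": 300}, 100)
print(allocation)  # {'girls': 70, 'boys': 30}
```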
5.1.3. Systematic Random Sampling:
It is also known as Quasi Random Sampling.
In this method, all the items are arranged in some spatial or temporal order, and every
kth item is then selected, starting from a randomly chosen item.
Example: persons listed alphabetically in a telephone directory, plants growing in rows
in a field.
Unit II Descriptive Statistics
1. Measures of Central Tendency:
A measure of central tendency is a single value that attempts to describe a set of data
by identifying the central position within that set of data.
Measures of central tendency are sometimes called measures of central location.
They are also classed as summary statistics.
1.1. Mean (Arithmetic):
The mean is equal to the sum of all the values in the data set divided by the number of values in
the data set. So, if we have n values in a data set with values x1, x2, ..., xn, the sample
mean, usually denoted by x̄ (pronounced "x bar"), is:
x̄ = (x1 + x2 + ... + xn) / n
This formula is usually written in a slightly different manner using the Greek capital letter Σ,
pronounced "sigma", which means "sum of...":
x̄ = (Σ x) / n
1.1.1. Significance:
One of its important properties is that it minimizes error in the prediction of any one
value in your data set. That is, it is the value that produces the lowest amount of error
from all other values in the data set.
An important property of the mean is that it includes every value in your data set as part
of the calculation.
In addition, the mean is the only measure of central tendency where the sum of the
deviations of each value from the mean is always zero.
1.2. Median:
The median is the middle score for a set of data that has been arranged in order of magnitude.
The median is less affected by outliers and skewed data. In order to calculate the median,
suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark – in this case, 56. It is the middle mark
because there are 5 scores before it and 5 scores after it. This works fine when there is an odd
number of scores; with an even number of scores (for example 10 scores) we simply
take the middle two scores and average them. Example:
65 55 89 56 35 14 56 55 87 45
We again rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89
Now we take the 5th and 6th scores in our data set and average them to get a median
of 55.5.
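The two worked examples above can be checked with a short script (an illustrative sketch, not part of the original notes):

```python
# Mean and median, as defined in the notes.
def mean(data):
    return sum(data) / len(data)

def median(data):
    ordered = sorted(data)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                  # odd count: the single middle score
        return ordered[mid]
    # even count: average of the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
even_data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]
print(median(odd_data))   # 56
print(median(even_data))  # 55.5
```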
1.3. Mode:
The mode is the most frequent score in our data set. It corresponds to the highest bar
in a bar chart or histogram. Normally, the mode is used for categorical data where we wish to
know which is the most common category.
To find the mode of ungrouped data, the values are arranged in ascending order. The value
which occurs the maximum number of times is the mode.
18 21 23 23 25 25 25 27 29 29
In the above data 25 occurs the maximum number of times. So 25 is the mode.
However, one of the problems with the mode is that it is not unique, so it leaves us with
problems when two or more values share the highest frequency.
Mode is a positional average. It is a measure of central value. When the data has one concentration
of frequency, it is called unimodal. When it has more than one concentration it is called
bimodal (for 2 concentrations) or trimodal (for 3 concentrations).
1.3.1. Significance:
No mathematical calculation is needed.
The mode can be found easily.
However, it is not a very reliable measure.
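A quick way to find the mode, including the multi-modal case discussed above, is Python's statistics module (illustrative sketch; multimode requires Python 3.8+):

```python
from statistics import mode, multimode

data = [18, 21, 23, 23, 25, 25, 25, 27, 29, 29]
print(mode(data))       # 25 — the single most frequent value
print(multimode(data))  # [25]

# When two values share the highest frequency the data is bimodal,
# and multimode returns both.
bimodal = [1, 1, 2, 2, 3]
print(multimode(bimodal))  # [1, 2]
```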
1.4. Range:
Range is the difference between the lowest value and highest value of a set of data.
Range = Largest value (Xm) – Smallest value (X0)
1.4.1. Coefficient of Range:
This is a relative measure of dispersion and is based on the value of the range. It is also called
the range coefficient of dispersion. It is defined as:
Coefficient of Range = (Xm – X0) / (Xm + X0)
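Both measures can be computed directly from their definitions (an illustrative sketch, not part of the original notes; it reuses the dog-height data from the standard deviation example later in this unit):

```python
# Range = largest value minus smallest value.
def data_range(values):
    return max(values) - min(values)

# Coefficient of range = (Xm - X0) / (Xm + X0), a relative measure.
def coefficient_of_range(values):
    xm, x0 = max(values), min(values)
    return (xm - x0) / (xm + x0)

heights = [600, 470, 170, 430, 300]  # dog heights in mm
print(data_range(heights))                       # 430
print(round(coefficient_of_range(heights), 3))   # 0.558
```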
2.1. Variance: Variance is the average of the squared differences from the mean. Steps involved
in calculating the variance:
i. Calculate mean.
ii. Subtract the mean from each value.
iii. Square the result.
iv. Add the squared numbers.
v. Take the average of the squared results.
2.2. Standard Deviation:
Standard deviation is a measure of dispersion.
The Standard Deviation is a measure of how spread out the numbers are.
Its symbol is SD or σ (the Greek letter sigma).
The formula: it is the square root of the Variance.
Example:
The heights of the dogs (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
The first step is to find the Mean:
Mean = (600 + 470 + 170 + 430 + 300) / 5
     = 1970 / 5
     = 394
To calculate the Variance, take each difference from the mean, square it, and then average the result:
Variance σ² = [ (206)² + (76)² + (−224)² + (36)² + (−94)² ] / 5
            = (42436 + 5776 + 50176 + 1296 + 8836) / 5
            = 108520 / 5
            = 21704
So the Variance is 21,704.
And the Standard Deviation is just the square root of the Variance, so:
Standard Deviation (σ) = √21704
                       = 147.32...
                       = 147 (to the nearest mm)
Therefore,
The square of the standard deviation is called the variance.
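The worked example can be verified in code, following the five steps listed above for calculating the variance (an illustrative sketch, not part of the original notes):

```python
import math

# Population variance, following the steps in the notes.
def variance(values):
    m = sum(values) / len(values)                    # step i: the mean
    squared_diffs = [(x - m) ** 2 for x in values]   # steps ii-iii
    return sum(squared_diffs) / len(values)          # steps iv-v

heights = [600, 470, 170, 430, 300]  # dog heights in mm
var = variance(heights)
sd = math.sqrt(var)                  # SD is the square root of the variance
print(var)        # 21704.0
print(round(sd))  # 147 (to the nearest mm)
```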
2.3. Coefficient of Variation (CV): The Coefficient of Variation is the standard deviation
expressed as a percentage of the mean. It is a relative measure of dispersion.
Coefficient of Variation = (SD / X) * 100
Where,
SD = Standard Deviation, X = Mean
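Continuing with the dog-height data, the CV can be computed as follows (an illustrative sketch, not part of the original notes):

```python
import math

# Coefficient of variation: SD expressed as a percentage of the mean.
# Being unitless, it lets us compare the variation of two data sets.
def coefficient_of_variation(values):
    m = sum(values) / len(values)
    sd = math.sqrt(sum((x - m) ** 2 for x in values) / len(values))
    return sd / m * 100

heights = [600, 470, 170, 430, 300]  # dog heights in mm
print(round(coefficient_of_variation(heights), 1))  # 37.4
```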
2.4. Grouped Data:
Grouped data is data that has been organized into groups known as classes.
Grouped data has been 'classified' and thus some level of data analysis has taken place,
which means that the data is no longer raw.
A data class is a group of data which is related by some user-defined property. For
example, while collecting the ages of people we could group them into classes such as
those in their teens, twenties, thirties, forties and so on. Each of those groups is called a
class.
Each of those classes is of a certain width and this is referred to as the Class
Interval or Class Size.
This class interval is very important when it comes to drawing histograms and frequency
diagrams. All the classes may have the same or different class sizes.
Below is an example of grouped data where the classes have the same class interval.
Age (years) Frequency
0 - 9 12
10 - 19 30
20 - 29 18
30 - 39 12
40 - 49 9
50 - 59 6
60 - 69 0
Below is an example of grouped data where the classes have different class interval.
Age (years) Frequency Class Interval
0 - 9 15 10
10 - 19 18 10
20 - 29 17 10
30 - 49 35 20
50 - 79 20 30
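Grouping raw values into equal-width classes can be sketched as follows (illustrative only; the ages below are made-up example values, and the class interval of 10 matches the first table):

```python
from collections import Counter

# Group raw ages into classes of equal width (0-9, 10-19, ...).
def group_ages(ages, class_width=10):
    # Map each age to the lower bound of its class, then count per class.
    counts = Counter((age // class_width) * class_width for age in ages)
    return {f"{lo} - {lo + class_width - 1}": counts[lo]
            for lo in sorted(counts)}

ages = [5, 12, 17, 23, 23, 35, 41, 8, 19, 64]
print(group_ages(ages))
# {'0 - 9': 2, '10 - 19': 3, '20 - 29': 2, '30 - 39': 1, '40 - 49': 1, '60 - 69': 1}
```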
2.5. Graphical Methods: These methods are applied to visually describe data from a sample or
population. Graphs provide visual summaries of data that describe the essential information
more quickly and completely than tables of numbers.
There are many types of graphical representation:
2.5.1. The Bar Chart: To Construct a Bar Chart,
Place categories on the horizontal axis,
Then place frequency (or relative frequency) on the vertical axis.
After that construct vertical bars of equal width, one for each category.
Its height is proportional to the frequency (or relative frequency) of the
category.
Fig: Example of a Bar Chart
2.5.2. The Pie Chart: To draw a pie chart,
Draw a complete circle that represents the total number of measurements.
Partition it into slices – one for each category.
The size of a slice is proportional to the relative frequency of that
category.
Determine the angle of each slice by multiplying the relative frequency by
360 degrees.
Fig: Example of a Pie Chart, Use of different Web Browsers
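The slice angles follow directly from the rule above (an illustrative sketch; the browser counts are made-up example values):

```python
# Angle of each pie slice = relative frequency × 360 degrees.
def pie_angles(counts):
    total = sum(counts.values())
    return {category: n * 360 / total for category, n in counts.items()}

# Hypothetical web-browser usage counts.
browsers = {"Chrome": 60, "Firefox": 25, "Other": 15}
print(pie_angles(browsers))  # {'Chrome': 216.0, 'Firefox': 90.0, 'Other': 54.0}
```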
2.5.3. Histogram: Histogram is a graph containing frequencies in the form of vertical rectangles.
It is an area diagram.
It is a graphical presentation of frequency distribution.
The X – axis is marked with class intervals.
The Y – axis is marked with frequencies.
Vertical rectangles are drawn as per the height of the frequency of each class. Rectangles
are drawn without any gap in between.
Histogram is a two dimensional diagram.
Fig: Example of a histogram
2.5.4. Quantile Plots: These visually portray the quantiles, or percentiles (which equal the
quantiles times 100), of the distribution of sample data. Quantiles of importance such as the
median are easily discerned (quantile, or cumulative frequency = 0.5). The main benefits of quantile
plots are as follows:
i. Arbitrary categories are not required, as with histograms or stem-and-leaf plots (S-L's).
ii. All of the data are displayed, unlike a box plot.
iii. Every point has a distinct position, without overlap.
Fig: Example of a Quantile Plot
2.5.5. Box Plot:
In descriptive statistics, a box plot is a method for graphically depicting groups of
numerical data through their quartiles.
Box plots may also have lines extending vertically from the boxes (whiskers) indicating
variability outside the upper and lower quartiles, hence the terms box-and-whisker
plot and box-and-whisker diagram.
Outliers may be plotted as individual points.
Box plots are non-parametric: they display variation in samples of a statistical
population without making any assumptions about the underlying statistical distribution.
Fig: Example of a Box Plot
2.5.6. Benefits of Graphical Representation:
1. Acceptability: A graphical report is acceptable to people with busy schedules because
it highlights the theme of the report at a glance. This helps to avoid wastage of time.
2. Comparative Analysis: Information can be compared through graphical
representation. Such comparative analysis aids quick understanding and attention.
3. Less Cost: Descriptive information takes considerable time to present properly and
more money to print, but a graphical presentation can convey the report in a short,
catchy form that is easy to understand. It therefore involves less cost.
4. Decision Making: Business executives can view graphs at a glance and make
decisions very quickly, which is hardly possible with a descriptive report.
5. Logical Ideas: If tables, designs and graphs are used to represent information, a
logical sequence is created that clarifies the idea for the audience.
6. Helpful for a Less Educated Audience: Less literate people can understand a
graphical representation easily because it does not require going through a
descriptive report line by line.
7. Less Effort and Time: Presenting a table, design, image or graph requires less effort
and time. Furthermore, such presentation enables quick understanding of the information.
8. Fewer Errors and Mistakes: Qualitative, informative or descriptive reports are prone to
errors and mistakes. As graphical representations are exhibited through numerical figures,
tables or graphs, they usually involve fewer errors and mistakes.
9. A Complete Idea: Such representation creates a clear and complete idea in the mind of the
audience. Reading a hundred pages may not provide any basis for a decision, but an
instant view, or a single glance, makes an impression on the audience regarding
the topic or subject.
10. Use on the Notice Board: Such a representation can be hung on the notice board to
quickly draw the attention of employees in any organization.
2.5.7. Drawbacks of Graphical Representation:
1. Expensive: Graphical representations of reports are costly because they involve images,
colors and paints. The combination of materials and human effort makes graphical
presentation expensive.
2. More Time: Graphical representation takes more time, as it requires graphs and figures
that are time-consuming to prepare.
3. Errors and Mistakes: Since graphical representations are complex, there is a chance of
errors and mistakes. This causes problems of understanding for general people.
4. Lack of Privacy: Graphical representation makes a full presentation of the information,
which may hamper the objective of keeping something secret.
5. Problems in Selecting the Appropriate Method: Information can be presented through
various graphical methods and ways, and it is very hard to select the most suitable method.
6. Problem of Understanding: Not everyone can understand the meaning of a graphical
representation, because it involves various technical matters that are complex to general
people.
2.6. Obtaining Descriptive Statistics on Computer (MS Excel):
Suppose we have the scores of 14 participants for a test.
To generate descriptive statistics for these scores, execute the following steps:
1. On the Data tab, in the Analysis group, click Data Analysis.
Note: if you can't find the Data Analysis button, load the Analysis ToolPak add-in first.
2. Select Descriptive Statistics and click OK.
3. Select the range A2:A15 as the Input Range.
4. Select cell C1 as the Output Range.
5. Make sure Summary statistics is checked.
6. Click OK.
Result:
3. Case Study:
In the social sciences and life sciences, a case study is a research method involving an
up-close, in-depth, and detailed examination of a subject of study (the case), as well as its
related contextual conditions.
Case studies can be produced by following a formal research method. These case studies are
likely to appear in formal research venues, such as journals and professional conferences, rather
than popular works. The resulting body of 'case study research' has long had a prominent place in
many disciplines and professions, ranging from psychology, anthropology, sociology, and
political science to education, clinical science, social work, and administrative science.
3.1. Types of Case Studies:
Under the more generalized category of case study exist several subdivisions, each of which is
custom selected for use depending upon the goals of the investigator. These types of case study
include the following:
Illustrative case studies: These are primarily descriptive studies. They typically utilize one
or two instances of an event to show the existing situation. Illustrative case studies serve
primarily to make the unfamiliar familiar and to give readers a common language about the
topic in question.
Exploratory (or pilot) case studies: These are condensed case studies performed before
implementing a large-scale investigation. Their basic function is to help identify questions
and select types of measurement prior to the main investigation. The primary pitfall of this
type of study is that initial findings may seem convincing enough to be released prematurely
as conclusions.
Cumulative case studies: These serve to aggregate information from several sites collected
at different times. The idea behind these studies is that the collection of past studies will
allow for greater generalization without additional cost or time being expended on new,
possibly repetitive studies.
Critical instance case studies: These examine one or more sites either for the purpose of
examining a situation of unique interest with little to no interest in generalization, or to call
into question a highly generalized or universal assertion. This method is useful for answering
cause and effect questions.
Unit III: Probability and Distribution
1. Probability: Probability is the proportion of times an event occurs in a set of trials. The word
‘probability’ means chance; likely to happen.
Probability is calculated by the following formula:
P = e / t
Where,
P = probability
e = number of times an event occurs, or frequency
t = total number of trials or items
The probability value is always a fraction falling between 0 and 1.
Example:
When a die numbered from 1 to 6 is tossed, the total number of possible outcomes is 6. The
probability of any one number is 1/6 = 0.17.
p is the probability of an event occurring; q is the probability of the event not occurring.
So, when p is known, q can be calculated:
q = 1 – p
p = 1 – q
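The formula P = e / t can be sketched directly in Python; the die example from the text:

```python
# Sketch of the basic probability formula P = e / t.
# Example: the chance of rolling any one face of a fair six-sided die.
e = 1          # number of favourable outcomes (frequency of the event)
t = 6          # total number of equally likely outcomes
p = e / t      # probability of the event occurring
q = 1 - p      # probability of the event not occurring
print(round(p, 2), round(q, 2))
```

Note that p and q always sum to 1, which is why either one can be recovered from the other.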
1.1. Laws of Probability: There are two types of theorems of probability, namely
1. Addition theorems. 2. Multiplication theorems.
1.1.1. Addition Theorem:
The probability that one of several mutually exclusive events occurs is the sum of the
individual probabilities of the events.
Mutually exclusive events cannot occur simultaneously.
The occurrence of one event prevents the occurrence of the other events.
Example: In a coin-tossing experiment, the occurrence of a head excludes the occurrence of
a tail.
If the probability of a head is p(A) and that of a tail is p(B),
then,
Probability of head or tail = p(A or B) = p(A) + p(B)
1.1.2. Multiplication Theorem:
The probability of the joint occurrence of two independent events is the product of their
individual probabilities.
For independent events, the probability is calculated by multiplication.
One independent event does not affect the occurrence of the other. When two coins are
tossed, the result of the first coin does not affect the second coin.
Example: For two independent events with probabilities P(A) and P(B),
Probability of both A and B = P(A and B) = P(A) × P(B)
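Both theorems can be sketched with the coin events from the text:

```python
# Sketch: the two theorems applied to simple coin events.
p_head = 0.5   # P(head) on one fair toss
p_tail = 0.5   # P(tail) on one fair toss

# Addition theorem (mutually exclusive events on ONE toss):
p_head_or_tail = p_head + p_tail        # head and tail cannot occur together

# Multiplication theorem (independent events on TWO tosses):
p_two_heads = p_head * p_head           # the first toss does not affect the second

print(p_head_or_tail, p_two_heads)
```

The addition rule applies only because head and tail are mutually exclusive; the multiplication rule applies only because the two tosses are independent.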
2. Random Events: Experiments whose outcome is unknown in advance are called random
experiments. For example, when we toss a coin, we do not know if it will land heads up or tails
up; hence tossing a coin is a random experiment. Another example is the result of an interview or
examination.
When we speak about random experiments, we have to know what the sample space is.
The sample space, denoted by S, is the set of all possible outcomes of a random experiment.
Example: Consider the random experiment of tossing a die, and let us write down its sample
space S. A die has 6 faces numbered 1, 2, 3, 4, 5, 6, and when we toss it once, only one of the
faces will turn up. Hence the sample space is
S = {1, 2, 3, 4, 5, 6}
Consider one more simple example, tossing two coins. The sample space here is
S = {(H,H), (H,T), (T,H), (T,T)}
Here,
H: head; T: tail
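Both sample spaces above can be enumerated programmatically; a minimal sketch:

```python
from itertools import product

# Sketch: writing down the two sample spaces from the text.
die = list(range(1, 7))                    # S = {1, 2, 3, 4, 5, 6}
two_coins = list(product("HT", repeat=2))  # S = {(H,H), (H,T), (T,H), (T,T)}

print(die)
print(two_coins)
```

Enumerating the sample space this way is useful later, because the probability of any event is just the count of favourable outcomes divided by the size of S.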
3. Exhaustive Events: Two or more events are said to be exhaustive if, when they are all
considered together, at least one of them is certain to occur. Exhaustive events can be either
elementary or compound.
Example: Consider the experiment of throwing a fair die. There are six outcomes, all
equally likely to occur, and the events of getting the different numbers, taken together, are
exhaustive: at least one of them is certain to happen on any throw.
4. Mutually Exclusive Events: Mutually exclusive events cannot occur together simultaneously.
The occurrence of one event prevents the occurrence of the other event. Mutually exclusive
events are connected by the words ‘either/or’.
Example: Head and tail of a coin.
5. Equally Likely Events: Equally likely events have equal chances of occurrence.
Example: Winning or losing a game. Head or tail of a coin.
6. Binomial Distribution: A binomial distribution can be thought of as simply the probability
of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times.
The binomial is a type of distribution that has two possible outcomes (the prefix “bi” means
two, or twice).
Examples of binomial distribution problems:
The number of defective/non-defective products in a production run.
Yes/No Survey (such as asking 150 people if they watch ABC news).
Vote counts for a candidate in an election.
The number of successful sales calls.
The number of male/female workers in a company.
6.1. Criteria of Binomial Distributions:
i. The number of observations or trials is fixed. In other words, you can only figure out
the probability of something happening if you do it a certain number of times. This is
common sense: if you toss a coin once, your probability of getting a tails is 50%; if
you toss a coin 20 times, your probability of getting at least one tails is very, very
close to 100%.
ii. Each observation or trial is independent. In other words, none of your trials have an
effect on the probability of the next trial.
iii. The probability of success (tails, heads, fail or pass) is exactly the same from one trial
to another.
6.2. Formula:
Notations for the binomial distribution and its mass formula:
P(x) = nCx · p^x · q^(n−x)
Where:
p is the probability of success on any trial.
q = 1 − p is the probability of failure.
n is the number of trials/experiments.
x is the number of successes; it can take the values 0, 1, 2, 3, ..., n.
nCx = n! / (x!(n − x)!) denotes the number of combinations of n elements
taken x at a time.
Problem: A box of candies has many different colors in it. There is a 15% chance of getting a
pink candy. What is the probability that exactly 4 candies out of 10 in a box are pink?
We have:
n = 10, p = 0.15, q = 0.85, x = 4
Substituting into the formula:
P(4) = 10C4 × (0.15)^4 × (0.85)^6 ≈ 0.04
Interpretation: The probability that exactly 4 candies in a box are pink is about 0.04.
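The candy calculation can be checked directly with the mass formula in Python:

```python
from math import comb

# Sketch: the candy problem, P(x) = nCx * p**x * q**(n - x).
n, p, x = 10, 0.15, 4
q = 1 - p
prob = comb(n, x) * p**x * q**(n - x)
print(round(prob, 2))  # prints 0.04
```

`math.comb(n, x)` computes nCx exactly, so the only rounding here comes from the final display.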
6.3. Properties of Binomial Distribution:
1. The binomial distribution is applicable when the trials are independent and each trial has just
two outcomes, success and failure. It is applied in coin-tossing experiments, sampling
inspection plans, genetic experiments and so on.
2. The binomial distribution is known as a bi-parametric distribution, as it is characterized by two
parameters, n and p. This means that if the values of n and p are known, the distribution is
known completely.
3. The mean of the binomial distribution is given by
μ = np
4. Depending on the values of the two parameters, the binomial distribution may be uni-modal or
bi-modal.
To find the mode of the binomial distribution, first compute the value of (n+1)p.
(n+1)p is not an integer --------> uni-modal
Here, the mode = the largest integer contained in (n+1)p
(n+1)p is an integer --------> bi-modal
Here, the modes = (n+1)p and (n+1)p − 1
5. The variance of the binomial distribution is given by
σ² = npq
6. Since q is numerically less than 1,
npq < np
That is, the variance of a binomial variable is always less than its mean.
7. The variance of a binomial variable X attains its maximum value at p = q = 0.5, and this
maximum value is n/4.
8. Additive property of the binomial distribution:
Let X and Y be two independent binomial variables,
where X has the parameters n₁ and p,
and
Y has the parameters n₂ and p.
Then (X + Y) will also be a binomial variable, with the parameters (n₁ + n₂) and p.
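Properties 3 to 6 and the mode rule can be checked numerically; a sketch, with n and p chosen arbitrarily for illustration:

```python
from math import floor

# Sketch: checking binomial properties for assumed parameters n = 10, p = 0.15.
n, p = 10, 0.15
q = 1 - p

mean = n * p           # property 3: mu = np
var = n * p * q        # property 5: sigma^2 = npq
assert var < mean      # property 6: variance is always less than the mean (q < 1)

m = (n + 1) * p        # mode test: (n+1)p = 1.65, not an integer -> uni-modal
mode = floor(m)        # the mode is the largest integer contained in (n+1)p
print(mean, var, mode)
```

With these parameters the distribution is uni-modal with mode 1; choosing n and p so that (n+1)p is an integer would instead give the bi-modal case.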
7. Poisson Distribution:
The Poisson distribution was devised by Poisson in 1837.
It is a discrete frequency distribution.
The Poisson distribution describes the occurrence of rare events; hence it is called the
law of improbable events.
When the probability of the event is very small in a large number of trials, the resulting
distribution is called a Poisson distribution.
Example: The number of deaths due to heart attack in a hospital or a town.
7.1. Properties of Poisson Distribution:
The probability of success of the event (p) is very small and approaches zero.
The probability of failure of the event (q) is very high and almost equal to 1, and n is
also large.
The Poisson distribution has a single parameter, the mean, denoted by m:
m = np = constant.
The formula used for the Poisson distribution is as follows:
Probability of r successes: P(r) = e^(−m) · m^r / r!
where
r = 0, 1, 2, 3, ..., n successes
e = 2.7183 (constant)
The SD (standard deviation) of the Poisson distribution is √m = √np.
Variance = SD² = m = np.
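The Poisson mass formula is easy to evaluate directly; a sketch, with the mean m assumed for illustration:

```python
from math import exp, factorial

# Sketch of P(r) = e^(-m) * m^r / r!, the Poisson probability of r rare events.
m = 2.0   # assumed mean number of events (m = np)
r = 3     # number of occurrences we ask about

p_r = exp(-m) * m**r / factorial(r)
print(round(p_r, 4))
```

Note how the single parameter m fixes the whole distribution, matching the property that mean and variance are both equal to m.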
8. Normal Distribution: The normal distribution is a continuous probability distribution. In this
distribution the values are clustered closely around the centre, and the frequencies decrease
towards the left and right.
Example: The height of students in a class is a typical example of a normal distribution.
The height of most students will be between 150 cm and 170 cm; only a few students will be
shorter than 150 cm, and only a few will be taller than 170 cm.
Thus there is an increasing number of observations towards the middle point and a decreasing
number towards the ends.
8.1. Properties of Normal Distribution:
The graph obtained for the normal distribution is called the normal distribution curve.
The normal distribution curve is obtained when the values are given on the X axis and the
number of individuals (frequency) on the Y axis.
The normal distribution curve is symmetrical. It is bell shaped.
Fig: Example of a Normal Distribution Curve
The normal distribution curve is also called the Gaussian curve, named after its discoverer,
Carl Gauss.
The normal distribution curve is a continuous distribution. It is associated with height,
weight, age, rate of respiration, etc.
It has only one maximum peak; hence it is a unimodal curve.
The height of the normal curve is maximum at its mean.
Mean, median and mode are equal for the normal distribution: Mean = Median = Mode.
Most of the values are clustered around the mean, and there are relatively few
observations at the extreme ends.
The normal curve never touches the horizontal axis.
The mean deviation is approximately 4/5 (about 0.8 times) the standard deviation.
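The symmetry and single peak of the normal curve can be verified from its density function; a sketch, with μ and σ assumed for illustration (e.g. student heights in cm):

```python
from math import exp, pi, sqrt

# Sketch: the normal density is symmetric and peaks at the mean.
mu, sigma = 160.0, 10.0   # assumed mean and standard deviation

def pdf(x):
    # density of the normal distribution with mean mu and SD sigma
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# symmetry: equal density at equal distances either side of the mean
left, right = pdf(mu - 15), pdf(mu + 15)
peak = pdf(mu)            # the maximum height of the curve is at the mean
print(round(left, 5), round(right, 5), round(peak, 5))
```

Because the curve is symmetric and unimodal, the mean, median and mode all fall at the same point, μ.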
Unit IV: Correlation and Regression Analysis
1. Correlation:
Correlation, in the finance and investment industries, is a statistic that measures the
degree to which two securities move in relation to each other.
Correlations are used in advanced portfolio management, computed as the correlation
coefficient, which has a value that must fall between -1.0 and +1.0.
Correlation is a statistic that measures the degree to which two variables move in relation
to each other.
1.1. Definition of Correlation:
According to Taro Yamane, “Correlation analysis is a discussion of the degree of
closeness of the relationship between two variables.”
According to Ya Lun Chou, “Correlation analysis attempts to determine the degree of
relationship between variables.”
According to Prof. Bodding, “Wherever some definite connection exists between 2 or
more groups, classes or series of data, there is said to be a correlation.”
A very simple definition is given by A. M. Tuttle, “An analysis of the co-variation of two
or more variables is usually called correlation.”
1.2. The Formula for Correlation:
Correlation measures association, but it does not tell you whether x causes y or vice versa,
or whether the association is caused by some third (perhaps unseen) factor.
1.3. Positive Correlation: A perfect positive correlation means that the correlation coefficient is
exactly 1. This implies that as one security moves, either up or down, the other security moves in
lockstep, in the same direction.
1.4. Negative Correlation: A perfect negative correlation means that two assets move in
opposite directions, while a zero correlation implies no relationship at all.
1.5. Calculation of Correlation (Karl Pearson’s Coefficient of Correlation):
Karl Pearson, a great biometrician and statistician, suggested a mathematical method for
measuring the magnitude of the linear relationship between two variables.
Karl Pearson’s method is the most widely used method in practice and is known as the
Pearsonian coefficient of correlation. It is denoted by the symbol r. The simplest formula
is
r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]
The value of the coefficient of correlation always lies between +1 and −1. When r = +1,
there is a perfect positive correlation between the two variables. When r = −1, there is a
perfect negative correlation. When r = 0, there is no relationship or correlation between the
two variables. Theoretically, we get values which lie between +1 and −1, but normally the
value lies between +0.8 and −0.5.
1.6. Problem: Find the coefficient of correlation between the age of husbands (X) and the age of
wives (Y).
X 23 27 28 28 29 30 31 33 35 36
Y 18 20 22 27 21 29 27 29 28 29
Solution:
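The worked solution table is not reproduced here; a Python sketch computes Pearson's r for the same data:

```python
# Sketch: Pearson's r for the husband/wife ages from the problem.
X = [23, 27, 28, 28, 29, 30, 31, 33, 35, 36]
Y = [18, 20, 22, 27, 21, 29, 27, 29, 28, 29]
n = len(X)

mx, my = sum(X) / n, sum(Y) / n                       # means: 30 and 25
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # sum of co-deviations
sxx = sum((x - mx) ** 2 for x in X)
syy = sum((y - my) ** 2 for y in Y)

r = sxy / (sxx * syy) ** 0.5
print(round(r, 2))  # prints 0.82
```

A value of about +0.82 indicates a strong positive correlation: older husbands tend to have older wives in this sample.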
2. Covariance:
In probability theory and statistics, covariance is a measure of the joint variability of
two random variables.
If the greater values of one variable mainly correspond with the greater values of the
other variable, and the same holds for the lesser values, (i.e., the variables tend to show
similar behavior), the covariance is positive.
In the opposite case, when the greater values of one variable mainly correspond to the
lesser values of the other, (i.e., the variables tend to show opposite behavior), the
covariance is negative.
The sign of the covariance therefore shows the tendency in the linear
relationship between the variables.
2.1. The Covariance Formula:
The formula is:
Cov(X,Y) = Σ (X − μ)(Y − ν) / (n − 1), where:
X is a random variable
E(X) = μ is the expected value (the mean) of the random variable X and
E(Y) = ν is the expected value (the mean) of the random variable Y
n = the number of items in the data set
Example: Calculate the covariance for the following data set:
X: 2.1, 2.5, 3.6, 4.0 (mean = 3.05)
Y: 8, 10, 12, 14 (mean = 11)
Substitute the values into the formula and solve:
Cov(X,Y) = Σ(X − μ)(Y − ν) / (n − 1)
= [(2.1−3.05)(8−11) + (2.5−3.05)(10−11) + (3.6−3.05)(12−11) + (4.0−3.05)(14−11)] / (4−1)
= [2.85 + 0.55 + 0.55 + 2.85] / 3
= 6.8 / 3
= 2.267
The result is positive, meaning that the variables are positively related.
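The same example can be computed directly in code (using the exact mean of X, 3.05; the source's rounding to 3.1 happens to give the same covariance, since the Y deviations sum to zero):

```python
# Sketch: sample covariance for the example data, Cov = sum((X-mu)(Y-nu)) / (n-1).
X = [2.1, 2.5, 3.6, 4.0]
Y = [8, 10, 12, 14]
n = len(X)

mx, my = sum(X) / n, sum(Y) / n   # means of X and Y
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / (n - 1)
print(round(cov, 3))  # prints 2.267 -- positive, so X and Y are positively related
```

Dividing by n − 1 gives the sample covariance, the same convention the formula above uses.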
2.2. Covariance in Excel: Overview
Covariance gives you a positive number if the variables are positively related. You’ll get a
negative number if they are negatively related. A high covariance basically indicates there is a
strong relationship between the variables. A low value means there is a weak relationship.
Covariance in Excel: (Steps)
Step 1: Enter your data into two columns in Excel. For example, type your X values into column
A and your Y values into column B.
Step 2: Click the “Data” tab and then click “Data analysis.” The Data Analysis window will
open.
Step 3: Choose “Covariance” and then click “OK.”
Step 4: Click “Input Range” and then select all of your data. Include column headers if you have
them.
Step 5: Click the “Labels in First Row” check box if you have included column headers in your
data selection.
Step 6: Select “Output Range” and then select an area on the worksheet. A good place to select
is an area just to the right of your data set.
Step 7: Click “OK.” The covariance will appear in the area you selected in Step 6.
3. Scatter Diagram: A scatter diagram is a graph that shows the relationship between two
variables. Scatter diagrams can demonstrate a relationship between any element of a process,
environment, or activity on one axis and a quality defect on the other axis.
3.1. Type of Scatter Diagram
According to the type of correlation, scatter diagrams can be divided into following categories:
Scatter Diagram with No Correlation
Scatter Diagram with Moderate Correlation
Scatter Diagram with Strong Correlation
3.1.1. Scatter Diagram with No Correlation
This type of diagram is also known as “Scatter Diagram with Zero Degree of Correlation”.
In this type of scatter diagram, data points are spread so randomly that you cannot draw any line
through them.
In this case you can say that there is no relation between these two variables.
3.1.2. Scatter Diagram with Moderate Correlation
This type of diagram is also known as “Scatter Diagram with Low Degree of Correlation”.
Here, the data points are a little closer together and you can feel that some kind of relation exists
between these two variables.
3.1.3. Scatter Diagram with Strong Correlation
This type of diagram is also known as “Scatter Diagram with High Degree of Correlation”.
In this diagram, data points are grouped very close to each other such that you can draw a line by
following their pattern.
In this case you will say that the variables are closely related to each other.
As discussed earlier, we can also divide the scatter diagram according to the slope, or trend, of
the data points:
Scatter Diagram with Strong Positive Correlation
Scatter Diagram with Weak Positive Correlation
Scatter Diagram with Strong Negative Correlation
Scatter Diagram with Weak Negative Correlation
Scatter Diagram with Weakest (or no) Correlation
Strong positive correlation means there is a clearly visible upward trend from left to right; a
strong negative correlation means there is a clearly visible downward trend from left to right. A
weak correlation means the trend, up or down, is less clear. A flat line from left to right is the
weakest correlation, as it is neither positive nor negative and indicates that the independent
variable does not affect the dependent variable.
3.1.4. Scatter Diagram with Strong Positive Correlation
This type of diagram is also known as Scatter Diagram with Positive Slant.
In a positive slant, the correlation is positive, i.e. as the value of x increases, the value of y
will also increase. The slope of a straight line drawn along the data points will go up, and
the pattern will resemble a straight line.
For example, if the temperature goes up, cold drink sales will also go up.
3.1.5. Scatter Diagram with Weak Positive Correlation
Here as the value of x increases the value of y will also tend to increase, but the pattern will not
closely resemble a straight line.
3.1.6. Scatter Diagram with Strong Negative Correlation
This type of diagram is also known as Scatter Diagram with Negative Slant.
In negative slant, the correlation will be negative, i.e. as the value of x increases, the value of y
will decrease. The slope of a straight line drawn along the data points will go down.
For example, if the temperature goes up, sales of winter coats go down.
3.1.7. Scatter Diagram with Weak Negative Correlation
Here as the value of x increases the value of y will tend to decrease, but the pattern will not be as
well defined.
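The correlation types behind these scatter patterns can be illustrated numerically; a sketch with made-up data and a small Pearson-r helper:

```python
# Sketch: the sign of r for the scatter patterns described above.
def pearson_r(X, Y):
    # Pearson's correlation coefficient for two equal-length lists
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    sxx = sum((x - mx) ** 2 for x in X)
    syy = sum((y - my) ** 2 for y in Y)
    return sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4, 5]
strong_pos = pearson_r(xs, [2, 4, 6, 8, 10])   # perfect upward trend: r = +1
strong_neg = pearson_r(xs, [10, 8, 6, 4, 2])   # perfect downward trend: r = -1
print(strong_pos, strong_neg)
```

Weaker patterns produce r values between these extremes, and the flat "no correlation" case gives r near 0.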
4. Dot Diagram:
A dot diagram or dot plot is a statistical chart consisting of data points plotted on a
fairly simple scale, typically using filled-in circles.
As a representation of a distribution, the dot plot consists of a group of data points
plotted on a simple scale.
Dot plots are used for continuous, quantitative, univariate data.
Data points may be labelled if there are few of them.
Dot plots are one of the simplest statistical plots, and are suitable for small to moderate
sized data sets.
They are useful for highlighting clusters and gaps, as well as outliers.
Their other advantage is the conservation of numerical information.
5. General Concept of Regression:
Regression is the measure of the average relationship between two or more variables in
terms of the original units of the data.
Estimation via regression is called regression analysis.
In regression analysis two variables are involved: one is called the dependent variable
and the other the independent variable.
E.g. the yield of rice and rainfall are related: yield of rice is the dependent variable and
rainfall is the independent variable.
5.1. Definitions:
“Regression analysis attempts to establish the nature of the relationship between
variables, that is, to study the functional relationship between the variables and thereby
provide a mechanism for predicting or forecasting.” – Ya Lun Chou.
“Regression is the measure of the average relationship between two or more variables in
terms of the original units of the data” – Blair.
5.2. Regression Lines:
The graphic representation of regression is called regression line.
One variable is represented as X and the other one as Y.
In a simple linear regression, there are two regression lines constructed for the
relationship between two variables, say X and Y.
One regression line shows regression of X upon Y and the other shows the regression of
Y upon X.
When there is perfectly positive correlation (+1) or perfectly negative correlation (-1) the
two regression lines will coincide with each other i.e., there will be only one line.
If the regression lines are nearer to each other, then there is a higher degree of correlation.
If the two lines are farther away from each other, then there is lesser degree of
correlation.
If r = 0, the two variables are independent and there is no correlation, so the two
regression lines will cut each other at right angles.
5.3. Regression Coefficient & its Properties:
5.3.1. level-level model
The basic form of linear regression (without the residuals) is y = a + bx. In the formula, y
denotes the dependent variable and x the independent variable. For simplicity let's assume that it
is a univariate regression, but the principles hold for the multivariate case as well.
To put it into perspective, let's say that after fitting the model we obtain an intercept
a = 3 and a coefficient b = 5.
Intercept (a)
x is continuous and centered (by subtracting the mean of x from each observation, the
average of the transformed x becomes 0) — average y is 3 when x is equal to the sample mean.
x is continuous, but not centered — average y is 3 when x = 0.
x is categorical — average y is 3 when x = 0 (this time indicating a category, more on this
below).
Coefficient (b)
x is a continuous variable
Interpretation: a unit increase in x results in an increase in average y by 5 units, all other variables
held constant.
x is a categorical variable
This requires a bit more explanation. Let's say that x describes gender and can take values
('male', 'female'). Now let's convert it into a dummy variable which takes values 0 for males and
1 for females.
Interpretation: average y is higher by 5 units for females than for males, all other variables held
constant.
5.3.2. log-level model
Log denotes the natural logarithm
Typically we use log transformation to pull outlying data from a positively skewed distribution
closer to the bulk of the data, in order to make the variable normally distributed. In the case of
linear regression, one additional benefit of using the log transformation is interpretability.
(Figure: an example of a log transformation — the distribution before and after transforming.)
As before, let's say that the fitted model is log(y) = a + bx, with a = 3 and b = 0.01.
Intercept (a)
Interpretation is similar to the vanilla (level-level) case; however, we need to take the exponent
of the intercept for interpretation: exp(3) = 20.09. The difference is that this value stands for
the geometric mean of y (as opposed to the arithmetic mean in the case of the level-level model).
Coefficient (b)
The principles are again similar to the level-level model when it comes to interpreting
categorical/numeric variables. Analogically to the intercept, we need to take the exponent of the
coefficient: exp(b) = exp(0.01) = 1.01. This means that a unit increase in x causes a 1% increase
in average (geometric) y, all other variables held constant.
Two things worth mentioning here:
There is a rule of thumb for interpreting coefficients of such a model: if abs(b) < 0.15, it
is quite safe to say that b = 0.1 corresponds to roughly a 10% increase in y for a unit
change in x. For coefficients with larger absolute value, it is recommended to calculate the
exponent.
When dealing with variables in [0, 1] range (like a percentage) it is more convenient for
interpretation to first multiply the variable by 100 and then fit the model. This way the
interpretation is more intuitive, as we increase the variable by 1 percentage point instead of
100 percentage points (from 0 to 1 immediately).
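The log-level interpretations above can be checked with a short calculation (the coefficient values are the illustrative ones from the text):

```python
import math

# log-level model: log(y) = a + b*x, with the text's illustrative values
a, b = 3.0, 0.01

geometric_mean_y_at_x0 = math.exp(a)             # exp(3) ≈ 20.09
pct_change_per_unit_x = (math.exp(b) - 1) * 100  # exact % change in (geometric) y

print(round(geometric_mean_y_at_x0, 2))  # 20.09
print(round(pct_change_per_unit_x, 3))   # ≈ 1.005, close to the 1% rule of thumb

# For a larger coefficient the rule of thumb drifts: exp(0.5) - 1 ≈ 64.9%, not 50%,
# which is why the exponent should be computed when abs(b) is not small.
print(round((math.exp(0.5) - 1) * 100, 1))
```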
5.3.3. level-log model
Let's assume that the fitted model is y = a + b log(x), with the same intercept a = 3 and a
coefficient b = 5.
The interpretation of the intercept is the same as in the case of the level-level model.
For the coefficient b — a 1% increase in x results in an approximate increase in
average y of b/100 (0.05 in this case), all other variables held constant. To get the exact amount,
we would need to take b × log(1.01), which in this case gives 0.0498.
5.3.4. log-log model
Let's assume that the fitted model is log(y) = a + b log(x), with b = 5.
Once again focus on the interpretation of b. An increase in x by 1% results in a 5% increase in
average (geometric) y, all other variables held constant. To obtain the exact amount, we need to
take 1.01^b = 1.01^5 ≈ 1.051, i.e. about a 5.1% increase.
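The level-log and log-log arithmetic can be verified directly (b = 5 is the illustrative coefficient used in the text):

```python
import math

b = 5.0  # illustrative coefficient from the text

# level-log: y = a + b*log(x); a 1% increase in x changes average y by exactly
level_log_exact = b * math.log(1.01)   # ≈ 0.0498, close to the b/100 = 0.05 approximation

# log-log: log(y) = a + b*log(x); a 1% increase in x changes (geometric) average y by
log_log_exact = (1.01 ** b - 1) * 100  # ≈ 5.1%, close to the b = 5% approximation

print(round(level_log_exact, 4))  # 0.0498
print(round(log_log_exact, 2))    # 5.1
```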
6. Standard Error:
Standard error measures how far the sample mean is expected to deviate from the true
population mean.
Standard error is defined as the standard deviation of the sample divided by the
square root of the total number of observations.
Standard Error = SD / √N
SD = Standard Deviation
N = Total Number of Observations
Standard error is abbreviated as SE.
It is given in the same unit as the data.
6.1. Uses of Standard Error:
It helps to understand the difference between two samples.
It helps to calculate the size of the sample.
To determine whether the sample is drawn from a known population or not.
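The SE = SD / √N formula above can be sketched in a few lines (the sample values here are hypothetical):

```python
import math
import statistics

# Hypothetical sample of observations, purely for illustration
sample = [4.2, 5.1, 4.8, 5.5, 4.9, 5.0, 4.6, 5.3]

sd = statistics.stdev(sample)      # sample standard deviation (SD)
se = sd / math.sqrt(len(sample))   # Standard Error = SD / sqrt(N)

print(f"SD = {sd:.3f}, SE = {se:.3f}")
```

Because SE divides SD by √N, a larger sample always gives a smaller standard error for the same spread.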
Unit V: Statistical Hypothesis Testing
1. Making Assumption:
Statistical hypothesis testing requires several assumptions.
These assumptions include:
the level of measurement of the variable;
the method of sampling;
the shape of the population distribution;
the sample size.
The specific assumptions may vary, depending on the test or the conditions of testing.
However, without exception, all statistical tests assume random sampling.
For example, based on our data, we can test the hypothesis that the average price of gas in
California is higher than the average national price of gas. The test we are considering
meets these conditions:
The sample of California gas stations was randomly selected.
The variable price per gallon is measured at the interval-ratio level.
We cannot assume that the population is normally distributed, but the sample is large
enough for the sampling distribution of the mean to be approximately normal.
2. Statistical Hypotheses:
A statistical hypothesis is an assumption about a population parameter.
This assumption may or may not be true. Hypothesis testing refers to the formal
procedures used by statisticians to accept or reject statistical hypotheses.
The best way to determine whether a statistical hypothesis is true would be to examine
the entire population.
Since that is often impractical, researchers typically examine a random sample from the
population.
If sample data are not consistent with the statistical hypothesis, the hypothesis is rejected.
There are two types of statistical hypotheses:
2.1. Null hypothesis: The null hypothesis, denoted by Ho, is usually the hypothesis that sample
observations result purely from chance.
2.2. Alternative hypothesis: The alternative hypothesis, denoted by H1 or Ha, is the hypothesis
that sample observations are influenced by some non-random cause.
For example, suppose we wanted to determine whether a coin was fair and balanced. A null
hypothesis might be that half the flips would result in Heads and half, in Tails. The alternative
hypothesis might be that the number of Heads and Tails would be very different. Symbolically,
these hypotheses would be expressed as:
Ho: P = 0.5
Ha: P ≠ 0.5
Suppose we flipped the coin 50 times, resulting in 40 Heads and 10 Tails. Given this result, we
would be inclined to reject the null hypothesis. We would conclude, based on the evidence, that
the coin was probably not fair and balanced.
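The coin example can be made quantitative with a normal approximation to the binomial, a common large-sample approach (this sketch is an addition, not part of the original notes):

```python
import math

# Observed: 40 heads in 50 flips; H0: P = 0.5, Ha: P ≠ 0.5
n, heads = 50, 40
p_hat = heads / n
p0 = 0.5

# Normal approximation to the binomial under H0
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# Two-tailed p-value from the standard normal CDF
def phi(t):
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

p_value = 2 * (1 - phi(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.6f}")  # z ≈ 4.24: reject H0 at the 0.05 level
```

A z of about 4.24 gives a p-value far below 0.05, matching the intuition that 40 heads in 50 flips is strong evidence against a fair coin.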
3. Hypothesis Tests
Statisticians follow a formal process to determine whether to reject a null hypothesis, based on
sample data. This process, called hypothesis testing, consists of four steps.
State the hypotheses. This involves stating the null and alternative hypotheses. The
hypotheses are stated in such a way that they are mutually exclusive. That is, if one is
true, the other must be false.
Formulate an analysis plan. The analysis plan describes how to use sample data to
evaluate the null hypothesis. The evaluation often focuses around a single test statistic.
Analyze sample data. Find the value of the test statistic (mean score, proportion, t
statistic, z-score, etc.) described in the analysis plan.
Interpret results. Apply the decision rule described in the analysis plan. If the value of the
test statistic is unlikely, based on the null hypothesis, reject the null hypothesis.
4. Errors in Hypothesis Testing
Two types of errors can result from a hypothesis test:
Type I error. A Type I error occurs when the researcher rejects a null hypothesis when it
is true. The probability of committing a Type I error is called the significance level. This
probability is also called alpha, and is often denoted by α.
Type II error. A Type II error occurs when the researcher fails to reject a null hypothesis
that is false. The probability of committing a Type II error is called Beta, and is often
denoted by β. The probability of correctly rejecting a false null hypothesis, 1 − β, is
called the Power of the test.
5. Decision Making Rules
The analysis plan includes decision rules for rejecting the null hypothesis. In practice,
statisticians describe these decision rules in two ways - with reference to a P-value or with
reference to a region of acceptance.
P-value: The strength of evidence against the null hypothesis is measured by the P-
value. Suppose the test statistic is equal to S. The P-value is the probability of observing a
test statistic as extreme as S, assuming the null hypothesis is true. If the P-value is less
than the significance level, we reject the null hypothesis.
Region of acceptance: The region of acceptance is a range of values. If the test statistic
falls within the region of acceptance, the null hypothesis is not rejected. The region of
acceptance is defined so that the chance of making a Type I error is equal to the
significance level.
The set of values outside the region of acceptance is called the region of rejection. If the
test statistic falls within the region of rejection, the null hypothesis is rejected. In such
cases, we say that the hypothesis has been rejected at the α level of significance.
These approaches are equivalent. Some statistics texts use the P-value approach; others use the
region of acceptance approach.
6. One-Tailed and Two-Tailed Tests
A test of a statistical hypothesis, where the region of rejection is on only one side of the sampling
distribution, is called a one-tailed test. For example, suppose the null hypothesis states that the
mean is less than or equal to 10. The alternative hypothesis would be that the mean is greater
than 10. The region of rejection would consist of a range of numbers located on the right side of
sampling distribution; that is, a set of numbers greater than 10.
A test of a statistical hypothesis, where the region of rejection is on both sides of the sampling
distribution, is called a two-tailed test. For example, suppose the null hypothesis states that the
mean is equal to 10. The alternative hypothesis would be that the mean is less than 10 or greater
than 10. The region of rejection would consist of a range of numbers located on both sides of
sampling distribution; that is, the region of rejection would consist partly of numbers that were
less than 10 and partly of numbers that were greater than 10.
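The practical difference between one-tailed and two-tailed tests can be seen by computing both p-values for the same test statistic (the z value here is hypothetical):

```python
import math

def phi(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

z = 1.8  # hypothetical test statistic

one_tailed_p = 1 - phi(z)              # rejection region in the right tail only
two_tailed_p = 2 * (1 - phi(abs(z)))   # rejection region split between both tails

print(round(one_tailed_p, 4))  # ≈ 0.0359: significant at α = 0.05 one-tailed
print(round(two_tailed_p, 4))  # ≈ 0.0719: not significant at α = 0.05 two-tailed
```

The same statistic can therefore be significant one-tailed but not two-tailed, which is why the tails must be chosen before looking at the data.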
7. Confidence Interval:
A confidence interval quantifies the uncertainty associated with a particular statistic.
Confidence intervals are often reported together with a margin of error.
A confidence interval states how confident one can be that the results from a poll or
survey reflect what would be found if it were possible to survey the entire population.
Confidence intervals are intrinsically connected to confidence levels.
Confidence intervals consist of a range of potential values of the unknown population
parameter.
However, the interval computed from a particular sample does not necessarily include the
true value of the parameter.
Based on the (usually taken) assumption that observed data are random samples from a
true population, the confidence interval obtained from the data is also random.
The confidence level is designated prior to examining the data. Most commonly, the 95%
confidence level is used. However, other confidence levels can be used, for example,
90% and 99%.
Factors affecting the width of the confidence interval include the size of the sample, the
confidence level, and the variability in the sample.
A larger sample will tend to produce a better estimate of the population parameter, when
all other factors are equal.
A higher confidence level will tend to produce a broader confidence interval.
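To make the large-sample case concrete, here is a minimal sketch (with hypothetical data) of a 95% confidence interval for a mean, using the normal critical value 1.96 since n ≥ 30:

```python
import math
import statistics

# Hypothetical sample with n = 30 observations
sample = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.0,
          12.6, 11.9, 12.1, 12.2, 11.8, 12.3, 12.0, 12.1, 11.9, 12.4,
          12.2, 12.0, 11.8, 12.5, 12.1, 12.3, 11.9, 12.0, 12.2, 12.1]

mean = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean

z = 1.96  # critical value for a 95% confidence level
lower, upper = mean - z * se, mean + z * se

print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Raising the confidence level (e.g. z = 2.576 for 99%) widens the interval, while a larger sample shrinks the SE and narrows it, as described above.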
Unit VI: Test of Significance:
1. Steps in Testing Statistical Significance:
1. The first step is to specify the null hypothesis. For a two-tailed test, the null hypothesis is
typically that a parameter equals zero although there are exceptions. A typical null
hypothesis is μ1 - μ2 = 0 which is equivalent to μ1= μ2. For a one-tailed test, the null
hypothesis is either that a parameter is greater than or equal to zero or that a parameter is
less than or equal to zero. If the prediction is that μ1 is larger than μ2, then the null
hypothesis (the reverse of the prediction) is μ2 - μ1 ≥ 0. This is equivalent to μ1 ≤ μ2.
2. The second step is to specify the α level which is also known as the significance level.
Typical values are 0.05 and 0.01.
3. The third step is to compute the probability value (also known as the p value). This is the
probability of obtaining a sample statistic as different or more different from the
parameter specified in the null hypothesis given that the null hypothesis is true.
4. Finally, compare the probability value with the α level. If the probability value is lower
then you reject the null hypothesis. Keep in mind that rejecting the null hypothesis is not
an all-or-none decision. The lower the probability value, the more confidence you can
have that the null hypothesis is false. However, if your probability value is higher than
the conventional α level of 0.05, most scientists will consider your findings inconclusive.
Failure to reject the null hypothesis does not constitute support for the null hypothesis. It
just means you do not have sufficiently strong data to reject it.
2. Sampling Distribution of Mean and Standard Error:
The sampling distribution of a statistic is the distribution of that statistic, considered as
a random variable, when derived from a random sample of size n. It may be considered as the
distribution of the statistic for all possible samples from the same population of a given sample
size. The sampling distribution depends on the underlying distribution of the population, the
statistic being considered, the sampling procedure employed, and the sample size used. There is
often considerable interest in whether the sampling distribution can be approximated by
an asymptotic distribution, which corresponds to the limiting case either as the number of
random samples of finite size, taken from an infinite population and used to produce the
distribution, tends to infinity, or when just one equally-infinite-size "sample" is taken of that
same population.
2.1. Standard Error:
The standard error (SE) is very similar to the standard deviation. Both are measures of spread:
the higher the number, the more spread out the data. The two terms are closely related, but there
is one important difference — while the standard error is computed from statistics (sample data),
standard deviations use parameters (population data).
In statistics, you’ll come across terms like “the standard error of the mean” or “the standard error
of the median.” The SE tells you how far your sample statistic (like the sample mean) deviates
from the actual population mean. The larger your sample size, the smaller the SE. In other words,
the larger your sample size, the closer your sample mean is to the actual population mean.
2.2. SE Calculation:
How you find the standard error depends on which statistic you need; for example, the calculation
is different for a mean than for a proportion. The standard error of the mean uses the formula
s/√n, and analogous formulas exist for other statistics such as proportions.
2.3. Standard Error Formula:
The following tables show how to find the standard deviation of a statistic when the population
parameters are known (first table), and the standard error when the parameters must be estimated
from the sample (second table). Standard errors can be found for:
Sample mean.
Sample proportion.
Difference between means.
Difference between proportions.
Parameter (Population) — Formula for Standard Deviation:
Sample mean: σ / sqrt(n)
Sample proportion: sqrt[P(1 − P) / n]
Difference between means: sqrt[σ1²/n1 + σ2²/n2]
Difference between proportions: sqrt[P1(1 − P1)/n1 + P2(1 − P2)/n2]
Statistic (Sample) — Formula for Standard Error:
Sample mean: s / sqrt(n)
Sample proportion: sqrt[p(1 − p) / n]
Difference between means: sqrt[s1²/n1 + s2²/n2]
Difference between proportions: sqrt[p1(1 − p1)/n1 + p2(1 − p2)/n2]
Key for the above tables:
P = proportion of successes (population); p = proportion of successes (sample)
n = number of observations (sample); n1, n2 = sizes of samples 1 and 2
σ1², σ2² = variances of populations 1 and 2; s1², s2² = variances of samples 1 and 2
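The difference-of-means and difference-of-proportions formulas can be sketched directly (the summary statistics below are hypothetical):

```python
import math

# Hypothetical summary statistics for two samples
s1, n1 = 2.5, 40   # sample 1: standard deviation and size
s2, n2 = 3.1, 50   # sample 2: standard deviation and size

# SE of the difference between means: sqrt(s1^2/n1 + s2^2/n2)
se_diff_means = math.sqrt(s1**2 / n1 + s2**2 / n2)

# SE of the difference between proportions: sqrt(p1(1-p1)/n1 + p2(1-p2)/n2)
p1, p2 = 0.40, 0.55
se_diff_props = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

print(round(se_diff_means, 3))  # 0.590
print(round(se_diff_props, 3))  # 0.105
```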
2.4. Sampling Distribution of the Mean:
Definition: The Sampling Distribution of the Mean is the probability distribution of the sample
mean over all possible samples drawn from the population. If the population distribution is
normal, then the sampling distribution of the mean is normal for samples of all sizes.
Following are the main properties of the sampling distribution of the mean:
Its mean is equal to the population mean, thus: μx̄ = μ
(x̄ = sample mean, μ = population mean)
Its standard deviation equals the population standard deviation divided by the square root of the
sample size, thus: σx̄ = σ / √n
(σ = population standard deviation, n = sample size)
The sampling distribution of the mean is normally distributed. This means, the distribution of
sample means for a large sample size is normally distributed irrespective of the shape of the
universe, but provided the population standard deviation (σ) is finite. Generally, the sample
size 30 or more is considered large for the statistical purposes. If the population is normal,
then the distribution of sample means will be normal, irrespective of the sample size.
σx̄ is a measure of the precision with which the sample mean can be used to estimate the true
value of the population mean. σx̄ varies in direct proportion to the variation in the original
population and inversely with the square root of the sample size n. Thus, the greater the
variation among the original items of the population, the greater the variation expected in the
sampling error when using x̄ as an estimate of μ. Note that the larger the sample size, the
smaller the standard error, and vice versa.
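The relation σx̄ = σ/√n can be checked by simulation, even for a non-normal population (an exponential distribution is used here as an arbitrary skewed example):

```python
import random
import statistics

# Simulate the sampling distribution of the mean from a skewed population:
# exponential(1), which has population standard deviation σ = 1.
random.seed(1)
population_sd = 1.0
n = 36  # sample size, large enough for the normal approximation

sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(5000)
]

observed_se = statistics.stdev(sample_means)  # spread of the simulated means
theoretical_se = population_sd / n ** 0.5     # σ / sqrt(n) = 1/6

print(round(observed_se, 3), round(theoretical_se, 3))  # both close to 0.167
```

The simulated spread of the sample means matches σ/√n closely, illustrating the property stated above without assuming a normal population.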
3. Large Sample Tests:
Some researchers choose to increase their sample size if they have an effect which is
almost within significance level.
This is done because the researcher suspects that he is short of samples, rather than that
there is no effect. We need to be careful using this method, as it increases the chance of
creating a false positive result.
With a higher sample size, the likelihood of a Type-II error is reduced (while the Type-I
error rate remains fixed at the chosen α), at least if the other parts of the study are
carefully constructed and problems avoided.
A higher sample size allows the researcher to detect smaller effects at a given significance
level, since confidence in the result is likely to increase with sample size.
This is to be expected, because the larger the sample, the more accurately it is expected
to mirror the behavior of the whole group.
Therefore, if you want to be able to reject your null hypothesis, you should make sure your
sample size is at least equal to the sample size needed for the chosen statistical
significance and expected effect size.
4. Z- Test:
A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution.
Because of the central limit theorem, many test statistics are approximately normally
distributed for large samples.
For each significance level, the Z-test has a single critical value (for example, 1.96 for
5% two tailed) which makes it more convenient than the Student's t-test which has
separate critical values for each sample size.
Therefore, many statistical tests can be conveniently performed as approximate Z-tests if
the sample size is large or the population variance is known.
If the population variance is unknown (and therefore has to be estimated from the sample
itself) and the sample size is not large (n < 30), the Student's t-test may be more
appropriate.
If T is a statistic that is approximately normally distributed under the null hypothesis, the
next step in performing a Z-test is to estimate the expected value θ of T under the null
hypothesis, and then obtain an estimate s of the standard deviation of T.
After that the standard score Z = (T − θ) / s is calculated, from which one-tailed and two-
tailed p-values can be calculated as Φ(−Z) (for upper-tailed tests), Φ(Z) (for lower-tailed
tests) and 2Φ(−|Z|) (for two-tailed tests) where Φ is the standard normal cumulative
distribution function.
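The Z-test procedure above can be sketched as a small function (the sample summary values at the end are hypothetical):

```python
import math

def z_test(sample_mean, pop_mean, pop_sd, n):
    """One-sample Z-test; returns the standard score Z and the two-tailed p-value."""
    se = pop_sd / math.sqrt(n)            # standard deviation of the sample mean
    z = (sample_mean - pop_mean) / se     # standard score Z = (T − θ) / s
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # Φ(|Z|)
    return z, 2 * (1 - phi)               # two-tailed p-value, 2Φ(−|Z|)

# Hypothetical example: sample of n = 64 with mean 52, H0: μ = 50, known σ = 8
z, p = z_test(52, 50, 8, 64)
print(f"Z = {z:.2f}, p = {p:.4f}")  # Z = 2.00, p ≈ 0.0455: reject at α = 0.05
```

Note the Z-test requires a known population σ (or a large sample); otherwise the t-test described next is the appropriate choice.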
5. T- Test:
When the difference between two population averages is being investigated, a t test is
used.
In other words, a t test is used when we wish to compare two means (the scores must be
measured on an interval or ratio measurement scale). For example, we would use a t test
if we wished to compare the reading achievement of boys and girls.
With a t test, we have one independent variable and one dependent variable. The
independent variable (gender in this case) can only have two levels (male and female).
The dependent variable would be reading achievement. If the independent had more than
two levels, then we would use a one-way analysis of variance (ANOVA).
The test statistic that a t test produces is a t-value. Conceptually, t-values are an extension
of z-scores. In a way, the t-value represents how many standard units the means of the
two groups are apart.
With a t test, the researcher wants to state with some degree of confidence that the
obtained difference between the means of the sample groups is too great to be a chance
event and that some difference also exists in the population from which the sample was
drawn.
In other words, the difference that we might find between the boys’ and girls’ reading
achievement in our sample might have occurred by chance, or it might exist in the
population.
If our t test produces a t-value that results in a probability of .01, we say that the
likelihood of getting the difference we found by chance would be 1 in 100.
We could say that it is unlikely that our results occurred by chance and the difference we
found in the sample probably exists in the populations from which it was drawn.
5.1. Paired and Unpaired T- test:
T-tests are useful for comparing the means of two samples. There are two types: paired
and unpaired.
Paired means that both samples consist of the same test subjects. A paired t-test is
equivalent to a one-sample t-test performed on the within-pair differences.
Unpaired means that both samples consist of distinct test subjects. An unpaired t-test is
equivalent to a two-sample t-test.
For example, if you wanted to conduct an experiment to see how drinking an energy
drink increases heart rate, you could do it two ways.
The "paired" way would be to measure the heart rate of 10 people before they drink the
energy drink and then measure the heart rate of the same 10 people after drinking the
energy drink. These two samples consist of the same test subjects, so you would perform
a paired t-test on the means of both samples.
The "unpaired" way would be to measure the heart rate of 10 people before drinking an
energy drink and then measure the heart rate of some other group of people who have
drunk energy drinks. These two samples consist of different test subjects, so you would
perform an unpaired t-test on the means of both samples.
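The paired case reduces to a one-sample test on the differences, which can be computed by hand (the heart-rate values below are hypothetical):

```python
import math
import statistics

# Hypothetical heart rates (bpm) for 10 people before and after an energy drink
before = [68, 72, 65, 70, 74, 66, 71, 69, 73, 67]
after  = [74, 75, 70, 76, 78, 69, 77, 73, 79, 70]

# A paired t-test is a one-sample t-test on the within-subject differences
diffs = [a - b for a, b in zip(after, before)]
mean_d = statistics.fmean(diffs)
sd_d = statistics.stdev(diffs)
n = len(diffs)

t = mean_d / (sd_d / math.sqrt(n))  # t-statistic with n - 1 = 9 degrees of freedom

print(f"mean difference = {mean_d:.1f} bpm, t = {t:.2f}")
```

The resulting t would then be compared against the t-distribution with n − 1 degrees of freedom to obtain a p-value.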
6. Parametric and Non parametric tests:
6.1. Definition of Parametric Test
The parametric test is a hypothesis test which provides generalizations for making statements
about the mean of the parent population. A t-test, based on Student's t-statistic, is often used
in this regard.
The t-statistic rests on the underlying assumptions that the variable is normally distributed and
the mean is known or assumed to be known. The population variance is estimated from the
sample. It is assumed that the variables of interest in the population are measured on an
interval scale.
6.2. Definition of Nonparametric Test
The nonparametric test is defined as the hypothesis test which is not based on underlying
assumptions, i.e. it does not require population’s distribution to be denoted by specific
parameters.
The test is mainly based on differences in medians. Hence, it is alternately known as the
distribution-free test. The test assumes that the variables are measured on a nominal or ordinal
level. It is used when the independent variables are non-metric.
In statistics, the Mann–Whitney U test is a nonparametric test of the null hypothesis that it is
equally likely that a randomly selected value from one sample will be less than or greater than a
randomly selected value from a second sample.
Key Differences between Parametric and Non-parametric Tests
The fundamental differences between parametric and nonparametric tests are discussed in the
following points:
1. A statistical test, in which specific assumptions are made about the population parameter,
is known as the parametric test. A statistical test used in the case of non-metric
independent variables is called nonparametric test.
2. In the parametric test, the test statistic is based on distribution. On the other hand, the test
statistic is arbitrary in the case of the nonparametric test.
3. In the parametric test, it is assumed that the variables of interest are measured on an
interval or ratio level, as opposed to the nonparametric test, wherein the variables of
interest are measured on a nominal or ordinal scale.
4. In general, the measure of central tendency in the parametric test is the mean, while in
the case of the nonparametric test it is the median.
5. In the parametric test, there is complete information about the population. Conversely, in
the nonparametric test, there is no information about the population.
6. The applicability of parametric test is for variables only, whereas nonparametric test
applies to both variables and attributes.
7. For measuring the degree of association between two quantitative variables, Pearson's
coefficient of correlation is used in the parametric test, while Spearman's rank correlation
is used in the nonparametric test.
7. Chi Square Test:
A chi-squared test, also written as χ² test, is any statistical hypothesis test where
the sampling distribution of the test statistic is a chi-squared distribution when the null
hypothesis is true. Without other qualification, 'chi-squared test' is often used as short
for Pearson's chi-squared test.
The chi-squared test is used to determine whether there is a significant difference
between the expected frequencies and the observed frequencies in one or more
categories.
In the standard applications of this test, the observations are classified into mutually
exclusive classes, and there is some theory, or say null hypothesis, which gives the
probability that any observation falls into the corresponding class.
The purpose of the test is to evaluate how likely the observations that are made would be,
assuming the null hypothesis is true.
Chi-squared tests are often constructed from a sum of squared errors, or through
the sample variance.
Test statistics that follow a chi-squared distribution arise from an assumption of
independent normally distributed data, which is valid in many cases due to the central
limit theorem.
A chi-squared test can be used to attempt rejection of the null hypothesis that the data are
independent.
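The observed-versus-expected comparison can be sketched as a goodness-of-fit calculation (the die-roll counts below are hypothetical):

```python
# Chi-square goodness-of-fit: do 60 die rolls fit the fair-die hypothesis?
observed = [5, 8, 9, 8, 10, 20]   # hypothetical counts for faces 1..6
expected = [10] * 6               # fair die: 60 rolls / 6 faces

# Pearson's statistic: sum of (observed - expected)^2 / expected over all classes
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The critical value for α = 0.05 with 6 - 1 = 5 degrees of freedom is 11.07
print(f"chi-square = {chi2:.1f}")  # 13.4 > 11.07, so reject the fair-die hypothesis
```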
Unit VII: Experimental Designs
1. Principles of Experimental Design:
The basic principles of experimental design are (i) Randomization, (ii) Replication and
(iii) Local Control.
1.1. Randomization:
Randomization is the cornerstone underlying the use of statistical methods in experimental
designs. Randomization is the random process of assigning treatments to the experimental units.
The random process implies that every possible allotment of treatments has the same probability.
For example, if the number of treatments is t = 3 (say A, B, and C) and the number of
replications is r = 4, then the number of experimental units is n = t × r = 3 × 4 = 12.
Replication means that each treatment will appear 4 times, as r = 4. Let the design be
A C B C
C B A B
A C B A
Note from the design that elements 1, 7, 9, and 12 are reserved for Treatment A; elements 3, 6, 8,
and 11 for Treatment B; and elements 2, 4, 5, and 10 for Treatment C. P(A) = 4/12, P(B) = 4/12,
and P(C) = 4/12, meaning that Treatments A, B, and C have equal chances of selection.
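A random allotment like the one above can be generated by shuffling, which gives every possible arrangement the same probability (the seed here is arbitrary):

```python
import random

# Randomly assign t = 3 treatments, each replicated r = 4 times, to 12 units
treatments = ["A", "B", "C"]
r = 4

layout = treatments * r   # every treatment appears r times (replication)
random.seed(42)
random.shuffle(layout)    # randomization: every allotment is equally probable

print(layout)
print({t: layout.count(t) for t in treatments})  # each treatment count is 4
```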
1.2. Replication:
The second principle of an experimental design is replication, which is a repetition of the
basic experiment. In other words, it is a complete run for all the treatments to be tested in the
experiment. In all experiments, some kind of variation is introduced because of the fact that
the experimental units such as individuals or plots of land in agricultural experiments cannot
be physically identical. This type of variation can be removed by using a number of
experimental units. We therefore perform the experiment more than once, i.e., we repeat the
basic experiment. An individual repetition is called a replicate. The number, the shape and
the size of replicates depend upon the nature of the experimental material. A replication is
used to:
(i) Secure a more accurate estimate of the experimental error, a term which represents the
differences that would be observed if the same treatments were applied several times to the
same experimental units;
(ii) Decrease the experimental error and thereby increase precision, which is a measure of the
variability of the experimental error.
1.3. Local Control:
It has been observed that randomization and replication do not remove all extraneous sources of
variation.
Thus we need a refinement in the experimental technique. In other words, we need to choose a
design in such a way that all extraneous sources of variation are brought under control. For this
purpose we make use of local control, a term referring to the amount of (i) balancing, (ii)
blocking and (iii) grouping of experimental units.
Balancing: Balancing means that the treatment should be assigned to the experimental units in
such a way that the result is a balanced arrangement of treatment.
Blocking: Blocking means that like experimental units should be collected together to form
relatively homogeneous groups. A block is also a replicate.
The main objective/ purpose of local control is to increase the efficiency of experimental design
by decreasing the experimental error.
2. Longitudinal Study:
A longitudinal study (or longitudinal survey, or panel study) is a research design that
involves repeated observations of the same variables (e.g., people) over short or long
periods of time (i.e., uses longitudinal data).
It is often a type of observational study, although they can also be structured as
longitudinal randomized experiments.
Longitudinal studies are often used in social-personality and clinical psychology, to study
rapid fluctuations in behaviors, thoughts, and emotions from moment to moment or day
to day; in developmental psychology, to study developmental trends across the life span.
Longitudinal studies can be retrospective (looking back in time, thus using existing data
such as medical records or claims database) or prospective (requiring the collection of
new data).
3. Cross Sectional Study:
In medical research and social science, a cross-sectional study (also known as a cross-sectional
analysis, transverse study, or prevalence study) is a type of observational study
that analyzes data from a population, or a representative subset, at a specific point in
time—that is, cross-sectional data.
In medical research, cross-sectional studies differ from case-control studies in that they
aim to provide data on the entire population under study, whereas case-control studies
typically include only individuals with a specific characteristic, with a sample, often a
tiny minority, of the rest of the population.
Cross-sectional studies are descriptive studies (neither longitudinal nor experimental).
Such a study may be used to describe some feature of the population, such as the prevalence
of an illness, or it may support inferences of cause and effect.
Longitudinal studies differ from both in making a series of observations more than once
on members of the study population over a period of time.
4. Prospective and Retrospective Study:
4.1. Prospective study
It is an epidemiologic study in which the groups of individuals (cohorts) are selected
on the basis of factors that are to be examined for possible effects on some outcome.
For example, the effect of exposure to a specific risk factor on the eventual development
of a particular disease can be studied.
The cohorts are then followed over a period of time to determine the incidence rates of the
outcomes being studied as they relate to the original factors. It is also called a cohort study.
The term prospective usually implies a cohort selected in the present and followed into
the future, but this method can also be applied to existing longitudinal historical data,
such as insurance or medical records.
A cohort is identified and classified as to exposure to the risk factor at some date in the
past and followed up to the present to determine incidence rates. This is called a
historical prospective study, prospective study of past data, or retrospective cohort study.
4.2. Retrospective study:
It is an epidemiologic study in which participating individuals are classified
as either having some outcome (cases) or lacking it (controls).
The outcome may be a specific disease, and the persons' histories are examined for
specific factors that might be associated with that outcome.
Cases and controls are often matched with respect to certain demographic or other
variables but need not be.
As compared to prospective studies, retrospective studies suffer from drawbacks: certain
important statistics cannot be measured, and large biases may be introduced both in the
selection of controls and in the recall of past exposure to risk factors.
The advantage of the retrospective study is its small scale, usually short time for
completion, and its applicability to rare diseases, which would require study of very large
cohorts in prospective studies.
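The contrast between the two designs can be sketched numerically. In the sketch below, the 2×2 counts are hypothetical (not from the text): a cohort study measures incidence directly and so yields a risk ratio, whereas a case-control study cannot measure incidence and relies on the odds ratio instead.

```python
# Hypothetical 2x2 table: exposure vs. disease outcome.
#                 Disease   No disease
# Exposed         a = 30    b = 70
# Unexposed       c = 10    d = 90
a, b, c, d = 30, 70, 10, 90

# Prospective (cohort) design: incidence is observed directly,
# so the risk ratio (relative risk) can be computed.
risk_exposed = a / (a + b)      # 0.30
risk_unexposed = c / (c + d)    # 0.10
risk_ratio = risk_exposed / risk_unexposed  # about 3.0

# Retrospective (case-control) design: incidence is unavailable,
# so the odds ratio is used as the measure of association.
odds_ratio = (a * d) / (b * c)  # about 3.86
```

For rare diseases the odds ratio closely approximates the risk ratio, which is part of why the case-control design is workable despite not measuring incidence.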
5. Randomized Block:
The blocks method was introduced by S. Bernstein.
In the statistical theory of the design of experiments, blocking is the arranging
of experimental units in groups (blocks) that are similar to one another.
Typically, a blocking factor is a source of variability that is not of primary interest to the
experimenter.
In Probability Theory the blocks method consists of splitting a sample into blocks
(groups) separated by smaller sub-blocks so that the blocks can be considered almost
independent.
The blocks method helps proving limit theorems in the case of dependent random
variables.
Example:

               Treatment
Gender     Placebo   Vaccine
Male         250       250
Female       250       250
Subjects are assigned to blocks, based on gender. Then, within each block, subjects are randomly
assigned to treatments (either a placebo or a cold vaccine). For this design, 250 men get the
placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women get the vaccine.
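The gender-blocked assignment above can be sketched in a few lines of Python. The function name, subject identifiers, and seed below are illustrative assumptions; the point is only that shuffling happens inside each block, so every block stays balanced across arms.

```python
import random

def block_randomize(units, block_of, treatments, seed=42):
    """Group units into blocks, shuffle within each block, then
    cycle through treatments so every block is balanced across arms."""
    rng = random.Random(seed)
    blocks = {}
    for u in units:
        blocks.setdefault(block_of(u), []).append(u)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # random order within the block
        for i, u in enumerate(members):
            assignment[u] = treatments[i % len(treatments)]
    return assignment

# Hypothetical roster: ids 0-499 are men, ids 500-999 are women.
assign = block_randomize(range(1000),
                         block_of=lambda u: "male" if u < 500 else "female",
                         treatments=["placebo", "vaccine"])
# Each gender block of 500 ends up with 250 subjects per arm.
```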
6. Simple Factorial Design:
Factorial design is one of the many experimental designs used in psychological
experiments where two or more independent variables are simultaneously manipulated to
observe their effects on the dependent variables.
A simple factorial design is an experimental design in which two or more levels of each
independent variable are observed in combination with the levels of the other variables.
Example:
A university wants to assess the starting salaries of their MBA graduates. The study looks at
graduates working in four different employment areas: accounting, management, finance, and
marketing. In addition to looking at the employment sector, the researchers also look at gender.
In this example, the employment sector and gender of the graduates are the independent
variables, and the starting salaries are the dependent variables. This would be considered a 4×2
factorial design.
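The cells of such a design are just the combinations of factor levels, which can be enumerated directly (the level labels below restate the example above):

```python
from itertools import product

# The 4x2 MBA-salary example: 4 levels of sector, 2 levels of gender.
sector = ["accounting", "management", "finance", "marketing"]
gender = ["male", "female"]

# Every combination of levels is one cell of the factorial design.
cells = list(product(sector, gender))
print(len(cells))  # 8 cells in a 4x2 design
```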
7. Analysis of Variance (ANOVA):
Analysis of Variance (ANOVA) is a statistical method used to test differences between
two or more means. It may seem odd that the technique is called “Analysis of Variance”
rather than “Analysis of Means.”
However, the name is appropriate because inferences about means are made by
analyzing variances. ANOVA is used to test general rather than specific differences
among means.
An ANOVA conducted on a design in which there is only one factor is called a ONE-
WAY ANOVA.
If an experiment has two factors, then the ANOVA is called a TWO-WAY ANOVA.
Example: Suppose an experiment on the effects of age and gender on reading speed were
conducted using three age groups (8 years, 10 years, and 12 years) and the two genders (male
and female). The factors would be age and gender. Age would have three levels and gender
would have two levels.
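The core computation behind a one-way ANOVA can be written out from first principles: partition variability into between-group and within-group parts and form their ratio. The reading-speed scores below are made up for illustration.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: mean square between groups
    divided by mean square within groups."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares (k - 1 degrees of freedom)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares (n - k degrees of freedom)
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical reading-speed scores for three age groups
f = one_way_anova_f([[10, 12, 11], [14, 15, 16], [20, 18, 19]])
```

A large F indicates that the group means differ by more than the within-group scatter would explain; the p-value would then come from the F distribution with (k − 1, n − k) degrees of freedom.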
8. Analysis of RBD:
A Reliability Block Diagram (RBD) is a graphical representation of the components of
a system and how they are related from a reliability standpoint.
The diagram represents the functioning state (i.e., success or failure) of the system in
terms of the functioning states of its components.
For example, a simple series configuration indicates that all of the components must
operate for the system to operate; a simple parallel configuration indicates that at least
one of the components must operate, and so on.
When we define the reliability characteristics of each component, we can use software to
calculate the reliability function for the entire system and obtain a wide variety of system
reliability analysis results, including the ability to identify critical components and
calculate the optimum reliability allocation strategy to meet a system reliability goal.
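The series and parallel configurations mentioned above follow directly from basic probability, assuming independent components (the reliability values below are hypothetical):

```python
from math import prod

def series_reliability(rs):
    # Series: all components must work, so multiply the reliabilities.
    return prod(rs)

def parallel_reliability(rs):
    # Parallel: at least one must work, so subtract the probability
    # that every component fails.
    return 1 - prod(1 - r for r in rs)

rs = [0.9, 0.9, 0.9]           # three components, each 90% reliable
series_reliability(rs)          # about 0.729
parallel_reliability(rs)        # about 0.999
```

Redundancy (the parallel case) raises system reliability above any single component's, while a series chain is always weaker than its weakest link.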
9. Meta-analysis:
A meta-analysis is a statistical analysis that combines the results of multiple scientific
studies.
Meta-analysis can be performed when there are multiple scientific studies addressing the
same question, with each individual study reporting measurements that are expected to
have some degree of error.
The aim then is to use approaches from statistics to derive a pooled estimate closest to the
unknown common truth based on how this error is perceived.
Existing methods for meta-analysis yield a weighted average from the results of the
individual studies, and what differs is the manner in which these weights are allocated
and also the manner in which the uncertainty is computed around the point estimate thus
generated.
In addition to providing an estimate of the unknown common truth, meta-analysis has the
capacity to contrast results from different studies and identify patterns among study
results, sources of disagreement among those results, or other interesting relationships
that may come to light in the context of multiple studies.
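The "weighted average" described above can be sketched for the common fixed-effect case, where each study is weighted by the inverse of its variance (the effect sizes and variances below are hypothetical):

```python
def inverse_variance_pool(estimates, variances):
    """Fixed-effect pooled estimate: weight each study by 1/variance,
    so more precise studies count for more."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_variance = 1 / sum(weights)  # uncertainty around the estimate
    return pooled, pooled_variance

# Hypothetical effect sizes and variances from three studies
pooled, var = inverse_variance_pool([0.4, 0.6, 0.5], [0.04, 0.02, 0.08])
```

Other weighting schemes (e.g., random-effects models, which add a between-study variance component) differ in exactly the two ways the text notes: how the weights are allocated and how the uncertainty around the pooled estimate is computed.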
10. Systematic Review:
Systematic reviews are a type of literature review that uses systematic methods to collect
secondary data, critically appraise research studies, and synthesize findings qualitatively
or quantitatively.
Systematic reviews formulate research questions that are broad or narrow in scope, and
identify and synthesize studies that directly relate to the systematic review question.
They are designed to provide a complete, exhaustive summary of current evidence
relevant to a research question.
For example, systematic reviews of randomized controlled trials are key to the practice
of evidence-based medicine, and a review of existing studies is often quicker and cheaper
than embarking on a new study.
While systematic reviews are often applied in the biomedical or healthcare context, they
can be used in other areas where an assessment of a precisely defined subject would be
helpful.
Systematic reviews may examine clinical tests, public health interventions,
environmental interventions, social interventions, adverse effects, and economic
evaluations.
11. Ethics in Statistics:
Good statistical practice is fundamentally based on transparent assumptions, reproducible results,
and valid interpretations. In some situations, guideline principles may conflict, requiring
individuals to prioritize principles according to context. However, in all cases, stakeholders have
an obligation to act in good faith, to act in a manner that is consistent with these guidelines, and
to encourage others to do the same. Above all, professionalism in statistical practice presumes
the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical
ends is inherently unethical.
Ethical statistical practice does not include, promote, or tolerate any type of professional or
scientific misconduct, including, but not limited to, bullying, sexual or other harassment,
discrimination based on personal characteristics, or other forms of intimidation.
A. Professional Integrity and Accountability: The ethical statistician uses methodology and
data that are relevant and appropriate; without favoritism or prejudice; and in a manner intended
to produce valid, interpretable, and reproducible results. The ethical statistician does not
knowingly accept work for which he/she is not sufficiently qualified, is honest with the client
about any limitation of expertise, and consults other statisticians when necessary or in doubt. It is
essential that statisticians treat others with respect.
The ethical statistician:
1. Identifies and mitigates any preferences on the part of the investigators or data providers that
might predetermine or influence the analyses/results.
2. Employs selection or sampling methods and analytic approaches appropriate and valid for the
specific question to be addressed, so that results extend beyond the sample to a population
relevant to the objectives with minimal error under reasonable assumptions.
3. Respects and acknowledges the contributions and intellectual property of others.
4. When establishing authorship order for posters, papers, and other scholarship, strives to make
clear the basis for this order, if determined on grounds other than intellectual contribution.
5. Discloses conflicts of interest, financial and otherwise, and manages or resolves them
according to established (institutional/regional/local) rules and laws.
6. Accepts full responsibility for his/her professional performance. Provides only expert
testimony, written work, and oral presentations that he/she would be willing to have peer
reviewed.
7. Exhibits respect for others and, thus, neither engages in nor condones discrimination based on
personal characteristics; bullying; unwelcome physical, including sexual, contact; or other forms
of harassment or intimidation, and takes appropriate action when aware of such unethical
practices by others.
B. Integrity of data and methods: The ethical statistician is candid about any known or
suspected limitations, defects, or biases in the data that may affect the integrity or reliability of
the statistical analysis. Objective and valid interpretation of the results requires that the
underlying analysis recognizes and acknowledges the degree of reliability and integrity of the
data.
The ethical statistician:
1. Acknowledges statistical and substantive assumptions made in the execution and interpretation
of any analysis. When reporting on the validity of data used, acknowledges data editing
procedures, including any imputation and missing data mechanisms.
2. Reports the limitations of statistical inference and possible sources of error.
3. In publications, reports, or testimony, identifies who is responsible for the statistical work if it
would not otherwise be apparent.
4. Reports the sources and assessed adequacy of the data, accounts for all data considered in a
study, and explains the sample(s) actually used.
5. Clearly and fully reports the steps taken to preserve data integrity and valid results.
6. Where appropriate, addresses potential confounding variables not included in the study.
7. In publications and reports, conveys the findings in ways that are both honest and meaningful
to the user/reader. This includes tables, models, and graphics.
8. In publications or testimony, identifies the ultimate financial sponsor of the study, the stated
purpose, and the intended use of the study results.
9. When reporting analyses of volunteer data or other data that may not be representative of a
defined population, includes appropriate disclaimers and, if used, appropriate weighting.
10. To aid peer review and replication, shares the data used in the analyses whenever
possible/allowable and exercises due caution to protect proprietary and confidential data,
including all data that might inappropriately reveal respondent identities.
11. Strives to promptly correct any errors discovered while producing the final report or after
publication. As appropriate, disseminates the correction publicly or to others relying on the
results.
C. Responsibilities to Science/Public/Funder/Client: The ethical statistician supports valid
inferences, transparency, and good science in general, keeping the interests of the public, funder,
client, or customer in mind (as well as professional colleagues, patients, the public, and the
scientific community).
The ethical statistician:
1. To the extent possible, presents a client or employer with choices among valid alternative
statistical approaches that may vary in scope, cost, or precision.
2. Strives to explain any expected adverse consequences of failure to follow through on an
agreed-upon sampling or analytic plan.
3. Applies statistical sampling and analysis procedures scientifically, without predetermining the
outcome.
4. Strives to make new statistical knowledge widely available to provide benefits to society at
large and beyond his/her own scope of applications.
5. Understands and conforms to confidentiality requirements of data collection, release, and
dissemination and any restrictions on its use established by the data provider (to the extent
legally required), protecting use and disclosure of data accordingly. Guards privileged
information of the employer, client, or funder.
D. Responsibilities to Research Subjects: The ethical statistician protects and respects the
rights and interests of human and animal subjects at all stages of their involvement in a project.
This includes respondents to the census or to surveys, those whose data are contained in
administrative records and subjects of physically or psychologically invasive research.
The ethical statistician: