Introductory Statistics Brian Wells, MSM, MPH With contributions from I. Alan Fein, MD, MPH
“It’s all about observing what you see.” – Yogi Berra, great American philosopher. Intro to Statistics
Statistics – What’s It All About? The hunt for the truth. Finding relationships and causality Separating the wheat from the chaff [lots of chaff out there!] A process of critical thinking and an analytic approach to the literature and research It can help you avoid contracting the dreaded “data rich but information poor” syndrome!
Statistics and the “Truth” Can we ever know the truth? Statistics is a way of telling us the likelihood that we have arrived at the truth of a matter [or not!].
Statistics – Based on Probability and Multiple Assumptions An approach to searching for the truth which recognizes that there is little, if anything, which is concrete or dichotomous Employs quantitative concepts like “confidence”, “reliability”, and “significance” to get at the “truth”
Statistics & the “Truth” Two kinds of statistics: Descriptive – describes “what is.” Experimental – makes a point – tries to “prove” something. Problem: it is almost impossible to prove something, but much easier to disprove. Thus: the null hypothesis H₀ – i.e., there is no difference.
Types of Statistics Descriptive: describe / communicate what you see without any attempt at generalizing beyond the sample at hand. Inferential: determine the likelihood that observed results: Can be generalized to larger populations. Occurred by chance rather than as a result of known, specified interventions [H₀].
The Null Hypothesis A hypothesis which is tested for possible rejection under the assumption that it is true (usually that observations are the result of chance).  Experimental stats works to disprove the null hypothesis, to show that the null hypothesis is wrong, that a difference exists. [e.g., glucose levels and diabetics] In other words, that you have found or discovered something new. The null hypothesis usually represents the opposite of what the researcher may believe to be true.
Normality When variability in data points is due to the sum of numerous independent sources, with no one source dominating, the result should be a normal, or Gaussian, distribution (named for Carl Friedrich Gauss). Note: technically, true Gaussian distributions do not occur in nature, since a Gaussian distribution extends infinitely in both directions. Bell-shaped curves are the norm for biological data, with end points to the right and left of the mean. Bell curve vs. normal distribution vs. Gaussian distribution: “normal distribution” is an unfortunate name because it encourages the fallacy that many or all probability distributions are “normal.”
Why Is the Normal Distribution Important? It provides the fundamental mathematical substrate permitting the use and calculation of most statistical analyses. The normal distribution represents one of the empirically verified elementary “truths about the general nature of reality,” and its status can be compared with that of the fundamental laws of the natural sciences. Its exact shape is defined by a function with only two parameters: the mean and the standard deviation.
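As a sketch of the point above – that the curve is fully determined by just the mean and standard deviation – the density can be written in a few lines of plain Python (the function name and values here are illustrative, not from the slides):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal (Gaussian) density: completely specified by mu and sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The curve peaks at the mean and is symmetric around it.
peak = normal_pdf(0.0)                      # standard normal: ~0.3989
left, right = normal_pdf(-1.0), normal_pdf(1.0)  # equal, by symmetry
```

Changing mu slides the curve along the axis; changing sigma stretches or narrows it – no other knobs exist.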
Gaussian Distribution
Why Examine the Data? Means are the most commonly used measures of populations Amenable to mathematical manipulation Handy measure of central tendency BUT – can be misleading Examine the curve, median, mode, range, and outliers Look for bi- or multi-modal distributions
Beware small samples! Fancy statistical tests can bury the truth here! Conversely, not finding any difference in a small sample means nothing – the study may simply lack the power to detect one. Normality testing is pointless when there are fewer than 20–30 data points – it can be misleading.
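The small-sample warning can be made concrete with a quick simulation (illustrative code, not from the slides): drawing from a skewed population whose true mean is 1, sample means from n = 5 scatter far more widely than means from n = 100.

```python
import random
import statistics

random.seed(1)

def mean_spread(n, trials=2000):
    """SD of sample means drawn from an exponential(1) population (true mean 1)."""
    means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means)

spread_small = mean_spread(5)     # tiny samples: estimates bounce around
spread_large = mean_spread(100)   # larger samples: much tighter
```

With n = 5 the spread is roughly 1/√5 ≈ 0.45 versus 1/√100 = 0.10 for n = 100, so a "no difference" finding at n = 5 says very little.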
Variability of Measurements Unbiasedness: the tendency to arrive at the true or correct value. Precision: the degree of spread of a series of observations [repeatability; also referred to as reliability]. Can also refer to the number of decimal points – which can be misleading. Accuracy: encompasses both unbiasedness and precision. Accurate measurements are both unbiased and precise; inaccurate measurements may be biased, imprecise, or both.
The Mean: μ vs. X̄ A potentially dangerous and misleading value. μ = true population mean. X̄ = mean of the sample.
Standard Deviation SD is a measure of scatter or dispersion of data around the mean ~68% of values fall within 1 SD of the mean [+/- one Z] ~95% of values fall within 2 SD of the mean, with ~2.5% in each tail.
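The 68% / 95% rule above can be checked by simulation (an illustrative sketch; the mean of 100 and SD of 15 are arbitrary example values, not from the slides):

```python
import random

random.seed(0)
# Simulate a normal population with mean 100 and SD 15
data = [random.gauss(100, 15) for _ in range(100_000)]

# Fraction of values within 1 SD (85..115) and 2 SD (70..130) of the mean
within_1sd = sum(85 <= x <= 115 for x in data) / len(data)
within_2sd = sum(70 <= x <= 130 for x in data) / len(data)
```

The simulated fractions land close to the theoretical 68.3% and 95.4%, with about 2.3% of values in each tail beyond 2 SD.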
Single Sample Means The mean value you calculate from your sample of data points depends on which values you happened to sample and is unlikely to equal the true population mean exactly.
Confidence Intervals A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. If independent samples are taken repeatedly from the same population, and a confidence interval calculated for each sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the unknown parameter.
Confidence Intervals The width of the confidence interval gives us some idea about how uncertain we are about the unknown parameter (see precision). A very wide interval may indicate that more data should be collected before anything very definite can be said about the parameter. Confidence intervals are more informative than the simple results of hypothesis tests (where we decide "reject Ho" or "don't reject Ho") since they provide a range of plausible values for the unknown parameter.
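The repeated-sampling definition of a confidence interval lends itself to a direct simulation (illustrative code with an assumed known population SD, so the simple z-interval applies): draw many samples, build a 95% interval from each, and count how often the true mean is covered.

```python
import random
import statistics

random.seed(42)
MU, SIGMA, N, TRIALS = 50.0, 10.0, 30, 2000  # assumed example population

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.fmean(sample)
    half_width = 1.96 * SIGMA / N ** 0.5   # known-sigma 95% z-interval
    if xbar - half_width <= MU <= xbar + half_width:
        covered += 1

coverage = covered / TRIALS  # close to the nominal 0.95
</```

About 95% of the 2,000 intervals capture the true mean – which is exactly what the confidence level promises, no more.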
Errors of Analysis/Detection Type 1 [alpha] error – finding a significant difference when none really exists -  usually due to random error Type 2 [beta] error – not finding a difference when there is in fact a difference – more likely to occur with smaller sample sizes. Chosen significance levels will impact both of these
Errors of Analysis
                                      TRUE STATE
DECISION                              H₀ is correct             H₀ is wrong
Reject H₀ (find a difference)         Type I ERROR, p = α       CORRECT, p = 1 − β (“power”)
Accept H₀ (no effect or difference)   CORRECT, p = 1 − α        Type II ERROR, p = β
Statistical Significance The probability that the findings are due to chance alone. p < .05 – less than a 5% likelihood of the findings being due to chance. p < .05 is commonly used but arbitrary.
Two-Sample t-Test for Means (unequal variance) Used to determine whether two population means are equal. H₀: μ₁ = μ₂. Hₐ: μ₁ ≠ μ₂. Test statistic: t = (X̄₁ − X̄₂) / √(s₁²/n₁ + s₂²/n₂). Degrees of freedom (Welch–Satterthwaite): ν = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]. Reject H₀ if |t| > t(α/2, ν).
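The Welch statistic and degrees of freedom above can be computed in a few lines of stdlib Python; the billing-level numbers below are hypothetical values invented for illustration, not data from the deck.

```python
import statistics

def welch_t(x, y):
    """Welch's two-sample t statistic and df (unequal variances assumed)."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)  # sample variances
    se2 = vx / nx + vy / ny
    t = (statistics.fmean(x) - statistics.fmean(y)) / se2 ** 0.5
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

# Hypothetical mean-billing-level samples: one provider vs. a peer group
provider = [3.2, 3.8, 3.5, 3.9, 3.4, 3.7]
peers    = [3.0, 3.1, 2.9, 3.3, 3.2, 3.0]
t_stat, df = welch_t(provider, peers)
```

Compare |t_stat| against t(α/2, ν) from a t-table at the computed df; libraries such as SciPy expose the same test as `scipy.stats.ttest_ind(..., equal_var=False)`.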
Critical Values Assuming α = 0.05 (the conventional choice), we can say the following: lim ν→∞ t(α/2, ν) = 1.96. So, for ν > 100, the critical value is approximately 1.96. For smaller ν, a t-value table is needed.
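A minimal sketch of the lookup rule above, using a few standard two-tailed α = 0.05 entries from a t-table (the tiny table and the fallback threshold are illustrative; real code would carry a full table or compute the quantile):

```python
# Two-tailed critical values of t at alpha = 0.05, from a standard t-table
T_CRIT = {10: 2.228, 30: 2.042, 100: 1.984}
Z_CRIT = 1.96  # limiting value as df -> infinity

def t_critical(df):
    """Look up the critical value; use the normal limit for large df."""
    if df > 100:
        return Z_CRIT
    return T_CRIT.get(df)  # a real tool would interpolate a full table

gap_small = T_CRIT[10] - Z_CRIT    # the table matters for small samples
gap_large = T_CRIT[100] - Z_CRIT   # nearly converged to 1.96
```

The shrinking gap shows why 1.96 is a safe shortcut only once ν exceeds about 100.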
Chart of Critical Values
What does it mean? Rejecting H₀ with a positive t-value means it is likely that the provider has a higher mean billing level, at the 95% confidence level. Again, it’s probability – this does not mean we know for sure that the provider’s mean billing level is higher than his/her peers’. The result can be skewed if the underlying data are sufficiently non-Gaussian.
Analytics Recommendations Test for equality of variances using the F-distribution. Add a test for normality to the database (e.g., Kolmogorov–Smirnov, Shapiro–Wilk). Use a non-parametric test or a transformation when data are non-Gaussian. Cannot discriminate between a physician billing higher levels because of upcoding and a physician simply seeing sicker patients than his/her peers – a clinical outcomes measure is needed.
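The equal-variance check recommended above reduces to a variance ratio: compute F as the larger sample variance over the smaller and compare it against an F(n₁ − 1, n₂ − 1) critical value. A minimal sketch, with hypothetical billing data invented for illustration:

```python
import statistics

def variance_ratio(x, y):
    """F statistic for comparing two sample variances (larger over smaller)."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return max(vx, vy) / min(vx, vy)

# Hypothetical billing-level samples with similar spread -> ratio near 1
a = [3.1, 3.4, 2.9, 3.6, 3.2, 3.5]
b = [3.0, 3.3, 3.1, 3.4, 2.8, 3.6]
f_stat = variance_ratio(a, b)  # compare against an F-table critical value
```

A ratio near 1 supports the equal-variance t-test; a large ratio argues for the unequal-variance (Welch) form used earlier in the deck.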
Thank you! Wells 2006
