Biostatistics, part 1, Descriptive statistics: Key concepts
• Population, sample, and individual
• What kinds of data? Continuous vs. categorical.
• How do we summarize data? Statistics
(numerical summaries) and graphics.
• Measures of central tendency and spread.
• Standard error and 95% confidence intervals.
“Aristotle maintained that women have fewer teeth than men;
although he was twice married, it never occurred to him to verify
this statement by examining his wives’ mouths.” -- Sir Bertrand
Russell, The Impact of Science on Society, 1952.
“It is a capital mistake to theorize before you have data.” -- Sir
Arthur Conan Doyle, A Scandal in Bohemia.
And, for another viewpoint:
“If your experiment needs statistics, you ought to have done a better
experiment.” -- Ernest Rutherford.
The bench science perspective: you can control all the variables!
Clinicians, however, know better … human variation is large, and
often inexplicable. Statistics help us describe it and generalize at
least enough to improve our ability to practice medicine.
Populations, Samples, and Individuals
Aristotle speculated about the population of all women (compared
to the population of men). He had immediately available to him a
sample of two women, and he could have counted the number of
teeth for two individuals.
The population is the collection of all people about whom you
would like to ask a research question. This might be a fairly clear-
cut easily defined set of people:
“What proportion of people 65 or older in the US today
have Alzheimer’s disease?”
Or it might be a more hypothetical group:
“How much of a reduction in symptomatic days could a
person expect if treated with a new antiviral for flu?”
Typically, you can’t study everyone in the population.
You can’t afford to have everyone 65 or older in the
US seen by a neurologist, even if you could find all the old
You can’t test everyone with the flu because the cases
haven’t even occurred yet!
So you study a sample, and you try to generalize to the
population. The sample size is the number of individuals in the
sample (not the number of measurements you make on each individual).
A good study design will help make your sample
representative of the population you are concerned about.
Good statistical analysis will help tell you the best answer to
your question about the population, and also how far off you might be.
All biostatistics begins with description. Before you do anything
else, you look at the data and summarize the data. Our goal in this
hour is to show you how to get a first look at the data and get ready
to do more elaborate procedures. A statistic is just a numerical
summary of the data, like the largest number in the data set.
Descriptive statistics should be clear and easily interpreted. They
should not mislead you about the data they are summarizing.
“A habit of basing convictions upon
evidence, and of giving to them only that
degree of certainty which the evidence
warrants, would, if it became general, cure
most of the ills from which the world
suffers.” -- Bertrand Russell
Looking at data: categorical or continuous?
Most data fall into two broad classes.
Continuous data are used to report a measurement of the individual
that can take on any value within an acceptable range. For example,
age, systolic BP, [K+], change in weight over 6 months.
Categorical data are used to report a characteristic of the individual
that has a finite, usually small number of possibilities. The
categories should be clear cut, not overlapping, and cover all the
possibilities. For example, sex (male or female), vital status (alive
or dead), disease stage (depends on disease), ever smoked (yes or no).
Make sure you are very clear about the definitions. Does “one
cigarette and I didn’t inhale” count as smoking?
When designing a study, allow for missing values and refusals.
An example to work with:
A hypothetical clinical trial in small cell lung cancer (SCLC).
Often advanced when diagnosed, poor prognosis.
Many SCLC tumors express receptor tyrosine kinase, KIT.
Blocking KIT might help; previous trials show little benefit.
A novel drug: binds selectively to activated c-KIT receptor.
Preliminary results: may help reverse KIT action.
So we will examine a randomized, double-blind clinical trial of this
new drug: BST-TIK.
Features of this study:
1. Not enough data to know whether we should restrict to
patients whose tumors express KIT.
2. Design: double-blind, randomized, two-arm study. One
arm is standard chemo (cisplatin, irinotecan), other is
standard chemo plus BST-TIK. Total n=500.
3. Primary endpoint: overall survival.
4. Secondary endpoints: toxicity - major (neutropenia,
thrombocytopenia); minor (diarrhea).
5. Possible markers: KIT expression before, after.
6. Demographics: age, sex.
Summarizing categorical data:
Frequency, proportion, percentages in categories.
Male: 321 (64.2%) (overall)
By arm of study:
Standard therapy: KIT expression: 67.6%
New therapy: KIT expression: 69.3%
Note: don’t carry every decimal place imaginable:
Note: categorizing continuous data loses information.
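As a small illustration of the frequency and percentage summary above, here is a minimal Python sketch. It assumes sex was recorded for all 500 patients, so the female count (179) is obtained by subtraction from the 321 males; the function name summarize_categories is made up for this example.

```python
from collections import Counter

def summarize_categories(values):
    """Return {category: (count, percent)}, with percent rounded to one decimal."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}

# 321 males out of 500 comes from the slide; the 179 females are an
# assumption (500 - 321, i.e. no missing values for sex).
sex = ["male"] * 321 + ["female"] * 179

summary = summarize_categories(sex)
for cat, (n, pct) in summary.items():
    print(f"{cat}: {n} ({pct}%)")
```

Rounding to one decimal place is in the spirit of the note about not carrying every decimal place imaginable.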
A second way to summarize
categorical data: graphics
• Bar graphs for
categories that are
• Histograms if you got
categories by dividing
up continuous data.
• Bars do not touch,
Summarizing continuous data: Measures of central tendency
Measures of central tendency tell you in some sense where you
might expect a “typical” person to be, in the middle of the data.
The mean is the arithmetic average. For example, if 3 people were
in hospital 8, 10 and 30 days respectively, the mean time is 48/3 =
16 days. But if they were 8, 10 and 12, the mean is 30/3 = 10
days. Note: mean is sensitive to outliers!
The median is the value at which half the numbers are higher and
half are lower. If number of individuals is odd, it is the middle
value (rank (n+1)/2) and if number is even, it is average of two
middle values. Note that median in both examples above is 10.
Not sensitive to outliers!
A patient might want to know median; an insurer the mean.
The mode is the most common value; rarely used.
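The two hospital-stay examples can be checked with Python's standard statistics module. This is only a sketch of the arithmetic above; the variable names are chosen for illustration.

```python
import statistics

# Lengths of stay in days, taken from the two examples above.
stays_with_outlier = [8, 10, 30]
stays_without_outlier = [8, 10, 12]

for stays in (stays_with_outlier, stays_without_outlier):
    mean_days = statistics.mean(stays)
    median_days = statistics.median(stays)
    print(f"{stays}: mean = {mean_days}, median = {median_days}")
```

The single 30-day stay pulls the mean from 10 up to 16 days, while the median stays at 10 in both cases.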
Measures of central tendency for baseline KIT expression:
Overall: Mean = 10.3, median = 11.0, mode = 0.0.
Patients expressing KIT: Mean = 15.1, median = 15.0, mode = 13.0.
If data are long tailed to right, mean will be > median, influenced
by those high-valued outliers. Here mean roughly = median, a
good sign that data are fairly symmetric. The picture looks more or
less bell-shaped, after we take out those who do not express KIT.
Features to look for in pictures:
Symmetry vs. skewness
Short tails vs. outliers
Bell-shaped vs. very peaked, very flat, or multiple peaks.
Another piece of the picture: measures of spread.
The simplest is the range, largest - smallest. Very sensitive to
outliers. Almost worthless for doing any real statistics.
More useful: measures based on percentiles; the median is also
known as the 50th percentile, because half the data are less than
that value. The 25th and 75th percentiles are called the quartiles,
because one-quarter and three-quarters, respectively, of the data
fall below them. The difference between the quartiles is the inter-
quartile range. Some epidemiologists also work with tertiles,
quintiles, or deciles.
The most useful measure for biostatistics work is the standard
deviation. It is based on the average of the squared distances from
the mean. (Then the square root is taken to make the units come
out right - that is, same units as the original measurement.)
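The measures of spread described above can be sketched in Python as follows. The data values are hypothetical, made up for illustration. Two assumptions to note: statistics.quantiles uses one particular percentile convention (other software may give slightly different quartiles), and statistics.stdev divides by n - 1 rather than n, which is the usual choice for a sample.

```python
import statistics

# Hypothetical measurements, for illustration only (not trial data).
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: largest minus smallest.
data_range = max(data) - min(data)

# quantiles(n=4) returns the three cut points: 25th, 50th, 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # inter-quartile range

# Sample standard deviation: square root of the summed squared
# distances from the mean, divided by n - 1.
sd = statistics.stdev(data)

print(f"range = {data_range}")
print(f"quartiles = {q1}, {q2}, {q3}; IQR = {iqr}")
print(f"SD = {sd:.2f}")
```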
Interpreting the standard deviation: related to bell-shaped curve
If your data are nicely behaved and follow a bell-shaped
distribution curve (also known as the normal or Gaussian
distribution), the standard deviation tells you a lot about how far
any one individual might stray from the mean.
For a bell-shaped distribution:
About two-thirds (roughly 68%) of the individuals will lie between one
standard deviation below the mean and one standard deviation above it.
About 95% of the individuals will lie between two standard deviations
below the mean and two standard deviations above it.
Hardly anyone (about 0.3%) will fall outside three standard deviations
above or below the mean.
Mean KIT expression was 15.1, SD 6.5.
We should find that two-thirds of the data fall between 8.6 and 21.6.
We would expect to find around 5% out below 2.1 or above 28.1,
and in fact this is just about right.
We have one outlier in this data set (more than 3 SD out).
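The cut-offs quoted above follow directly from the mean (15.1) and SD (6.5). A minimal Python sketch, assuming the data really are close to bell-shaped; the helper name sd_interval is made up for this example.

```python
from statistics import NormalDist

def sd_interval(mean, sd, k):
    """Interval from k SDs below the mean to k SDs above, rounded to 0.1."""
    return (round(mean - k * sd, 1), round(mean + k * sd, 1))

mean_kit, sd_kit = 15.1, 6.5  # values reported on the slide

standard_normal = NormalDist()  # mean 0, SD 1
for k in (1, 2, 3):
    low, high = sd_interval(mean_kit, sd_kit, k)
    # Share of a true bell-shaped distribution lying within k SDs of the mean.
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {low} to {high} (about {share:.1%} of individuals)")
```

The lower limit for three SDs comes out negative, which cannot occur for an expression measurement; that is a reminder that the bell-shaped curve is only an approximation to real data.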
How accurate is our guess at the mean?
Suppose we’d like to say that mean BL KIT is 15.1.
We haven’t seen ALL people with SCLC who express KIT, just
342 of them.
How sure can we be about that estimate of 15.1? Could we be off
by 5? 1? How can you guess without studying all patients?
Answer: We can’t be completely sure about this group of 342
patients, but we know a lot about how the scientific process of
taking a random sample and finding its average will behave. And
we hope our sample reflects a somewhat “random” process!
Two key facts about our scientific process:
1. The means from random samples like ours are centered around
the true population mean. That is, our process is unbiased.
2. The means from random samples like ours have approximately
a bell-shaped distribution that clusters more and more tightly around
the true population mean as the sample size gets bigger. The more data
you get, the more precise your guess at the population mean.
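Both facts can be seen in a small simulation. The sketch below is illustrative only: it invents a hypothetical population of 100,000 normally distributed values (mean 15.1, SD 6.5), draws many random samples of several sizes, and looks at where the sample means are centered and how much they spread out.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population (assumed normal); not real patient data.
population = [random.gauss(15.1, 6.5) for _ in range(100_000)]
true_mean = statistics.fmean(population)

def simulate_sample_means(pop, n, reps=2000):
    """Draw `reps` random samples of size n and return their means."""
    return [statistics.fmean(random.sample(pop, n)) for _ in range(reps)]

results = {}
for n in (25, 100, 400):
    means = simulate_sample_means(population, n)
    # Center and spread of the sample means for this sample size.
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    center, spread = results[n]
    print(f"n = {n}: center of sample means = {center:.2f}, spread = {spread:.2f}")

print(f"true population mean = {true_mean:.2f}")
```

The centers should sit close to the true mean at every sample size (fact 1), while the spread of the sample means shrinks as n grows (fact 2).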
The yardstick for how close the sample is to the truth is called the
standard error. It is the standard deviation (how much a single
individual might differ from the mean) divided by √n. So the
more data we have, the closer our sample mean should be to the
truth, since almost all random samples will be very close to each
other and to the true mean.
In our data set, the standard error is 6.5/√342 = 0.35.
In fact, since the means from all the random samples we might
have gotten follow a bell-shaped distribution, we know that
95% of them should be within two standard errors of the truth.
So we guess that the truth is somewhere within two standard
errors above our mean or two standard errors below it. We call
this a 95% confidence interval.
For example, our estimate was 15.1 and our standard error
turned out to be 0.35, so two standard errors would be 0.7. A
95% confidence interval for the mean would be about from
14.4 to 15.8. This is usually written (14.4, 15.8). We have 95%
confidence that the population mean lies in this interval,
because we are using a scientific procedure that works like that!
We know that 95% of our studies will give us an interval that
covers the truth (but we will be off in 5% of our studies.)
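The standard error and the approximate 95% confidence interval can be reproduced in a few lines of Python. This sketch uses the "two standard errors" rule of thumb from the text (rather than the more exact multiplier 1.96); the function names are made up for illustration.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

def approx_ci95(mean, sd, n):
    """Approximate 95% CI: mean plus or minus two standard errors."""
    se = standard_error(sd, n)
    return (mean - 2 * se, mean + 2 * se)

# Values from the slides: mean 15.1, SD 6.5, n = 342 patients expressing KIT.
se = standard_error(6.5, 342)
low, high = approx_ci95(15.1, 6.5, 342)

print(f"SE = {se:.2f}")
print(f"approximate 95% CI = ({low:.1f}, {high:.1f})")
```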
Suppose we wanted to get a confidence interval that was narrower.
We can improve our precision by increasing the sample size.
The more data you have, the more you know about the population,
and the better your guesses about the population mean, or any other
population characteristic of interest.
If we wanted to cut the width in half (make the study twice as
precise) we would have to sample 4 times as many people.
The precision of our study only increases like the square root of n,
not like n. So quadrupling the sample size only cuts the standard
error in half.
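A quick numerical check of the square-root rule, as a sketch using the SD from our example (6.5) and the original sample size (342):

```python
import math

sd = 6.5   # standard deviation from the KIT example
n = 342    # original sample size

se_n = sd / math.sqrt(n)        # standard error with n patients
se_4n = sd / math.sqrt(4 * n)   # standard error with four times as many

print(f"n = {n}: SE = {se_n:.3f}")
print(f"n = {4 * n}: SE = {se_4n:.3f}")
print(f"ratio = {se_4n / se_n:.2f}")
```

Going from 342 to 1,368 patients halves the standard error, and therefore halves the width of the confidence interval.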
So it takes some planning in advance to design a study that will
meet your goals, for a reasonable cost.
Now a follow-up question. How likely is it that an individual in
such a study would have a KIT expression as high as 21?
We claim that normal patients average a KIT expression of 15.1
(95% CI, 14.4-15.8). How much higher or lower would a person
have to be to seem “unusual”?
No, 21 is not unusual! No matter how well we know the MEAN,
individuals don’t have to sit right on top of it. The standard error
refers to how well we can estimate the center. The standard
deviation refers to how well we can guess any individual.
So we wouldn’t find 21 especially surprising. We probably
wouldn’t be surprised by any value between two standard
deviations above and below the mean.
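To put a rough number on this, the sketch below treats KIT expression as exactly bell-shaped with mean 15.1 and SD 6.5. That is an assumption made for illustration; the sample estimates are standing in for the true population values.

```python
from statistics import NormalDist

mean_kit, sd_kit = 15.1, 6.5
value = 21

# How many standard deviations (not standard errors) is 21 from the mean?
z = (value - mean_kit) / sd_kit

# Assumed bell-shaped model for individual patients.
kit_model = NormalDist(mu=mean_kit, sigma=sd_kit)
p_at_least = 1 - kit_model.cdf(value)

print(f"z = {z:.2f} standard deviations above the mean")
print(f"share of individuals at {value} or higher = {p_at_least:.1%}")
```

A value of 21 is less than one standard deviation above the mean, and under this model roughly one patient in five or six would be that high or higher. Dividing by the standard error (0.35) instead of the standard deviation would wrongly make 21 look extreme.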
What will you find reported in the medical literature?
Most studies will summarize central tendency by the mean if the
data look normal, and by the median otherwise.
Some papers will report the standard deviation, some the
standard error, and some both. They are not always labeled! Be
careful! People often show a graph with an “error bar”; it could be either.
If the data are oddly behaved (skewed, multiple peaks, very long
or short tails), people often report the median and percentiles
instead of mean and standard deviation.
Keep your basic scientific question in mind:
Do you want to ask about the average or typical person?
Or do you want to figure out how unusual your patient might be?