Lecture 4 - probability distributions (2).pptx

SDM Lecture 4
Probability Distributions

• Probability distributions
• We have discussed some basic ways in
which we can describe a set of actual
statistical data:
– Mean, median and mode as measures of
central tendency
– Variance, Standard Deviation, Coefficient of
Variation as measures of spread.
– Histograms as a way of plotting the overall
distribution of data.
• But to draw useful conclusions from
statistical data, we need to go beyond
description, to analysis.

• To do this, we need to have some idea, or model
of the type of distribution of data we might
expect a priori. We can then compare the actual
data with the prior expectations.
• For example, a new drug is being tested to treat
a disease. The drug is given to a group of
patients, while a placebo (coloured, flavoured
liquid) is given to a control group, with neither
patients nor doctors knowing which group is
which.
• Records are kept as to how quickly/whether
patients recover in each group.
• Hopefully, more who get the drug will recover.
But if this is the case, how can we tell if this is
purely down to chance, or if the drug really is
working?

• We need an idea of what the distribution of
recovery rates would be likely to be if it was
purely down to chance. How often would 10%
recover, how often would 20% recover, etc.
• Only then can we say, for example “50% of the
treatment group recovered, compared to only
30% of the control group. There would be a less
than 1% chance of this difference occurring
randomly.”
• Thus, to conduct meaningful statistical analysis,
we need to know about the shape of
distributions of different statistics we might
expect beforehand.
• This is based on probability theory – where we
make assumptions and deductions about the
probability of different events happening or
different values of data occurring.
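
As a rough illustration of the drug-trial reasoning above, the sketch below asks how often chance alone would produce a split like "50% versus 30%". The group sizes (100 patients in each group) are an assumption made purely for illustration; the lecture gives only the percentages. If the drug had no effect, the 80 recoveries would be spread at random over the 200 patients, so we can count how often 50 or more of them would fall in the treatment group.

```python
from math import comb

# Hypothetical group sizes (not given in the lecture): 100 patients per group.
n_treat, n_control = 100, 100
rec_treat, rec_control = 50, 30      # 50% and 30% of each group recovered
total = n_treat + n_control
total_rec = rec_treat + rec_control

def prob_treat_recoveries(k):
    """P(exactly k of the recoveries fall in the treatment group) under pure chance."""
    return comb(total_rec, k) * comb(total - total_rec, n_treat - k) / comb(total, n_treat)

# Chance of a split at least as favourable to the drug as the one observed.
p_chance = sum(prob_treat_recoveries(k)
               for k in range(rec_treat, min(total_rec, n_treat) + 1))
print(f"P(50 or more recoveries in the treatment group by chance) = {p_chance:.4f}")
```

Under these assumed group sizes the probability comes out below 1%. With much smaller groups, the same percentages would be far less convincing.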

Some standard probability
distributions
• Binomial - when the underlying probability
experiment has only two possible outcomes
(e.g. tossing a coin)
• Normal - when many small independent
factors influence a variable (e.g. height,
influenced by genes, diet, etc.)
• Poisson - for rare events, when the
probability of occurrence is low

• Tossing a coin – the binomial distribution
• We toss a coin 100 times. We get 60 heads.
Does this suggest that the coin is biased
towards heads?
• Put another way, if the coin were fair, that is
equally likely to give heads or tails, how likely
would it be to get as many as 60 to 40 of one
side to the other?
• Or put another way: suppose we were to do the
experiment of 100 coin tosses a large number of
times, with a fair coin, and record all the
resulting scores from 0 to 100 heads. How often
would we expect to get 0 heads? How often 1
head? And so on, up to 100. What would we
expect the mean and the standard deviation to
be? What would a histogram of these scores be
expected to look like?
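
These questions can be answered directly by counting. The sketch below assumes a fair coin, so that every sequence of 100 tosses is equally likely, and uses Python's `math.comb(n, k)` to count the sequences with exactly k heads. The variable names are illustrative only.

```python
from math import comb

n = 100  # tosses of a fair coin

def prob_heads(k):
    """Probability of exactly k heads in n fair tosses."""
    return comb(n, k) / 2**n

p_60_or_more = sum(prob_heads(k) for k in range(60, n + 1))
mean = sum(k * prob_heads(k) for k in range(n + 1))
variance = sum((k - mean) ** 2 * prob_heads(k) for k in range(n + 1))

print(f"P(60 or more heads) = {p_60_or_more:.4f}")
print(f"mean = {mean:.1f}, standard deviation = {variance ** 0.5:.1f}")
```

The mean comes out as 50 and the standard deviation as 5, so 60 heads is two standard deviations above what a fair coin would be expected to give.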

• The resulting distribution is called a binomial
distribution, as it is based on two possible
outcomes in each trial.
• Let’s take an easier case, where we toss a coin
6 times and count the heads. There are seven
possible scores we can get for the number of
heads – 0 up to 6. How likely is each to occur?
• We can think about this in terms of the different
possible sequences of heads and tails we might
get – e.g. HHTHTT or HTTTTH.
• If the coin is fair, each sequence is equally likely.
So what we need to do is count up the number
of possible sequences that can give each
number of heads.

• How many possible sequences are there?
Well, there are six trials, and each can
have two possible outcomes. This means
there are 2x2x2x2x2x2 possible
sequences, or 2⁶ = 64.
• How many of these sequences give
exactly 0 heads? Clearly only 1: TTTTTT.
• How many give 1 head? There are 6 of
these, as the sole head can occur in any
of the six tosses – that is, we can get
HTTTTT, THTTTT, TTHTTT, TTTHTT,
TTTTHT or TTTTTH.
• What about 2 heads? There are rather
more possibilities.

• HHTTTT
• HTHTTT
• HTTHTT
• HTTTHT
• HTTTTH
• THHTTT
• THTHTT
• THTTHT
• THTTTH
• TTHHTT
• TTHTHT
• TTHTTH
• TTTHHT
• TTTHTH
• TTTTHH
That is, there are 15 possibilities.
Similarly, we can show there are 20
ways of getting three heads. (There is
a formula, for any number of trials,
and for any number of ‘successes’, and
for any probability of success in a
single trial – but we won’t do that
here.)
The rest is easy – the number of ways
of getting 4 heads is the same as the
number of ways of getting 2 tails –
that is 15. Likewise, there are 6 ways
of getting 5 heads, and 1 way of
getting 6 heads. (HHHHHH). Thus, the
distribution is symmetric.
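
The counts above can be checked by brute force. This minimal sketch lists all 64 sequences of six tosses, counts how many contain each number of heads, and prints the result next to the binomial coefficient from Python's `math.comb`, which is one standard way of writing the counting formula mentioned above.

```python
from collections import Counter
from itertools import product
from math import comb

n = 6
# Every possible sequence of six tosses, e.g. "HHTHTT".
sequences = ["".join(s) for s in product("HT", repeat=n)]
counts = Counter(seq.count("H") for seq in sequences)

print(len(sequences))  # 64
for heads in range(n + 1):
    # brute-force count next to the binomial coefficient "n choose heads"
    print(heads, counts[heads], comb(n, heads))
```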

No. of heads Frequency
0 1
1 6
2 15
3 20
4 15
5 6
6 1
Note, this is not a histogram of actual data, but of the expected distribution
based on a particular model of the situation – namely, six independent tosses
of a fair coin.
We could also calculate the mean and standard deviation of this distribution –
you might not be surprised to learn that the mean is 3, and it can be shown
that the variance is 1.5 (so the standard deviation is about 1.22). Again, these
are not the mean and variance of a set of observed data, but anticipated
properties of the distribution based on a theoretical model.
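
As a check on these figures, the mean and variance can be computed from the frequency table itself. The sketch below uses exact fractions, weighting each number of heads by its share of the 64 equally likely sequences.

```python
from fractions import Fraction

# Number of sequences giving each number of heads (from the table above).
frequencies = {0: 1, 1: 6, 2: 15, 3: 20, 4: 15, 5: 6, 6: 1}
total = sum(frequencies.values())  # 64

mean = Fraction(sum(k * f for k, f in frequencies.items()), total)
variance = sum((k - mean) ** 2 * f for k, f in frequencies.items()) / total

print(mean)       # 3
print(variance)   # 3/2
print(round(float(variance) ** 0.5, 2))  # 1.22
```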

• The Normal Distribution
• The binomial distribution is an example of a
discrete distribution – one where the variable in
question can take a certain number of discrete
values (e.g. 0,1,2,3,…).
• Other distributions are continuous – that is, they
can take any value, or any value in a given range –
for example the height, weight or income of a
randomly selected individual.
• A lot of variables in the real world tend to have a
particular shape of distribution – the normal
distribution – and a lot of statistical analysis is
based on the assumption that certain variables
follow some sort of normal distribution.

• Any Normal Distribution
• Bell-shaped
• Symmetric about mean
• Continuous
• Never touches the x-axis
• Total area under curve is 1.00
• Approximately 68% lies within 1 standard
deviation of the mean, 95% within 2
standard deviations, and 99.7% within 3
standard deviations of the mean.

• Data values are represented by x, which has
mean μ and standard deviation σ.
• The probability density function is given by
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
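
For readers who want to see the curve as numbers, here is a minimal sketch of the standard normal density formula, written out in the function `normal_pdf` and compared with `statistics.NormalDist.pdf` from the Python standard library. The mean of 40 and standard deviation of 10 are illustrative values only.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

mu, sigma = 40, 10  # illustrative values only
for x in (20, 30, 40, 50, 60):
    # hand-written formula next to the library value
    print(x, round(normal_pdf(x, mu, sigma), 5), round(NormalDist(mu, sigma).pdf(x), 5))
```

The density is highest at the mean and takes equal values at equal distances either side of it, which is the bell shape and symmetry described above.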

The Standard Normal Curve
• We fix the horizontal scale so that units
of standard deviation are used (Z
values) instead of X values
• All normal distributions are now the
same
• Area = Probability
• Total Area under curve = 1 or 100%

• Standard Normal Distribution
• Same as a normal distribution, but also..
• Mean is zero
• Variance is one
• Standard Deviation is one
• Data values are represented by z.
• The probability density function is given by
$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$

The Standard Normal Curve
[Figure: standard normal curve, horizontal axis Z from -3 to 3]
Z = (X − mean) / (standard deviation)
E.g. if mean = 40 and standard deviation = 10:
When x = 50, z = 1
When x = 30, z = -1
When x = 60, z = 2
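
The conversion from x values to z values is a one-line calculation. The small helper below (the name `z_score` is just illustrative) reproduces the three values for a mean of 40 and a standard deviation of 10.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

for x in (50, 30, 60):
    print(x, z_score(x, mean=40, sd=10))
```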

Areas under the Curve
[Figure: standard normal curve with the area between Z = -1 and Z = 1 shaded]
This area is 0.6826
68.26% of values are within 1 standard
deviation of the mean
There is a probability of 0.68 that a value
will lie in this region

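Areas under the Curve (continued)
• 68.26% of the values are within ± 1
standard deviation of the mean
• About 95% of the values are within ± 2
standard deviations of the mean
• About 99.7% of the values are within ± 3
standard deviations of the mean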

Z values
Z = (X − mean) / (standard deviation)
For the population we use
mean = μ
standard deviation = σ
$z = \frac{x - \mu}{\sigma}$

Table of Areas
[Figure: standard normal curve with the area to the left of a Z value shaded]
The area to the left of a z value is given in tables.
Areas to the left of positive z values are tabulated,
e.g. z = 1, area to the left = 0.8413
z = 1.5, area = 0.9332
z = 1.52, area = 0.9357
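
The tabulated areas can also be obtained in software. As a minimal sketch, `statistics.NormalDist` in the Python standard library gives the area to the left of a z value through its `cdf` method, and the three values quoted above come out the same to four decimal places.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1
for z in (1.0, 1.5, 1.52):
    # area under the standard normal curve to the left of z
    print(z, round(standard_normal.cdf(z), 4))
```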

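Example: μ = 40, σ = 4
a) Find Prob(x > 46)
z = (46 − 40)/4
= 1.5
[Figure: standard normal curve with the area to the right of Z = 1.5 shaded]
Area to the left of z = 1.5 is 0.9332, so the area to the right is 1 − 0.9332 = 0.0668
Proportion is 6.68%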

Example (continued)
b) Find Prob(x < 42)
z = (42 − 40)/4
= 0.5
Area to the left = 0.6915
= 69.15%
c) Find Prob(42 < x < 46)
z1 = 1.5, z2 = 0.5
Area = (area to the right of z2) − (area to the right of z1)
= 0.3085 − 0.0668
= 0.2417
= 24.17%

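Example (continued)
d) Find the value of x that is exceeded by only 5% of values
[Figure: standard normal curve with an upper-tail area of 5% (0.05) shaded]
Area = 0.05, z = 1.645
1.645 standard deviations above the mean is
40 + 1.645 x 4
= 46.58
By symmetry, 1.645 standard deviations below the mean is 40 − 6.58 = 33.42.
This means that 90% of values lie between
33.42 and 46.58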
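
Parts a) to d) of this example can be checked without tables. The sketch below uses `statistics.NormalDist` from the Python standard library with mean 40 and standard deviation 4; `cdf` gives the area to the left of a value and `inv_cdf` goes the other way, from an area to a value.

```python
from statistics import NormalDist

x = NormalDist(mu=40, sigma=4)

p_a = 1 - x.cdf(46)           # a) Prob(x > 46)
p_b = x.cdf(42)               # b) Prob(x < 42)
p_c = x.cdf(46) - x.cdf(42)   # c) Prob(42 < x < 46)
x_d = x.inv_cdf(0.95)         # d) value with 5% of the distribution above it

print(round(p_a, 4), round(p_b, 4), round(p_c, 4), round(x_d, 2))
```

Rounded to four decimal places (two for the value in part d), these agree with the table-based answers of 0.0668, 0.6915, 0.2417 and 46.58.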

Graph of men’s and women’s heights
[Figure: normal curves for men’s and women’s heights; horizontal axis: height in centimetres, 140 to 200]
A graph of a continuous
distribution – a probability
density function – shows the
relative likelihood of getting
different values and ranges
of values for our variable.
Thus, in this example,
values nearest 166 and 174
are most common, and as
you get further from these
means, the proportion of
observations with these
values declines.

Parameters of the distribution
• The two parameters of the Normal distribution are the
mean μ and the variance σ²
x ~ N(μ, σ²)
• Men’s heights are Normally distributed with mean 174 cm
and variance 92.16
xM ~ N(174, 92.16)
• Women’s heights are Normally distributed with a mean of
166 cm and variance 40.32
xW ~ N(166, 40.32)
A normal distribution can have any mean and standard deviation.
We denote the mean by μ, and the standard deviation by σ. (So
the variance is σ².)
The mean μ will be the value in the middle of the normal curve –
thus it is also the median and the mode.

• So What?
• If we know the area under the curve this will tell us the
proportion of the population falling within certain values
of the variable e.g. height.
[Figure: normal curves for men’s and women’s heights; horizontal axis: height in centimetres, 140 to 200]
We could calculate the area by using a mathematical technique called
integration and applying it to the (very complicated) formula for the
normal curve. Fortunately we do not have to do this, because the areas
for THE STANDARD NORMAL CURVE have already been calculated and tabulated.

The Standard Normal
distribution
The Standard Normal distribution has a mean μ = 0 and a standard
deviation σ = 1, unlike, for example, our distribution of women’s
heights with a mean of 166 cm and a standard deviation of 6.35 cm.
All other Normal distributions can be generated from this Standard
Normal distribution through a variable called z, which relates the
mean and standard deviation of any Normal distribution to the
Standard Normal distribution:
$z = \frac{x - \mu}{\sigma}$

Areas under the distribution
• What proportion of women are taller than 175
cm?
[Figure: normal curve of women’s heights, 140 to 200 cm, with the area above 175 cm marked “Need this area”]

• How many standard deviations is 175 above
166?
• The standard deviation is √40.32 = 6.35, hence
z = (175 − 166) / 6.35 = 1.42
• so 175 lies 1.42 s.d.’s above the mean
• How much of the Normal distribution lies beyond
1.42 s.d.’s above the mean? Use tables of the
STANDARD NORMAL DISTRIBUTION

The standard Normal distribution
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.00 .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359
0.10 .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753
0.20 .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141
0.30 .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517
0.40 .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879
0.50 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
0.60 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
0.70 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
0.80 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
0.90 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
1.00 .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621
1.10 .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830
1.20 .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
1.30 .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177
1.40 .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319
1.50 .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441
For z = 1.42 (row 1.40, column .02) the table gives 0.9222, so the area above is
1 – 0.9222 = 0.0778

Answer
• 7.78% of women are taller than 175 cm.
• Summary: to find the area in the tail of the
distribution, calculate the z-score, giving the
number of standard deviations between the
mean and the desired height. Then look the
z-score up in tables and, for an upper tail,
subtract the tabulated area from 1.
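
The same answer can be reached without tables. This sketch assumes the women’s height distribution given earlier (mean 166 cm, variance 40.32) and uses `statistics.NormalDist` from the Python standard library in place of the table lookup.

```python
from math import sqrt
from statistics import NormalDist

women = NormalDist(mu=166, sigma=sqrt(40.32))

z = (175 - women.mean) / women.stdev   # standard deviations above the mean
p_taller = 1 - women.cdf(175)          # area in the upper tail beyond 175 cm

print(round(z, 2), round(p_taller, 4))
```

The result is about 7.8%. It can differ slightly from the table-based 7.78% in the last digit, because the table lookup first rounds z to 1.42.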

• The standard deviation will tell us how spread
out the values are likely to be around the mean.
The following give an idea of the meaning of σ in
a given distribution:
– About 68% of the observations should fall within one
standard deviation either side of the mean.
– About 95% of the observations should fall within two
standard deviations either side of the mean.
– About 99.7% of the observations should fall within
three standard deviations of the mean.
• For example, suppose we have a normal
distribution with mean 100 and standard
deviation 10. Then, if we were to take random
samples from this distribution, 68% of the values
would be expected to fall between 90 and 110,
95% of the values between 80 and 120, and
99.7% between 70 and 130. Less than one in
300 values would fall outside this range.
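
These proportions can be confirmed numerically. A minimal sketch for the example with mean 100 and standard deviation 10, using `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=10)
shares = {}
for k in (1, 2, 3):
    low, high = 100 - 10 * k, 100 + 10 * k
    # proportion of the distribution within k standard deviations of the mean
    shares[k] = dist.cdf(high) - dist.cdf(low)
    print(f"{low} to {high}: {shares[k]:.1%}")

print(f"outside 70 to 130: about 1 in {1 / (1 - shares[3]):.0f}")
```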

Summary
• Most statistical problems concern random
variables which have an associated probability
distribution
• Common distributions are the Binomial,
Normal and Poisson (there are many others)
• Once the appropriate distribution for the
problem is recognised, the solution is
relatively straightforward
