Biostatistics
Mengistu Y. (BSC, MPH-HI, PhD fellow, Assi. Prof. PH)
2022
Learning Objectives
General Objective
♦ To provide the statistical methods and numerical descriptions that are useful to generate
information about certain situations and to present it in such a way that valid interpretations
are possible
Specific Objectives
♦ design, organize, present and summarize data
♦ understand the process involved in data collection and processing
♦ distinguish between categorical and numeric data
♦ understand probabilities and their applications
♦ interpret summary statistics, graphical displays and contingency tables commonly presented in
the health literature
♦ carry out exploratory data analysis
♦ understand the process involved in estimations and hypothesis testing
♦ interpret the functions of confidence intervals and p-values
♦ give an interpretation or reach a conclusion about a population on the basis of information
contained in a sample drawn from that population.
Course content
♦ Introduction to the course
♦ Data and Scales of measurement
♦ Methods of data organization and presentation
♦ Frequency distribution
♦ Measures of central tendency and dispersion
♦ Basic principles of probability
♦ Rules of probability and applications (additive,
multiplicative, Bayes')
References: (available in the Library)
1. Gordis, L. (2009). Epidemiology (4th ed.). USA. Elsevier Inc.
2. Koepsell & Weiss. Epidemiologic Methods. Oxford University Press, 2003.
3. Last (ed.) Dictionary of Epidemiology, 1995
4. Rothman, Kenneth J.; Greenland, Sander; Lash, Timothy L. Modern Epidemiology, 3rd Ed.
Lippincott Williams & Wilkins, 2008.
5. Martin Bland. An introduction to Medical Statistics
6. Colton T. Statistics in Medicine
7. Daniel W. Biostatistics a foundation for analysis in the Health Sciences
8. Kirkwood BR. Essentials of Medical Statistics
9. Knapp RG, Miller MC. Clinical epidemiology and Biostatistics. Baltimore Williams and Wilkins, 1992
10. Pagano & Gauvreau. Principles of Biostatistics
11. Schlesselman, J.J. Case-control studies: Design, Conduct, Analysis. Oxford University Press, New York,
1982
12. Breslow, N.E. Statistical Methods in cancer Research, Volume I-The analysis of case-control studies
Introduction
Statistics is the science of gaining information from data through
• collecting data
• organizing data
• summarizing data
• presenting data
• analysing data and drawing conclusions (inferences) from it.
• It is helpful to think of the process of data analysis as consisting of three
stages: management, description and inference.
Definitions
• Statistics: is used to mean either statistical data or statistical
methods.
• Statistical data:
• When it means statistical data it refers to numerical descriptions of things.
• These descriptions may take the form of counts or measurements.
NB: Even though statistical data always denote figures (numerical
descriptions), it must be remembered that not all numerical descriptions
are statistical data.
• Statistical methods:
• It refers to a body of methods that are used for collecting, organising,
summarizing, analysing and interpreting numerical data for
understanding a phenomenon or making wise decisions.
5/12/2023
Definitions…
• Biostatistics is the application of different statistical methods for
biological, medical and public health data
• A population is any specific collection of objects of interest.
• A sample is any subset or sub-collection of the population
• A census is the case in which the sample consists of the whole population.
Definitions ...
• A measurement is a number or attribute computed for each member
of a population or of a sample.
• A parameter is a characteristic of the population as a whole.
• A statistic is a characteristic of the sample data.
• Descriptive statistics is a study of data: involves organizing,
displaying, and describing properties of the data
• Inferential statistics is drawing conclusions about a population of
interest based on information contained in the sample taken from the
population.
Definitions …
• The distinction between a population together with its parameters
and a sample together with its statistics is a fundamental concept in
inferential statistics.
[Figure: a population, described by parameters, and a sample, described by statistics, linked by inference]
Definition …
• A Variable is a characteristic which takes different values in different
persons, places, or things. In general it is a characteristic which
takes different values.
• Variables are things that we measure, control, or manipulate in
research.
♦Data: are measurements or observations (value) recorded for each
element. For example, data include record on weight, length,
breaking strength, age, sex, religion, marital status, income etc.
Based on the nature of the variables we can have qualitative and
quantitative data.
Dependent vs. Independent
Independent variable:
♦ A variable that you believe might influence your outcome measure.
♦ This might be a variable that you control, like a treatment, or a variable not under
your control, like an exposure. It also might represent a demographic factor like age or
gender.
♦ An independent variable is a hypothesized cause of the dependent variable
• Any variable that you are using to make those predictions is an independent
variable.
• Example: The relationship of dietary fat consumption and the development of
ischemic stroke.
In this study, the independent variables were:
Percentage of total fat in the diet,
Dependent variable
• In research, the dependent variable is the variable that you believe might be influenced or modified by some
treatment or exposure.
• It may also represent the variable you are trying to predict.
• The dependent variable is also called the outcome variable. Which variable is dependent depends on the context
of the study.
• Example: A study examined the relationship of dietary fat consumption and the development
of ischemic stroke.
• In this study, the dependent variable was incidence of ischemic stroke.
Characteristics of statistical data
i) They must be in aggregates – are 'number of facts.' A single fact,
even though numerically stated, cannot be called statistics.
ii) They must be affected to a marked extent by a multiplicity of causes.
This means that statistics are aggregates of such facts only as grow out
of a 'variety of circumstances'. Thus the occurrence of an outbreak is
attributable to a number of factors, e.g. human factors, parasite
factors and environmental factors.
iii) They must be enumerated or estimated according to a reasonable
standard of accuracy. If statistical data is incorrect the results are bound
to be misleading.
Characteristics…
iv) They must have been collected in a systematic manner for a
predetermined purpose. Numerical data can be called statistics only if
they have been compiled in a properly planned manner and for a
purpose about which the enumerator had a definite idea.
v) They must be placed in relation to each other. That is, they must be
comparable. Numerical facts may be placed in relation to each other
either in point of time, space or condition.
Source of data
• Routine data collection
• Routine health unit and community data
• Activity data about patients seen and programmes run, routine
services and epidemiological surveillance;
• Semi-permanent data about the population served, the facility
itself and staff that run it
• Vital registration
• Non-routine data collection
• Surveys
• Population census (headcounts proportion/facility catchment’s area)
• Quantitative or qualitative rapid assessments.
Techniques of data collection
Data collection is a crucial stage in the planning and implementation
of a study
If the data collection has been superficial, biased or incomplete,
data analysis becomes difficult, and the research report will be of
poor quality.
Therefore, we should concentrate all possible efforts on developing
appropriate tools, and should test them several times.
Observation: is a technique that involves systematically
selecting, watching and recording behavior and
characteristics of living things, objects or phenomena.
• Observation of human behavior is a much-used data
collection technique. It can be undertaken in different
ways;
• Participant observation: The observer takes part in the
situation he or she observes.
• Non-participant observation: The observer watches the
situation, openly or concealed, but does not participate
• Observations can give additional, more accurate information
on behavior of people than interviews or questionnaires
• Observations can also be made on objects;
• For example, the presence or absence of a latrine and its state
of cleanliness may be observed.
• Here observation would be the major research technique
• Interview (face-to-face): is a data-collection technique that
involves oral questioning of respondents, either individually or as
a group.
• Answers to the questions posed during an interview can be
recorded by writing them down (either during the interview itself
or immediately after the interview) or by tape-recording the
responses, or by a combination of both.
• Administer written questionnaire: is a data collection tool in
which written questions are presented that are to be answered by
the respondents in written form
• A written questionnaire can be administered in different ways, such
as by:
Sending questionnaires by mail with clear instructions on how to answer
the questions and asking for mailed responses;
Gathering all or part of the respondents in one place at one time, giving
oral or written instructions, and letting the respondents fill out the
questionnaires;
Hand-delivering questionnaires to respondents and collecting them later
Types of questions
• Depending on how questions are asked and recorded
we can distinguish two major possibilities
1. Open-ended questions: (allowing for completely open
as well as partially categorized answers)
They permit free responses, which should be recorded in
the respondents' own words.
Such questions are useful for obtaining in-depth information
on:
• facts with which the researcher is not very familiar,
• opinions, attitudes and suggestions of informants, or
• sensitive issues.
• Example;
1. 'What is your opinion on the services provided in the ANC?' (Explain
why.)
2. 'What do you think are the reasons some adolescents in this area start
using drugs?'
3. 'What would you do if you noticed that your daughter (school girl) had
a relationship with someone?'
• Advantage of open-ended questions
• Allow you to probe more deeply into issues of interest being
raised.
• Information provided in the respondents' own words might be
useful
• Risks of completely open-ended questions
• A big risk is incomplete recording of all relevant issues covered in
the discussion.
• Analysis is time-consuming and requires experience; otherwise
important data may be lost.
2. Closed questions: have a list of possible options or answers
from which the respondents must choose
Closed questions are most commonly used for background
variables such as age, marital status or education, although in
the case of age and education you may also take the exact
values and categorise them during data analysis
• Examples:
1. 'Women who have induced abortion should be severely punished.'
2. 'Did you eat any of the following foods yesterday?' (Circle yes if at least one
item in each set of items is eaten.)
• Advantages of closed questions
• It saves time
• Comparing responses of different groups, or of the same group over time,
becomes easier.
• Risks of closed questions:
• In the case of illiterate respondents, bias will be introduced
Steps in designing questionnaire
1. Content: Take your objectives and variables as a starting point
2. Formulating questions: Formulate one or more questions that will
provide the information needed for each variable.
• Check whether each question measures one thing at
a time.
• Avoid leading questions.
• Ask sensitive questions in a socially acceptable way.
3. Sequencing the questions: Design your interview
schedule or questionnaire to be 'informant friendly'.
4. Formatting the questionnaire:
When you finalize your questionnaire, be sure that
• A separate introductory page is attached to each
questionnaire, explaining the purpose of the study,
requesting the informant's consent to be interviewed and
assuring confidentiality of the data obtained.
• Each questionnaire has a heading and space to insert the number,
date and location of the interview
• You may add the name of the interviewer, to facilitate quality
control.
5. Translation
6. Pre-test
For qualitative studies
• Focus group discussions: allow a group of 8 - 12 informants to freely discuss a
certain subject with the guidance of a facilitator or reporter
• In-depth interview
• Key informant interview
Rationale of studying statistics
• Why do we need to use statistics?
• The reason is the presence of variability.
• Statistics provides a way of organizing information on a wider and
more formal basis
• More and more things are now measured quantitatively in medicine
and public health
• There is a great deal of intrinsic (inherent) variation in most biological
processes
• Public health and medicine are becoming increasingly quantitative.
As technology progresses, the physician encounters more and more
quantitative rather than descriptive information.
Rationale….
• The planning, conduct, and interpretation of much of medical
research are becoming increasingly reliant on statistical technology. Is
this new drug or procedure better than the one commonly in use?
How much better? What, if any, are the risks of side effects associated
with its use?
• Statistics pervades the medical literature.
Limitations of statistics
1. It deals with aggregates of facts and attaches no importance to
individual items; it is suited only when group characteristics are
to be studied.
2. Statistical data are only approximately, not mathematically,
correct.
Data and types of data
• Qualitative (or categorical) data consist of values that can be
separated into different categories that are distinguished by some
nonnumeric characteristic.
• Cannot be measured in quantitative form but can only be identified by name
or categories
• Quantitative data consist of values representing counts or
measurements. Expressed numerically and they can be of two types
(discrete or continuous).
Types of Quantitative Data
• Continuous data can take on any value in a given interval.
Continuous data values result from some continuous scale that
covers a range of values without gaps, interruptions, or jumps.
• Discrete data can take on only particular distinct values and not other
values in between. The set of possible values is either finite
or countably infinite.
Scale of measurement
• Nominal
• Ordinal
• Interval
• Ratio
• Nominal and ordinal are qualitative (categorical) levels of
measurement.
• Interval and ratio are quantitative levels of measurement.
Types of Variables
• Variable types can be distinguished based on their scale. Typically,
different statistical methods are appropriate for variables of different
scales.
Scale     Characteristic question               Examples
Nominal   Is A different from B?                Marital status, eye color, gender, religious affiliation, race
Ordinal   Is A bigger than B?                   Stage of disease, severity of pain, level of satisfaction
Interval  By how many units do A and B differ?  Temperature
Ratio     How many times bigger than B is A?    Distance, length, time until death, weight
Operations that make sense for variables of different scales
Scale     Counting  Ranking  Addition/subtraction  Multiplication/division
Nominal   Yes       No       No                    No
Ordinal   Yes       Yes      No                    No
Interval  Yes       Yes      Yes                   No
Ratio     Yes       Yes      Yes                   Yes
TYPES OF QUALITATIVE MEASUREMENTS
• Nominal level of measurement—classifies data into names, labels or
categories in which no order or ranking can be imposed.
Examples:
• Sex (M, F)
• Exam result (P, F)
• Blood group (A, B, O or AB)
• Color of eyes (blue, green, brown, black)
• Ordinal level of measurement—classifies data into categories that can be
ordered or ranked, but precise differences between the ranks do not exist.
Generally it does not make sense to do calculations with data at the
ordinal level.
Examples:
• Response to treatment (poor, fair, good)
• Severity of disease (mild, moderate, severe)
• Income status (low, middle, high)
TYPES OF QUANTITATIVE MEASUREMENTS
• Interval level of measurement—ranks data, precise differences
between units of measure exist, but there is no meaningful zero. If a
zero exists, it is an arbitrary point. Example—IQ scores, it makes
sense to talk about someone having an IQ 20 points higher than
another person, but an IQ of zero has no meaning.
• Ratio level of measurement—has all the characteristics of the interval
level, but a true zero exists. Also, true ratios exist when the same
variable is measured on two different members of the population.
Example—weight of an individual. It makes sense to say that a 150 lb
adult weighs twice as much as a 75 lb. child.
Figure 1 summarizes the possible data types and levels of measurement.
Figure 1 Data types and levels of measurement. (Copyright © 2009 Pearson Education, Inc.)
Data organization and presentation
• Statistics is used to organize and interpret research
observations and findings.
• Before interpretation & communication of the
findings, the raw data must be organized and
presented in a clear and understandable way.
Techniques are used to organize and summarize a set of
data in a concise way:
• Organization of data
• Summarization of data
• Presentation of data
• Numbers that have not been summarized and
organized are called raw data
• Descriptive statistics include tables, graphical
or chart displays, and the calculation of summary
measures such as means and proportions.
• The methods of describing variables differ
depending on the type of data (Numerical or
Categorical).
Organizing data
Categorical data
• Tables of frequency distributions
  • Frequency
  • Relative frequency
  • Cumulative frequencies
• Graphs
  • Bar charts
  • Pie charts
Continuous or discrete data
• Frequency distribution tables
• Summary measures
• Graphs
  • Histograms
  • Frequency polygons
  • Cumulative frequency polygons
  • Stem-and-leaf plots
  • Box and whisker plots
  • Scatter plots
Frequency distributions
• A frequency distribution is a presentation of the
number of times (or the frequency) that each value (or
group of values) occurs in the study population.
• Ordered array: A simple arrangement of individual
observations in order of magnitude.
• A simple and effective way of summarizing categorical
data is to construct a frequency distribution table.
• This is done by counting the number of observations
falling into each of the categories, or levels of the
variables.
• Consider for example, the variable birth weight with
levels ‘Very low ’, ‘Low’, ‘Normal’ and ‘Big’.
Relative Frequency
• Sometimes it is useful to compute the proportion, or
percentages of observations in each category.
• The distribution of proportions is called the relative
frequency distribution of the variable.
• Given a total number of observations, the relative
frequency distribution is easily derived from the
frequency distribution.
Cumulative frequency
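• Two other distributions, the cumulative frequency and the cumulative
relative frequency, are particularly useful for describing ordinal data.
• They tell us nothing for nominal data, because the categories have no order.
E.g. you would never say that 70% are below blue color.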
• The cumulative frequency is the number of
observations in the category plus observations in all
categories smaller than it.
• Cumulative relative frequency is the proportion of
observations in the category plus observations in all
categories smaller than it, and is obtained by dividing the
cumulative frequency by the total number of
observations.
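The definitions above can be sketched in a few lines of Python. This is only an illustration, using the birth-weight counts reported in Table 2; the variable names are chosen here and are not part of the course material.

```python
from itertools import accumulate

# Birth-weight categories in their natural (ordinal) order, with the
# frequencies reported in Table 2.
categories = ["Very low", "Low", "Normal", "Big"]
freq = [43, 793, 8870, 268]

total = sum(freq)                                   # total number of observations
rel_freq = [100 * f / total for f in freq]          # relative frequency (%)
cum_freq = list(accumulate(freq))                   # running total of the frequencies
cum_rel_freq = [100 * c / total for c in cum_freq]  # cumulative relative frequency (%)

# Print one row per category: frequency, rel. freq., cum. freq., cum. rel. freq.
for row in zip(categories, freq, rel_freq, cum_freq, cum_rel_freq):
    print("{:<9}{:>6}{:>7.1f}{:>7}{:>7.1f}".format(*row))
```

Because the cumulative columns depend on the order of the categories, the same calculation would be meaningless for a nominal variable.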
Table 2. Distribution of birth weight of newborns
between 1976-1996 at TAH.
BWT        Freq.   Rel. freq. (%)   Cum. freq.   Cum. rel. freq. (%)
Very low      43        0.4              43             0.4
Low          793        8.0             836             8.4
Normal      8870       88.9            9706            97.3
Big          268        2.7            9974           100
Total       9974      100
Frequency distribution for numerical data
• Beyond the ordered array, further useful summarization may be
achieved by grouping the data.
• To group a set of observations we select a set of
contiguous, non-overlapping intervals such that each
value in the set of observations can be placed in one,
and only one, of the intervals.
• These intervals are usually referred to as class intervals.
• One of the first considerations when data are to be
grouped is how many intervals to include.
• The question is how best to organize such
data, especially when the data set is too large
to be manageable by eye.
Table 3. Frequencies of serum cholesterol levels for 1067 US males of
ages 25-34, (1976-1980).
------------------------------------------------------------------------------
Cholesterol level
(mg/100 ml)        Freq.   Rel. freq. (%)   Cum. freq.   Cum. rel. freq. (%)
------------------------------------------------------------------------------
80-119               13         1.2             13              1.2
120-159             150        14.1            163             15.3
160-199             442        41.4            605             56.7
200-239             299        28.0            904             84.7
240-279             115        10.8           1019             95.5
280-319              34         3.2           1053             98.7
320-359               9         0.8           1062             99.5
360-399               5         0.5           1067            100
------------------------------------------------------------------------------
Total              1067       100
For both discrete and continuous data the values are
grouped into non-overlapping intervals, usually of equal
width.
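As a rough sketch of how such grouping works, the Python snippet below counts whole-number observations into equal-width, non-overlapping classes. The function name `group_into_classes` and the list of ages are hypothetical, made up purely for illustration, and the sketch assumes no value lies below the chosen starting point.

```python
from collections import Counter

def group_into_classes(values, start, width):
    """Count whole-number values into classes start..start+width-1, and so on.

    Assumes every value is >= start (hypothetical helper for illustration).
    """
    # Index of the class each value falls in: 0 for the first class, 1 for the next, ...
    counts = Counter((v - start) // width for v in values)
    table = []
    for i in range(max(counts) + 1):
        lower = start + i * width
        upper = lower + width - 1
        table.append((lower, upper, counts.get(i, 0)))
    return table

# Hypothetical ages, invented for this example only.
ages = [23, 35, 41, 28, 52, 37, 44, 31, 29, 48]

for lower, upper, f in group_into_classes(ages, start=20, width=10):
    print(f"{lower}-{upper}: {f}")
```

Each value lands in exactly one class, which is the "one, and only one" requirement for class intervals.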
[Figure: example of raw data of age]
[Figure: example of categorized data of age]
How to calculate class interval?
To determine the number of class intervals and the
corresponding width, we use:
• Sturges' rule:
$K = 1 + 3.322 \log_{10} n$
$W = \dfrac{L - S}{K}$
where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value
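A minimal Python sketch of the rule is given below. The function names are illustrative only, and the numbers (80 observations, largest value 38, smallest value 10) are those of the leisure-time example in these notes.

```python
import math

def sturges_classes(n):
    """Unrounded number of class intervals, K = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

def class_width(largest, smallest, k):
    """Class width, W = (L - S) / K."""
    return (largest - smallest) / k

k_raw = sturges_classes(80)   # about 7.32 for n = 80
k = round(k_raw)              # rounded to a whole number of classes
w = class_width(38, 10, k)    # (38 - 10) / 7
print(round(k_raw, 2), k, w)
```

The rule only gives a starting point; in practice the width is often rounded to a convenient number such as 5.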
Example
• Construct a grouped frequency distribution of the
following data on the amount of time (in hours) that
80 college students devoted to leisure activities during
a typical school week:
[Table: the amount of time (in hours) that 80 college students devoted to leisure activities during a typical school week]
• Using the above formula,
$K = 1 + 3.322 \times \log_{10}(80) = 7.32 \approx 7$ classes
• Maximum value = 38 and minimum value = 10
• $w = \text{Range}/K = (38 - 10)/7 = 28/7 = 4$
• Using a width of 5 (a common rule of thumb), we can construct a grouped
frequency distribution for the above data.
Mid-point and True-limits
Mid-point (class mark): the value that lies midway between the lower
and the upper limits of a class.
True limits (class boundaries): those limits that
make an interval of a continuous variable continuous
in both directions.
• Used for smoothing the class intervals
• Subtract 0.5 from the lower limit and add 0.5 to the upper limit
• Note: In the construction of a cumulative frequency distribution, if we
start the cumulation from the lowest value of the variable and move to the
highest, the resulting distribution is called a 'less than'
cumulative frequency distribution; if the cumulation runs from the
highest to the lowest value, the resulting distribution is
called a 'more than' cumulative frequency distribution. The most
commonly used is the 'less than' cumulative frequency distribution.
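The two directions of cumulation can be illustrated with a short Python sketch. It uses the class frequencies of the leisure-time example (classes 10-14 through 35-39); the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies, ordered from the lowest class to the highest
# (leisure-time example: classes 10-14, 15-19, ..., 35-39).
freq = [8, 28, 27, 12, 4, 1]

# 'Less than' type: accumulate from the lowest class upwards.
less_than = list(accumulate(freq))

# 'More than' type: accumulate from the highest class downwards,
# then put the result back in the original class order.
more_than = list(accumulate(reversed(freq)))[::-1]

print(less_than)
print(more_than)
```

The last 'less than' value and the first 'more than' value both equal the total number of observations.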
Example
Time (hours)   True limits    Mid-point   Frequency
10-14           9.5 - 14.5       12            8
15-19          14.5 - 19.5       17           28
20-24          19.5 - 24.5       22           27
25-29          24.5 - 29.5       27           12
30-34          29.5 - 34.5       32            4
35-39          34.5 - 39.5       37            1
Total                                         80
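The true limits and mid-points in this example can be reproduced with a short Python sketch. The helper name `class_summary` is illustrative, and the default of 0.5 assumes data recorded in whole units, as in this example.

```python
def class_summary(lower, upper, half_unit=0.5):
    """Return (true lower limit, true upper limit, mid-point) of a class.

    half_unit is half the unit of measurement; 0.5 assumes whole-number data.
    """
    return lower - half_unit, upper + half_unit, (lower + upper) / 2

class_limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]

for lower, upper in class_limits:
    true_lower, true_upper, mid = class_summary(lower, upper)
    print(f"{lower}-{upper}: true limits {true_lower} - {true_upper}, mid-point {mid}")
```

Note that the upper true limit of one class equals the lower true limit of the next, so the classes join without gaps.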
• Class interval (width): the length of the class, given by the difference
between its class boundaries. For the 1st class, the interval is 14.5 - 9.5 = 5.
• Note: As the sample size increases and the interval width is reduced, the sample
distribution increasingly resembles the population distribution.
• Class intervals should be contiguous, non-overlapping,
mutually exclusive and exhaustive
• Too few intervals result in a loss of information
• Too many intervals mean that the objective of
summarization will not be met
• Class intervals should generally be of the same
width (sometimes this is impossible)
• Open-ended class intervals should be avoided
Exercise
• Construct a grouped frequency distribution and complete the following table for
the age of patients (years) in a diabetic clinic in Addis Ababa, 2010.
Age of patients (years) in a diabetic clinic in Addis Ababa, 2010
Columns: Age group (years) | Class limit | Class boundary | Class mid-point | Tally | Frequency (fi) | Relative frequency (fraction, %) | Cumulative frequency (< method, > method) | Relative cumulative frequency (< method, > method)
Final row: Total
METHOD OF DATA PRESENTATION
Data table
Guidelines for constructing tables
•Keep them simple
•Limit the number of variables
•All tables should be self-explanatory
•Include clear title telling what, where and
when
•Clearly label the rows and columns
• State clearly the unit of measurement used
• Explain codes and abbreviations in a footnote
• Show totals
• If data are not original, indicate the source in a footnote
Graphical presentation of data
• A variety of graph styles can be used to present data.
• The most commonly used types of graph are pie charts, bar
diagrams, histograms, frequency polygon and scatter diagrams.
• The purpose of using a graph is to tell others about a set of data
quickly, allowing them to grasp the important characteristics of
the data.
• In other words, graphs are visual aids to rapid understanding.
Importance of graphs
• Diagrams have greater attraction than mere figures.
• They give delight to the eye, add a spark of interest and as such
catch the attention
• They help in deriving the required information in less time and
without any mental strain.
• They have greater memorizing value than mere figures.
• They facilitate comparison
Bar charts
• Bar chart: Display the frequency distribution for nominal or
ordinal data.
• In a bar chart the various categories into which the observations
fall are represented along the horizontal axis.
• A vertical bar is drawn above each category such that the height
of the bar represents either the frequency or the relative
frequency of observations within the class.
• The vertical axis should always start from 0, but the horizontal
axis can start from anywhere.
• The bars should be of equal width and should be separated from
one another so as not to imply continuity
Figure 1. Bar charts showing frequency distribution of
the variable ‘BWT’.
[Figure: two bar charts over the BWT categories Very low, Low, Normal and Big; one plots frequency (Freq.), the other relative frequency (Rel. Freq.)]
Bar charts for comparison
• Multiple bar chart: In order to compare the
distribution of a variable for two or more groups, bars
are often drawn alongside each other for the groups being
compared in a single bar chart.
• Sub-divided bar chart: If there are different
quantities forming the sub-divisions of the totals, simple
bars may be sub-divided in the ratio of the various
sub-divisions to exhibit the relationship of the parts to the
whole.
Fig 2. Bar chart indicating categories of birth weight of 9975
newborns grouped by antenatal follow-up of the mothers
[Figure: x-axis BWT (Low, Normal, Big); y-axis Percent (0 to 100); series Yes and No; value labels 9, 88.9, 2.1, 7.9, 89, 3.1]
Example: Plasmodium species distribution for confirmed malaria cases, Zeway, 2003
Pie chart
Pie Chart: Displays the frequency distribution for
nominal or ordinal data.
• In a pie chart the various categories into which the
observations fall are represented by sectors of a
circle.
• Each sector represents either the frequency or the
relative frequency of observations within the class;
the angle of each sector is proportional to the frequency or
the relative frequency.
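Since the full circle is 360 degrees, the angle of each sector can be worked out as 360 times the relative frequency. The Python sketch below shows this with the birth-weight counts from Table 2; the variable names are illustrative.

```python
# Birth-weight categories and frequencies from Table 2.
categories = ["Very low", "Low", "Normal", "Big"]
freq = [43, 793, 8870, 268]

total = sum(freq)
# Sector angle in degrees: the category's share of the total, times 360.
angles = [360 * f / total for f in freq]

for name, angle in zip(categories, angles):
    print(f"{name}: {angle:.1f} degrees")
```

The angles always add up to 360 degrees, whatever the frequencies are.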
Figure 3. Pie charts showing frequency distribution of
the variable ‘BWT’
Fig 3(a) Pie chart indicating frequency of categories
of birth weight: Very low 43, Low 793, Normal 8870, Big 268
Fig 3(b) Pie chart indicating relative frequency (%) of
categories of birth weight: Very low 0.4, Low 8, Normal 88.9, Big 2.7
Histogram
• A histogram is a frequency distribution with continuous
class intervals that has been turned into a graph.
• Given a set of numerical data, we can obtain impression
of the shape of its distribution by constructing a
histogram.
• A histogram is constructed by choosing a set of non-
overlapping intervals (class intervals) and counting the
number of observations that fall in each class.
•The number of observations in each class is
called the frequency. Hence histograms are
also called frequency distributions
•It is necessary that the class intervals be
non-overlapping so that each observation
falls in one and only one interval.
• Except for the two boundaries, class intervals are usually
chosen to be of equal width. If this is not the case, the
histogram could give a misleading impression of the
shape of the data
• In drawing a histogram, smoothing of the class
intervals is an important step: we subtract 0.5 from
the lower limit and add 0.5 to the upper limit of
each interval to obtain the class boundaries.
Example
Distribution of the age of women at the time of
marriage
Age group No. of women
15-19 11
20-24 36
25-29 28
30-34 13
35-39 7
40-44 3
45-49 2
[Figure: histogram of the age of women at the time of marriage; x-axis: age group, with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5; y-axis: number of women]
Fig 5. A histogram displaying frequency distribution of birth weight of newborns at
Tikur Anbessa Hospital
[Figure: x-axis Birth weight; y-axis Frequency; Mean = 3126, Std. Dev = 502.34, N = 9975]
Frequency polygons
• Instead of drawing bars for each class interval,
sometimes a single point is drawn at the mid-point of
each class interval and consecutive points are joined by
straight lines.
• Graphs drawn in this way are called frequency polygons.
• Frequency polygons are superior to histograms for
comparing two or more sets of data.
Fig.6. Frequency polygon of birth weight of 9975 newborns at Tikur Anbessa Hospital for males and
females
[Figure: x-axis Birth Weight; y-axis %; separate lines for Males and Females]
Box and Whisker Plot
It is another way to display information when
the objective is to illustrate certain locations
(skewness) in the distribution
Can be used to display a set of discrete or
continuous observations using a single vertical
axis – only certain summaries of the data are
shown
Box plot cont...
A box is drawn with the top of the box at the third
quartile (75%) and the bottom at the first quartile (25%).
The location of the mid-point (50%) of the distribution
is indicated with a horizontal line in the box.
Finally, straight lines, or whiskers, are drawn from the
centre of the top of the box to the largest observation
and from the centre of the bottom of the box to the
smallest observation.
Box cont....
The box plot is then completed:
• Draw a vertical bar from the upper quartile to the
largest non-outlying value in the sample
• Draw a vertical bar from the lower quartile to the
smallest non-outlying value in the sample
• Values that lie outside the interquartile range (IQR = P75 - P25)
but are not outliers are covered by the whiskers on the plot
Box plots are useful for comparing two or
more groups of observations
Drawing a box-and-whisker plot
Raw data
35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58
Order the data
29 34 35 39 41 44 50 54 58 64 72 104
Median = (44 + 50)/2 = 47 = Q2
Q1 = (35 + 39)/2 = 37 (median of the lower half)
Q3 = (58 + 64)/2 = 61 (median of the upper half), Min = 29, Max = 104
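The five numbers above can be reproduced with a short script. This is a minimal sketch that assumes the quartiles are taken as the medians of the lower and upper halves of the ordered data (other quartile conventions give slightly different values); the function name `five_number_summary` is illustrative only.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # observations below the median position
    upper = data[half + n % 2:]    # observations above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

raw = [35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58]
summary = five_number_summary(raw)
print(summary)
```

For the 12 values on this slide the script agrees with the hand calculation: Min = 29, Q1 = 37, Q2 = 47, Q3 = 61, Max = 104.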
Box plot example
[Figure: box plot drawn against a scale from 0 to 110, with Min = 29, Q1 = 37, Q2 = 47, Q3 = 61 and Max = 104.]
Scatter plot
Most studies in medicine involve measuring more than one
characteristic, and graphs displaying the relationship between
two characteristics are common in literature.
When both the variables are qualitative then we can use a
multiple bar graph.
When one of the characteristics is qualitative and the other is
quantitative, the data can be displayed in box and whisker
plots
Scatter plot ….
For two quantitative variables we use bivariate plots (also called scatter plots or scatter diagrams).
A scatter plot is used to see whether a relationship exists between the two measures.
A scatter diagram is constructed by drawing X- and Y-axes.
Each point or dot (•) represents a pair of values measured for a single study subject.
[Figure: example scatter plot showing a positive relation.]
Scatter plot
• Scatter plot helps us to understand the association between two
variables using:
1. The trend
2. The shape and
3. The strength
Measure of association
• Identifying very strong and very weak associations is easy by observing the graph, but how can we classify everything in between?
Summary of data presentation-Insertion
Study variables | Display method | Remarks
Both variables are qualitative | Bar graph |
One qualitative and one quantitative variable | Box and whisker plot | Used to see whether the data are skewed or not
Both variables are quantitative | Scatter plot | Used to see whether a relationship exists between the two measures
Both variables are quantitative | Line graph | Useful for assessing the trend of a particular situation over time, e.g. an epidemic
Scatter plot
• The linear correlation coefficient (R) measures the strength of the linear association between 2 variables.
• R values always range from -1 to 1
• R close to 1 shows a strong positive linear association
• R close to -1 shows a strong negative linear association
• R close to 0 shows a weak or no linear association
• Note: the interpretation of values in between is somewhat subjective
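As an illustration of how R is obtained, the sketch below computes the Pearson correlation coefficient from its definition. The training-hours and accident counts are hypothetical numbers made up for the example (they are not read from any figure in these slides), and `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: more training hours, fewer accidents
hours = [2, 4, 6, 8, 10]
accidents = [50, 42, 30, 21, 10]
r = pearson_r(hours, accidents)
print(round(r, 3))  # close to -1: strong negative linear association
```

A value near -1 here matches the rule of thumb in the list above; values in the middle of the range still need judgment.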
Scatter Plots and Types of Correlation
[Figure: scatter plot of number of accidents (y, 0 to 60) against hours of training (x, 0 to 20). Negative correlation: as x increases, y decreases.]
Scatter Plots and Types of Correlation
[Figure: scatter plot of GPA (y, 1.50 to 4.00) against Math SAT score (x, 300 to 800). Positive correlation: as x increases, y increases.]
Scatter Plots and Types of Correlation
[Figure: scatter plot of IQ (y, 80 to 160) against height (x, 60 to 80). No linear correlation.]
Scatter Diagram…
1. Direction of relationship: positive or negative
2. Form of relationship: linear or curvilinear
3. Degree of relationship: strong or weak
[Figures: for each case, a sketch of Y plotted against X.]
Self Insertion
Line graph
Useful for assessing the trend of a particular situation over time, e.g. monitoring the trend of epidemics.
The time, in weeks, months or years, is marked along the horizontal axis.
Values of the quantity being studied are marked on the vertical axis.
Values for each category are connected by a continuous line.
Sometimes two or more lines are drawn on the same graph, using the same scale, so that the plotted lines are comparable.
No. of microscopically confirmed malaria cases by species and month at Zeway malaria control unit, 2003
[Figure: line graph. Horizontal axis: months (Jan to Dec); vertical axis: number of confirmed malaria cases (0 to 2100); lines for Positive, P. falciparum and P. vivax.]
Line graph cont..
The following graph shows the level of zidovudine (AZT) in the blood of HIV/AIDS patients at several times after administration of the drug, for patients with normal fat absorption and for patients with fat malabsorption.
• A line graph can also be used to depict the relationship between two continuous variables, like a scatter diagram.
Line graph cont…..
Response to administration of zidovudine in two groups of AIDS patients in hospital X, 1999
[Figure: line graph. Horizontal axis: time since administration (min.), 10 to 360; vertical axis: blood zidovudine concentration (0 to 8); lines for fat malabsorption and normal fat absorption.]
Choosing graphs
Type of data or purpose | Appropriate graphs
Metric/Numerical | Histogram (one continuous var); Frequency polygon (one or more cont. var); Cumulative frequency polygon (ogive curve); Box and whisker (one cont. and one cat. var); Stem and leaf (one cont. var); Scatter (two cont. var)
Categorical | Bar (one or more cat. var) (simple/multiple); Pie (one cat. var)
Trend | Line (one cont. and one cat. var, or two cont. var)
SUMMARIZING DATA
Summary Measures
Describing Data Numerically
Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Shape: Skewness
MEASURES OF CENTRAL TENDENCY
• The tendency of statistical data to get concentrated at
certain values is called the “Central Tendency or
average”
• Mean
• Median
• Mode
The Arithmetic Mean or simple Mean
• The mean is the average of the numbers: add up all the numbers, then divide by how many numbers there are.
• In statistical notation it is written as
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Insertion
Weighted mean
$$\bar{X} = \frac{x_1 w_1 + x_2 w_2 + \dots + x_n w_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i}$$
x = variable of interest
w = weighting factor
Mean of grouped data
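$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where m_i is the midpoint and f_i the frequency of the i-th class, and k is the number of classes.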
• Example 1: What is the Mean of these numbers? 6, 11, 7
• Add the numbers: 6 + 11 + 7 = 24
• Divide by how many numbers (there are 3 numbers): 24 / 3 = 8
• The Mean is 8
Why Does This Work?
• It is because 6, 11 and 7 added together is the same as 3 lots of 8:
• It is like you are "flattening out" the numbers.
Example 2
Birth weights (gm) of all live-born infants born at a private hospital in a city during a 1-week period.
What is the arithmetic mean for the sample birth weights?
Weighted Mean
• When averaging quantities, it is often necessary to account for the fact that not all of them are equally important in the phenomenon being described.
• In order to give the quantities being averaged their proper degree of importance, it is necessary to assign them relative importance, called weights, and then calculate a weighted mean.
• The weighted mean of a set of numbers X1, X2, … and Xn, whose relative importance is expressed numerically by a corresponding set of numbers w1, w2, … and wn, is given by
$$\bar{X} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$$
• Example: In a given drug shop four different drugs were sold at unit prices of 60, 85, 95 and 50 birr, and the total numbers of drugs sold were 10, 10, 5 and 20 respectively. What is the average price of the four drugs in this drug shop?
• Solution: For this example we have to use the weighted mean, using the number of drugs sold as the respective weight for each drug's price. Therefore, the average price will be 65 birr:
$$\text{Weighted mean} = \frac{60 \times 10 + 85 \times 10 + 95 \times 5 + 50 \times 20}{10 + 10 + 5 + 20} = \frac{2925}{45} = 65$$
• If we don't consider the weights, the average price will be (60 + 85 + 95 + 50)/4 = 72.5 birr.
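The drug shop calculation can be checked with a few lines of code. This is a small sketch, not part of the original slides; the helper name `weighted_mean` is an assumption made for the example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

prices = [60, 85, 95, 50]        # unit prices in birr
units_sold = [10, 10, 5, 20]     # weights: number of drugs sold

weighted = weighted_mean(prices, units_sold)
unweighted = sum(prices) / len(prices)
print(weighted, unweighted)
```

The weighted result (65 birr) is lower than the unweighted one (72.5 birr) because the cheapest drug accounts for the largest share of sales.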
Weighted Mean
• We can also calculate a weighted mean using some weighting
factor:
e.g. What is the average income of all
people in cities A, B, and C :
City Avg. Income Population
A $23,000 100,000
B $20,000 50,000
C $25,000 150,000
Here, population is the weighting factor and the average income is the variable of interest.
$$\bar{X} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Self insertion
Variable of interest = income
Weighting factor = population
Note
• Here we first have to identify the variable of interest and the weighting factor.
In this case
• Income is the variable of interest and
• Population is the weighting factor
$$\bar{X} = \frac{23000 \times 100000 + 20000 \times 50000 + 25000 \times 150000}{100000 + 50000 + 150000} = \frac{7{,}050{,}000{,}000}{300{,}000} = 23500$$
Geometric Mean
• The Geometric Mean is a special type of average where we multiply
the numbers together and then take a square root (for two numbers),
cube root (for three numbers) etc.
Example: What is the Geometric Mean of 2 and 18?
• First we multiply them: 2 × 18 = 36
• Then (as there are two numbers) take the square root: √36 = 6
• Geometric Mean of 2 and 18 = √(2 × 18) = 6
• It is like the area is the same!
Self insertion
Method for calculating the geometric mean
There are two methods for calculating the geometric mean.
Method A
• Step 1. Take the logarithm of each value.
• Step 2. Calculate the mean of the log values by summing the log values,
then dividing by the number of observations.
• Step 3. Take the antilog of the mean of the log values to get the geometric
mean
• $GM = 10^{\frac{1}{n}\sum_{i=1}^{n} \log_{10} x_i}$
Method B
• Step 1. Calculate the product of the values by multiplying all of the values
together.
• Step 2. Take the nth root of the product (where n is the number of observations)
to get the geometric mean.
$GM = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n}$, where
• GM = geometric mean
• n = number of observations
• $\sqrt[n]{\ }$ = nth root
Example: What is the Geometric Mean of 10, 51.2 and 8?
• First we multiply them: 10 × 51.2 × 8 = 4096
• Then (as there are three numbers) take the cube root: ∛4096 = 16
• For n numbers: multiply them all together and then take the nth root (written ⁿ√ )
• Geometric Mean = ∛(10 × 51.2 × 8) = 16
• It is like the volume is the same: a 10 × 51.2 × 8 box has the same volume as a 16 × 16 × 16 cube.
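Both ways of calculating the geometric mean (via logarithms and via the nth root of the product) can be sketched in code. The function names are illustrative, and the sketch assumes all values are strictly positive, since the logarithm is undefined otherwise.

```python
import math

def geometric_mean_logs(values):
    """Method A: antilog of the mean of the base-10 logarithms."""
    logs = [math.log10(v) for v in values]
    return 10 ** (sum(logs) / len(logs))

def geometric_mean_product(values):
    """Method B: nth root of the product of the n values."""
    return math.prod(values) ** (1 / len(values))

print(geometric_mean_logs([2, 18]))
print(geometric_mean_product([10, 51.2, 8]))
```

Up to floating-point rounding, the two calls give 6 and 16, and the two methods agree with each other on the same data. Method A is usually preferred for long data sets because the raw product can overflow.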
Estimating the Mean from Grouped Data
Someone timed 21 people in the race, to the nearest second:
Seconds Frequency
51 - 55 2
56 - 60 7
61 - 65 8
66 - 70 4
• The groups (51-55, 56-60, etc.), also called class intervals, are of width 5
• The midpoints are in the middle of each class: 53, 58, 63 and 68
Cntd…
We can estimate the Mean by using the midpoints
So, how does this work?
Think about the 7 runners in the group 56 - 60: all we know is that they ran
somewhere between 56 and 60 seconds:
•Maybe all seven of them did 56 seconds,
•Maybe all seven of them did 60 seconds,
•But it is more likely that there is a spread of numbers: some at 56, some at 57, etc
So we take an average and assume that all seven of them took 58 seconds.
Cntd…
• Our thinking is: "2 people took 53 sec, 7 people took 58 sec, 8
people took 63 sec and 4 took 68 sec". In other words
we imagine the data looks like this:
• 53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63,
68, 68, 68, 68
• Then we add them all up and divide by 21. The quick way to
do it is to multiply each midpoint by each frequency
• And then our estimate of the mean time to complete the race
is:
• Estimated Mean = (2 × 53 + 7 × 58 + 8 × 63 + 4 × 68) / 21 = 1288 / 21 = 61.333…
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
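The midpoint method for the race times can be written as a short function. This is a sketch under the assumption that each class is given as (lower limit, upper limit, frequency); `grouped_mean` is an illustrative name.

```python
def grouped_mean(classes):
    """Estimate the mean from (lower, upper, frequency) classes using midpoints."""
    total_frequency = sum(freq for _, _, freq in classes)
    weighted_sum = sum((lower + upper) / 2 * freq for lower, upper, freq in classes)
    return weighted_sum / total_frequency

race = [(51, 55, 2), (56, 60, 7), (61, 65, 8), (66, 70, 4)]
estimate = grouped_mean(race)
print(round(estimate, 3))
```

The result is only an estimate (about 61.333 seconds), because every runner in a class is assumed to sit at the class midpoint.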
Corrected mean
• If a wrong figure has been used when calculating the mean, the correct mean can be obtained without repeating the whole process using:
Corrected mean = (n × wrong mean − wrong value + correct value) / n
• Example: The average weight of 10 patients was calculated to be 65 kg. Later it was discovered that one weight was misread as 40 kg instead of 80 kg. Calculate the correct average weight.
• Solution: Corrected mean = (10 × 65 − 40 + 80) / 10 = 690 / 10 = 69 kg
The effect of transforming original series on the mean.
• If a constant k is added to (or subtracted from) every observation, then the new mean will be the old mean + k (or the old mean − k, respectively).
• If every observation is multiplied by a constant k, then the new mean will be k × the old mean.
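A quick numerical check of these two properties, using the numbers 6, 11 and 7 from the earlier mean example and an arbitrarily chosen constant k = 3 (the constant is an assumption for illustration):

```python
from statistics import mean

data = [6, 11, 7]
k = 3

old_mean = mean(data)                       # 8
shifted_mean = mean([x + k for x in data])  # every observation increased by k
scaled_mean = mean([x * k for x in data])   # every observation multiplied by k

print(old_mean, shifted_mean, scaled_mean)
```

The shifted mean equals the old mean plus k, and the scaled mean equals k times the old mean.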
Characteristics of mean
• The value of the arithmetic mean is determined by every
item in the series.
• It is greatly affected by extreme values.
Advantages
• It is based on all values given in the distribution.
• It is most easily understood.
• It is most amenable to algebraic treatment.
Disadvantages
• It may be greatly affected by extreme items and its
usefulness as a “Summary of the whole” may be
considerably reduced.
• When the distribution has open-ended classes, its computation would be based on an assumption, and therefore may not be valid.
Median
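• Suppose there are n observations in a sample. If these observations are ordered from smallest to largest, then the median is defined as follows:
• The sample median is
(1) the ((n + 1)/2)th ordered observation if n is odd
(2) the average of the (n/2)th and (n/2 + 1)th ordered observations if n is even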
Example 2
2.1. Compute the sample median for the birth weight data in the earlier example.
2.2. Consider the following data, which consist of white blood counts taken on admission of all patients entering a small hospital on a given day. Compute the median white blood count (× 10³).
7, 35, 5, 9, 8, 3, 10, 12, 8
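A sketch of the median rule (middle value for odd n, average of the two middle values for even n), applied to the white blood count data; `sample_median` is an illustrative name.

```python
def sample_median(values):
    """Median: middle ordered value (odd n) or mean of the two middle values (even n)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

wbc = [7, 35, 5, 9, 8, 3, 10, 12, 8]   # white blood counts (x 10^3)
print(sorted(wbc))
print(sample_median(wbc))
```

The ordered counts are 3, 5, 7, 8, 8, 9, 10, 12, 35, so the median is the 5th value, 8 (× 10³). The extreme value 35 has no effect on it.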
Estimating the Median from Grouped Data
• Let's look at our data again:
The median is the middle value, which in our case is
the 11th one, which is in the 61 - 65 group:
We can say "the median group is 61 - 65"
Cntd…
• We call it "61 - 65", but it really includes values from 60.5 up to (but not including) 65.5.
• Why? The values are in whole seconds, so a real time of 60.5 is measured as 61. Likewise 65.4 is measured as 65.
• At 60.5 we already have 9 runners, and by the next boundary at 65.5 we have 17 runners. By drawing a straight line in between we can pick out where the median frequency of n/2 runners is:
Seconds Frequency Cumulative frequency
51-55 2 2
56-60 7 9
61-65 8 17
66-70 4 21
Cntd..
$$\text{Estimated Median} = L + \frac{(n/2) - \text{cf}}{\text{fmg}} \times w$$
where
• L is the lower class boundary of the group containing the median
• n is the total number of values
• cf is the cumulative frequency of the groups before the median group
• fmg is the frequency of the median group
• w is the group width
• For our example:
• L = 60.5
• n = 21
• cf = 2 + 7 = 9
• fmg = 8
• w = 5
$$\text{Estimated Median} = 60.5 + \frac{(21/2) - 9}{8} \times 5 = 60.5 + 0.9375 = 61.4375$$
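The interpolation formula for the grouped median translates directly into code. A minimal sketch; the function and parameter names are chosen for this example only.

```python
def grouped_median(lower_boundary, n, cum_freq_before, median_group_freq, width):
    """Estimated median = L + ((n/2 - cf) / fmg) * w."""
    return lower_boundary + (n / 2 - cum_freq_before) / median_group_freq * width

# Race example: median group 61-65 (lower boundary 60.5), 21 runners,
# 9 runners before the median group, 8 in it, class width 5
median_estimate = grouped_median(60.5, 21, 9, 8, 5)
print(median_estimate)
```

For the race data this gives 61.4375 seconds, which lies inside the median group 61 - 65 as expected.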
i) Characteristics of Median
• It is an average of position/location.
• It is affected more by the number of items than by extreme values.
ii) Advantages
• It is easily calculated and is not much disturbed by extreme
values
• It is more typical of the series
• The median may be located even when the data are
incomplete, e.g, when the class intervals are irregular and the
final classes have open ends.
iii) Disadvantages
• It is determined mainly by the middle points in a sample and is less sensitive to the actual numerical values of the remaining data points.
• It is not as generally familiar as the arithmetic mean
Mode
• It is the value of the observation that occurs with the greatest
frequency.
• A particular disadvantage is that, with a small number of
observations, there may be no mode.
• In addition, sometimes, there may be more than one mode
such as when dealing with a bimodal (two-peak) distribution.
• Find the modal values for the following data
a) 22, 66, 69, 70, 73. (No modal value)
b) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg)
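The two cases above (no mode, one mode) can be handled by counting how often each value occurs. This sketch adopts the convention used on this slide, that a data set in which no value repeats has no mode; `modes` is an illustrative name.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []          # every value occurs once: no mode
    return [value for value, count in counts.items() if count == top]

print(modes([22, 66, 69, 70, 73]))
print(modes([1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]))
```

A bimodal data set would return two values from the same function.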
Estimating the Mode from Grouped Data
• We can easily find the modal group (the group with the highest
frequency), which is 61 - 65
• We can say "the modal group is 61 - 65"
$$\text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$$
Cntd…
• where:
• L is the lower class boundary of the modal group
• fm-1 is the frequency of the group before the modal group
• fm is the frequency of the modal group
• fm+1 is the frequency of the group after the modal group
• w is the group width
• In this example:
• L = 60.5
• fm-1 = 7
• fm = 8
• fm+1 = 4
• w = 5
$$\text{Estimated Mode} = 60.5 + \frac{8 - 7}{(8 - 7) + (8 - 4)} \times 5 = 60.5 + \frac{1}{5} \times 5 = 61.5$$
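The grouped-mode formula can be sketched the same way as the grouped median. The function and parameter names below are assumptions made for the illustration.

```python
def grouped_mode(lower_boundary, f_prev, f_modal, f_next, width):
    """Estimated mode = L + (fm - fm-1) / ((fm - fm-1) + (fm - fm+1)) * w."""
    before = f_modal - f_prev
    after = f_modal - f_next
    return lower_boundary + before / (before + after) * width

# Race example: modal group 61-65, neighbouring frequencies 7 and 4
mode_estimate = grouped_mode(60.5, 7, 8, 4, 5)
print(mode_estimate)
```

The estimate (61.5) is pulled towards the lower boundary because the group before the modal group is nearly as frequent as the modal group itself.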
Mode
Characteristics
• It is an average of position
• It is not affected by extreme values
• It is the most typical value of the distribution
Advantages
• Since it is the most typical value it is the most descriptive
average
• Since the mode is usually an “actual value”, it indicates the
precise value of an important part of the series.
Disadvantages:-
• Unless the number of items is fairly large and the
distribution reveals a distinct central tendency, the mode has
no significance
• It is not capable of mathematical treatment
• In a small number of items the mode may not exist.
Skewness:
• If extremely low or extremely high observations are present in a
distribution, then the mean tends to shift towards those scores.
Based on the type of skewness, distributions can be:
• Negatively skewed distribution: occurs when majority of scores are
at the right end of the curve and a few small scores are scattered at
the left end.
• Positively skewed distribution: Occurs when the majority of scores
are at the left end of the curve and a few extreme large scores are
scattered at the right end.
• Symmetrical distribution: It is neither positively nor negatively
skewed. A curve is symmetrical if one half of the curve is the mirror
image of the other half.
Skewness…
• Data can be "skewed", meaning it tends to have a long tail on one
side or the other:
• Negative Skew?
• Why is it called negative skew? Because the long "tail" is on the
negative side of the peak.
• The mean is also on the left of the peak.
Skewness…
The Normal Distribution has No Skew
A Normal Distribution is not skewed.
It is perfectly symmetrical.
And the Mean is exactly at the peak.
Skewness…
Positive Skew
And positive skew is when the long tail is on the
positive side of the peak, and some people say it
is "skewed to the right".
The mean is on the right of the peak value.
Skewness…
Measures of Dispersion
• Which of the
distributions of scores
has the larger
dispersion?
[Figure: two bar charts of ten scores each (horizontal axis: 1 to 10; vertical axis: 0 to 125), one above the other.]
The upper distribution
has more dispersion
because the scores
are more spread out
Measures of Dispersion
• How “spread out” are the numbers about the centre?
• Consider the following data sets:
Set 1: 60 40 30 50 60 40 70 (mean = 50)
Set 2: 50 49 49 51 48 50 53 (mean = 50)
• The two data sets given above both have a mean of 50, but obviously set 1 is more “spread out” than set 2. How do we express this numerically?
• Some of the commonly used measures of dispersion (variation) are: range, interquartile range, quartiles, percentiles, variance, standard deviation and coefficient of variation.
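To see how two of these measures separate Set 1 from Set 2, the sketch below computes the range and the sample standard deviation of each set with Python's standard `statistics` module.

```python
from statistics import mean, stdev

set1 = [60, 40, 30, 50, 60, 40, 70]
set2 = [50, 49, 49, 51, 48, 50, 53]

for name, data in (("Set 1", set1), ("Set 2", set2)):
    data_range = max(data) - min(data)
    # stdev() is the sample standard deviation (divides by n - 1)
    print(name, "mean:", mean(data), "range:", data_range, "SD:", round(stdev(data), 2))
```

Both sets have the same mean, but the range and the standard deviation of Set 1 are several times larger than those of Set 2.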
Range and Interquartile Range
• Range
• Simplest and the crudest measure of variation
• Difference between the largest and the smallest observations: Range =
Xlargest – Xsmallest
• Ignores the way in which data are distributed
• It wastes information for it takes no account of the entire data.
• Sensitive to outliers
• Interquartile Range
• Eliminate some high- and low-valued observations and calculate the range
from the remaining values
• Interquartile range = 3rd quartile – 1st quartile
= Q3 – Q1
Quartiles and Percentiles
• The quartiles divide the distribution into four equal parts.
• Deciles: If data is ordered and divided into 10 parts, then cut points
are called Deciles
• Percentiles: If data is ordered and divided into 100 parts, then cut
points are called Percentiles
Quartiles
• The 25th percentile is
often referred to as the
first quartile and denoted
Q1.
• The 50th percentile (the median) is referred to as the second or middle quartile and written Q2, and
• the 75th percentile is referred to as the third quartile, Q3.
When we wish to find the
quartiles for a set of data, the
following formulas are used
Using the Five-Number Summary to Explore the Shape
• Box-and-Whisker Plot: a graphical display of data using the 5-number summary:
Minimum, Q1, Median, Q3, Maximum
• The box and central line are centered between the endpoints if the data are symmetric around the median
Distribution Shape and Box-and-Whisker Plot
[Figure: box-and-whisker plots, each marked Q1, Q2, Q3, for left-skewed, symmetric and right-skewed distributions.]
Standard Deviation and Variance
• show the scatter of the individual measurements around the mean of
all the measurements in a given distribution.
• The variance represents squared units and, therefore, is not an
appropriate measure of dispersion when we wish to express this
concept in terms of the original units.
• To obtain a measure of dispersion in original units, we merely take the
square root of the variance. The result is called the standard
deviation.
• Variance is the average of the squared differences from the mean
• Standard deviation is the square root of the variance
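Variance and Standard Deviation
Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
SD: $\sigma = \sqrt{\sigma^2}$ (population), $s = \sqrt{s^2}$ (sample)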
To calculate standard deviation
1. Calculate the mean $\bar{x}$
2. Calculate the residual for each x: $x - \bar{x}$
3. Square the residuals: $(x - \bar{x})^2$
4. Calculate the sum of the squares: $\sum (x - \bar{x})^2$
5. Divide the sum in Step 4 by (n − 1): $\frac{\sum (x - \bar{x})^2}{n - 1}$
6. Take the square root of the quantity in Step 5: $\sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Example: Find the Standard Deviation of Ungrouped Data
Family No. | 1 2 3 4 5 6 7 8 9 10 | Total
Size (xi) | 3 3 4 4 5 5 6 6 7 7 | 50
xi − x̄ | -2 -2 -1 -1 0 0 1 1 2 2 | 0
(xi − x̄)² | 4 4 1 1 0 0 1 1 4 4 | 20
Here,
$$\bar{x} = \frac{\sum x_i}{n} = \frac{50}{10} = 5, \qquad s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{20}{9} \approx 2.2, \qquad s = \sqrt{2.2} \approx 1.48$$
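The six steps for the sample standard deviation can be followed literally in code for the family size data. This is a sketch; the variable names are chosen for the illustration.

```python
import math

sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]   # family sizes
n = len(sizes)

mean = sum(sizes) / n                       # step 1: the mean
residuals = [x - mean for x in sizes]       # step 2: residuals
squares = [r ** 2 for r in residuals]       # step 3: squared residuals
sum_of_squares = sum(squares)               # step 4: sum of squares
variance = sum_of_squares / (n - 1)         # step 5: divide by n - 1
sd = math.sqrt(variance)                    # step 6: square root

print(mean, sum_of_squares, round(variance, 2), round(sd, 2))
```

Without intermediate rounding the variance is 20/9 ≈ 2.22 and the standard deviation is about 1.49; rounding the variance to 2.2 before taking the square root gives 1.48.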
Example
• The lengths of five newborn babies are: 600mm, 470mm, 170mm, 430mm and 300mm.
• Find the Mean, the Variance, and the Standard Deviation.
• Your first step is to find the Mean:
• Answer:
• Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
• so the mean (average) length is 394 mm.
• Let's plot this on the chart:
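Cntd…
To calculate the Variance, take each difference from the mean, square it, and then average the result:
$$\sigma^2 = \frac{206^2 + 76^2 + (-224)^2 + 36^2 + (-94)^2}{5} = \frac{108520}{5} = 21704$$
Standard Deviation
$$\sigma = \sqrt{21704} = 147.32\ldots \approx 147 \text{ (to the nearest mm)}$$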
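The variance 21,704 used here is a population variance (the squared differences are divided by 5, not by 4). A short check with the standard `statistics` module, whose `pvariance` and `pstdev` functions divide by N:

```python
from statistics import mean, pvariance, pstdev

lengths = [600, 470, 170, 430, 300]   # lengths in mm

print(mean(lengths))                  # 394
print(pvariance(lengths))             # population variance (divide by N)
print(round(pstdev(lengths), 2))      # population standard deviation
```

Using `variance` and `stdev` instead would divide by n − 1 and give larger values, which is the appropriate choice when the five lengths are treated as a sample.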
Cntd…
• And the good thing about the Standard Deviation is that it is useful.
Now we can show which lengths are within one Standard Deviation
(147mm) of the Mean:
• So, using the Standard Deviation we have a "standard" way of
knowing what is normal, and what is extra long or extra short.
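Why square the differences?
• If we just add up the differences from the mean, the negatives cancel the positives:
(4 + 4 − 4 − 4) / 4 = 0
So that won't work. How about we use absolute values?
(|4| + |4| + |−4| + |−4|) / 4 = 4
That looks better, but the more spread out differences 7, 1, −6 and −2 give the same value:
(|7| + |1| + |−6| + |−2|) / 4 = 4
If instead we square each difference, average, and take the square root, the larger spread shows up:
√((4² + 4² + 4² + 4²) / 4) = √(64 / 4) = 4
√((7² + 1² + 6² + 2²) / 4) = √(90 / 4) = 4.74...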
Coefficient of Variation
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Can be used to compare two or more sets of data measured
in different units
• Computed from the SD and the mean (dividing the SD by the mean):
$$CV = \frac{SD}{\bar{X}} \times 100\%$$
Basic principles of probability, rules and
its applications
Probability
• Probability is the language of chance.
• The deliberate use of chance is the central idea of statistical designs
for producing data.
• Probability is so important for data – leaders of the distribution as
maps for a journey
• Probabilities are used in everyday communication
• Probability theory was developed out of attempting to solve
problems related to games of chance such as tossing a coin, rolling a
die etc.
i.e. trying to quantify personal beliefs regarding degrees of
uncertainty.
Questions on Simple Probabilities
1. What is the probability that a card drawn at random from a deck of cards will be an ace?
4/52 = 1/13 = 0.076923
2. A book contains 32 pages numbered 1, 2, ..., 32. If a student randomly opens the book, what is the probability that the page number contains the digit 1?
The pages containing the digit 1 are 1, 10 to 19, 21 and 31 (13 pages), therefore 13/32 = 0.40625
3. A mother is in the delivery room to give birth, and the health worker has informed her that she will deliver at 9:30 pm. She is eager to give birth to a daughter. What is the probability that she will get what she wants? 1/2 = 0.5
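Simple probabilities like these are counts of favourable outcomes divided by counts of possible outcomes, so they can be checked by enumeration. The sketch below counts the page numbers from 1 to 32 whose decimal representation contains the digit 1; counting directly gives 13 such pages (1, 10 to 19, 21 and 31), not only the four pages that end in 1.

```python
from fractions import Fraction

pages = range(1, 33)                                   # pages 1..32
with_digit_one = [p for p in pages if "1" in str(p)]   # favourable outcomes

p_digit_one = Fraction(len(with_digit_one), len(pages))
p_ace = Fraction(4, 52)                                # 4 aces in a 52-card deck

print(with_digit_one)
print(p_digit_one, float(p_digit_one))
print(p_ace, round(float(p_ace), 6))
```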
Chance
• When a meteorologist states that the chance of rain is
50%, the meteorologist is saying that it is equally likely to
rain or not to rain. If the chance of rain rises to 80%, it is
more likely to rain. If the chance drops to 20%, then it
may rain, but it probably will not rain.
• These examples suggest the chance of an occurrence of
some event of a random variable.
Basic terms
• Experiment: any activity from which a result can be obtained.
• Example: 1. flipping a coin
2. rolling a die
3. drawing 30 individuals from the population
• Sample space: the set of possible outcomes of the experiment
Example: 1. coin toss {H, T}
2. rolling a die {1, 2, 3, 4, 5, 6}
• Event: a collection of outcomes
• The Sample Space is all
possible outcomes.
• A Sample Point is just
one possible outcome.
• And an Event can be
one or more of the
possible outcomes.
Properties of probability
1. The probability of any event ranges from 0 to 1 (0 to 100%)
2. In general, if two events are not mutually exclusive (not disjoint), the probability that one or both of them happen is given by
• P(A∪B) = P(A) + P(B) − P(A∩B)
3. If two events are mutually exclusive or disjoint then
• P(A∪B) = P(A) + P(B)
• P(A and B) = P(A∩B) = 0
4. If two events are independent then
• P(A∩B) = P(A) × P(B)
• P(A|B) = P(A)
• P(B|A) = P(B)
Unions of Two Events
•“If A and B are events, then the union of A and B, denoted by
A∪B, represents the event composed of all basic outcomes in A
or B.”
• Intersections of Two Events
“If A and B are events, then the intersection of A and B,
denoted by A∩B, represents the event composed of all
basic outcomes in A and B.”
Unions and Intersections
[Figure: Venn diagram of events A and B.]
Addition rules
• Rule 1: If 2 events, B & C, are mutually exclusive (i.e., no overlap), then the probability that one of them occurs is P(B or C) = P(B ∪ C) = P(B) + P(C)
• Rule 2: If two events are mutually exclusive and together cover the whole sample space (complementary events), then the sum of their probabilities is equal to one. The converse does not hold: a sum of one does not by itself make two events mutually exclusive.
• Rule 3: For any 2 events, A & B, not mutually exclusive, the probability that one or both occur is P(A or B) = P(A∪B) = P(A) + P(B) − P(A∩B)
• Example 1: One die is rolled. Sample space = S = (1, 2, 3, 4, 5,
6)
Let A = the event an odd number turns up, A = (1, 3, 5)
Let B = the event a 1, 2 or 3 turns up; B = (1, 2, 3)
Let C = the event a 2 turns up, C= (2)
I) Find Pr (A); Pr (B) and Pr (C)
• Pr (A) = Pr (1) + Pr (3) + Pr (5) = 1/6+1/6+ 1/6 = 3/6 = 1/2
• Pr (B) = Pr (1) + pr (2) + Pr (3) = 1/6+1/6+1/6 = 3/6 = ½
• Pr (C) = Pr (2) = 1/6
II) Are A and B; A and C; B and C mutually exclusive?
• A and B are not mutually exclusive. Because they have the
elements 1 and 3 in common
• Similarly, B and C are not mutually exclusive. They have the
element 2 in common
• A and C are mutually exclusive. They don’t have any element in
common
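The die example maps naturally onto Python sets: events are subsets of the sample space, intersection is `&` and union is `|`. A minimal sketch with illustrative helper names `pr` and `mutually_exclusive`:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}        # sample space for one roll of a die
A = {1, 3, 5}                 # an odd number turns up
B = {1, 2, 3}                 # a 1, 2 or 3 turns up
C = {2}                       # a 2 turns up

def pr(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

def mutually_exclusive(x, y):
    """Two events are mutually exclusive if they share no outcome."""
    return len(x & y) == 0

print(pr(A), pr(B), pr(C))
print(mutually_exclusive(A, B), mutually_exclusive(B, C), mutually_exclusive(A, C))
print(pr(A | B), pr(A) + pr(B) - pr(A & B))   # addition rule for overlapping events
```

Note that Pr(A) + Pr(B) = 1 here even though A and B overlap, so a sum of one is not evidence of mutual exclusivity.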
The Addition . . .
If two events A and B are not mutually exclusive, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Example
1. There are 80 nurses and 40 physicians in a hospital. Of these, 70 nurses and 15 physicians are females. If a staff member is selected at random, find the probability that the subject is a nurse or male.
Male Female Total
Nurse 10 70 80
Physician 25 15 40
Total 35 85 120
Note: or = union, and = intersection
P(N ∪ M) = P(N) + P(M) − P(N ∩ M)
= 80/120 + 35/120 − 10/120 = 105/120
Summary of the Additive Rule
Conditional probabilities and the multiplicative law
• Let’s assume two questions on a test: the first question is true/false and the second is a multiple-choice question with five possible answers (a, b, c, d, e)
• True or False: The heart is an organ which pumps blood in our body.
• MCQ: Which of the following human organs is used for breathing?
a. Brain b. Liver c. Lung d. Kidney e. Heart
• If the answers are random guesses, the 10 possible outcomes are equally likely, so each outcome has probability 1/10.
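The 10 equally likely outcomes of guessing both questions can be listed explicitly, which is what a tree diagram does on paper. The sketch below enumerates them and shows that the probability of guessing both answers correctly equals the product of the two separate probabilities; the answer key ("True", "c") follows the questions above.

```python
from fractions import Fraction
from itertools import product

true_false = ["True", "False"]
choices = ["a", "b", "c", "d", "e"]

outcomes = list(product(true_false, choices))   # all (answer 1, answer 2) pairs
correct = ("True", "c")                         # heart pumps blood; lung is used for breathing

p_both_correct = Fraction(outcomes.count(correct), len(outcomes))
p_first_correct = Fraction(sum(1 for o in outcomes if o[0] == "True"), len(outcomes))
p_second_correct = Fraction(sum(1 for o in outcomes if o[1] == "c"), len(outcomes))

print(len(outcomes), p_both_correct)
print(p_first_correct * p_second_correct)       # multiplicative rule for independent guesses
```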
• A tree diagram is a picture of the possible outcomes
of a procedure
Multiplicative Rule
• When two events are said to be independent of each other, this means that the occurrence of one event in no way affects the probability of the other event occurring.
• If two events A and B with non-zero probability are independent, each of the following must be true:
• P(A|B) = P(A), and P(B|A) = P(B); and so, P(A and B) = P(A) P(B)
• Eg. 1) A classic example is n tosses of a coin and the
chances that on each toss it lands heads. These are
independent events. The chance of heads on any one
toss is independent of the number of previous heads.
No matter how many heads have already been
observed, the chance of heads on the next toss is ½.
• Eg 2) a similar situation prevails with the sex of
offspring. The chance of a male is approximately ½.
Regardless of the sexes of previous offspring, the chance
the next child is a male is still ½.
• Sometimes the chance a particular event happens depends on
the outcome of some other event. This applies obviously with
many events that are spread out in time
• Eg. The chance a patient with some disease survives the next
year depends on his having survived to the present time. Such
probabilities are called conditional.
• The notation is Pr (B/A), which is read as “the probability event
B occurs given that event A has already occurred.”
• Let A and B be two events of a sample space S. The conditional probability of an event A, given B, is denoted by Pr (A/B) and defined as Pr (A/B) = P (A n B) / P (B), P (B) ≠ 0.
• Similarly, P (B/A) = P(A n B) / P(A), P(A) ≠ 0. This can be taken as an alternative form of the multiplicative law.
• For non-independent events A and B
• P (A and B) = P (A/B) P(B) or P(A and B) = P(B/A) P(A)
• Eg. Suppose in country X the chance that an infant
lives to age 25 is .95, whereas the chance that he lives
to age 65 is .65. For the latter, it is understood that to
survive to age 65 means to survive both from birth to
age 25 and from age 25 to 65. What is the chance
that a person 25 years of age survives to age 65?
Notation | Event | Probability
A | Survive birth to age 25 | .95
A and B | Survive both birth to age 25 and age 25 to 65 | .65
B/A | Survive age 25 to 65 given survival to age 25 | ?
Then, Pr (B/A) = Pr (A n B) / Pr (A) = .65/.95 = .684.
That is, a person aged 25 has a 68.4 percent chance of
living to age 65.
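The survival calculation is a direct use of the conditional probability definition, Pr(B/A) = Pr(A and B) / Pr(A). A small sketch, with an illustrative function name:

```python
def conditional_probability(p_a_and_b, p_a):
    """Pr(B given A) = Pr(A and B) / Pr(A), defined only when Pr(A) > 0."""
    if p_a <= 0:
        raise ValueError("Pr(A) must be greater than zero")
    return p_a_and_b / p_a

p_survive_to_25 = 0.95   # Pr(A)
p_survive_to_65 = 0.65   # Pr(A and B): surviving to 65 implies surviving to 25

p_65_given_25 = conditional_probability(p_survive_to_65, p_survive_to_25)
print(round(p_65_given_25, 3))
```

The conditional probability (about 0.684) is higher than the unconditional 0.65 because the person is already known to have survived the first 25 years.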
Example
1) Consider selecting a child at random from a kindergarten; let A = the event that a child is infected with ascariasis, G = the event that a child has giardiasis. Suppose P(A) = .30, P(G) = .25, P(A n G) = .13.
a) What is the probability that a child randomly selected from the KG has giardiasis, given that we know s/he has ascariasis?
Answer: P(G/A) = P(A n G) / P(A) = 0.13 / 0.30 = .43. The probability of a child having giardiasis given that s/he already has ascariasis is 43%.
b) What is the probability that a child randomly selected from the KG will test negative for both of these intestinal parasites?
Answer: P(A u G) = P(A) + P(G) − P(A n G) = 0.30 + 0.25 − 0.13 = 0.42, so P(neither) = 1 − 0.42 = 0.58
2) Of 200 senior students at a certain college, 98 are women, 34 are majoring in Biology, and 20 Biology majors are women. If one student is chosen at random from the senior class, what is the probability that the choice will be either a Biology major or a woman?
Given n = 200: male Biology majors = 14 (p = 0.07), female Biology majors = 20 (p = 0.10), other females = 78 (p = 0.39), others = 88 (p = 0.44)
P(B u W) = P(B) + P(W) − P(B n W) = 0.17 + 0.49 − 0.10 = 0.56
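Both examples can be verified with exact fractions. In this sketch the probability of testing negative for both parasites is taken as the complement of the union, 1 − P(A u G), which gives 0.58, and P(W) is taken as 98/200 = 0.49, which gives 0.56 for the second example. Variable names are illustrative.

```python
from fractions import Fraction

# Example 1: ascariasis (A) and giardiasis (G)
p_a = Fraction(30, 100)
p_g = Fraction(25, 100)
p_a_and_g = Fraction(13, 100)

p_g_given_a = p_a_and_g / p_a            # conditional probability
p_a_or_g = p_a + p_g - p_a_and_g         # addition rule
p_neither = 1 - p_a_or_g                 # complement of the union

# Example 2: Biology majors (B) and women (W) among 200 seniors
n = 200
p_b = Fraction(34, n)
p_w = Fraction(98, n)
p_b_and_w = Fraction(20, n)
p_b_or_w = p_b + p_w - p_b_and_w

print(round(float(p_g_given_a), 2), float(p_neither), float(p_b_or_w))
```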
Exercise: Calculating probability of an event
Table 1: Frequency of cocaine use by gender among adult cocaine users
------------------------------------------------------------
Lifetime frequency        Male    Female    Total
of cocaine use
------------------------------------------------------------
1-19 times                 32        7        39
20-99 times                18       20        38
more than 100 times        25        9        34
------------------------------------------------------------
Total                      75       36       111
------------------------------------------------------------
Questions
1. What is the probability that a randomly picked person is male?
2. What is the probability that a randomly picked person has used
cocaine more than 100 times?
3. Given that the selected person is male, what is the probability
that he has used cocaine more than 100 times?
4. Given that the person has used cocaine less than 100 times, what
is the probability that the person is female?
5. What is the probability that a randomly picked person is male and
has used cocaine more than 100 times?
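One way to work through these questions is to read the counts straight from the table and form the relevant ratios. The sketch below is only an illustration: the dictionary layout and the key names (such as "100+") are assumptions made for the example, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Counts from the table: rows are lifetime frequency of use, columns are gender
counts = {
    "1-19": {"male": 32, "female": 7},
    "20-99": {"male": 18, "female": 20},
    "100+": {"male": 25, "female": 9},
}

total = sum(sum(row.values()) for row in counts.values())    # 111 people
males = sum(row["male"] for row in counts.values())          # 75 males
heavy = sum(counts["100+"].values())                         # 34 heavy users
under_100 = total - heavy                                    # 77 lighter users
females_under_100 = counts["1-19"]["female"] + counts["20-99"]["female"]

p_male = Fraction(males, total)                                   # question 1
p_heavy = Fraction(heavy, total)                                  # question 2
p_heavy_given_male = Fraction(counts["100+"]["male"], males)      # question 3
p_female_given_under_100 = Fraction(females_under_100, under_100) # question 4
p_male_and_heavy = Fraction(counts["100+"]["male"], total)        # question 5

# Multiplicative rule: P(male and heavy) = P(male) * P(heavy | male)
assert p_male_and_heavy == p_male * p_heavy_given_male

print(p_male, p_heavy, p_heavy_given_male, p_female_given_under_100, p_male_and_heavy)
```

Note that the conditional probabilities in questions 3 and 4 divide by the size of the conditioning group (75 males, 77 lighter users), not by the grand total of 111.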
Summary for the Multiplicative Rule
Probability as a Numerical Measure of the Likelihood of Occurrence
[Figure: probability scale from 0 to 1, with the likelihood of occurrence increasing from left to right. At a probability of .5, the occurrence of the event is just as likely as it is unlikely.]
Permutations
The number of possible permutations is the number of
different orders in which particular events occur. The
number of possible permutations is
$N_p = \frac{n!}{(n-r)!}$
where r is the number of events in the series, n is the
number of possible events, and n! denotes the factorial
of n = the product of all the positive integers from 1 to n.
Repeated events
Combinations
When the order in which the events occurred is of no
interest, we are dealing with combinations. The number
of possible combinations is
$N_c = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}$
where r is the number of events in the series, n is the
number of possible events, and n! denotes the factorial
of n = the product of all the positive integers from 1 to
n.
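The permutation and combination counts can be computed directly in Python (version 3.8 or later, which provides math.perm and math.comb). This is a small sketch; the values n = 5 and r = 3 are hypothetical and chosen only for illustration.

```python
import math

n, r = 5, 3   # hypothetical values chosen for illustration

permutations = math.perm(n, r)   # ordered selections: n! / (n - r)!
combinations = math.comb(n, r)   # unordered selections: n! / (r! (n - r)!)

# Cross-check against the factorial formulas
assert permutations == math.factorial(n) // math.factorial(n - r)
assert combinations == math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(permutations, combinations)  # 60 10
```

Each combination of r events corresponds to r! different orderings, which is why the permutation count is r! times the combination count.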
Bayes' Theorem
•Bayes' Theorem shows the relationship between a
conditional probability and its inverse.
i.e. it allows us to make an inference from
the probability of a hypothesis given the evidence to
the probability of that evidence given the hypothesis
and vice versa
Bayes' Theorem
•$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
•P(A) – the PRIOR PROBABILITY – represents your
knowledge about A before you have gathered data.
•e.g. if 0.01 of a population has schizophrenia then the
probability that a person drawn at random would have
schizophrenia is 0.01
Bayes' Theorem
•$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
•P(B|A) – the CONDITIONAL PROBABILITY – the
probability of B, given A.
•e.g. you are trying to roll a total of 8 on two dice. What
is the probability that you achieve this, given that the
first die rolled a 6?
Bayes' Theorem
•$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
•So the theorem says:
•The probability of A given B is equal to the probability
of B given A, times the prior probability of A, divided by
the prior probability of B.
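The two-dice example can be used to check the theorem by brute-force enumeration. In the sketch below, A is the event "the first die shows 6" and B is the event "the total is 8"; the helper function prob and the lambda names are illustrative assumptions, not part of the slides.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate over (first, second)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

first_is_6 = lambda o: o[0] == 6     # event A
total_is_8 = lambda o: sum(o) == 8   # event B
both = lambda o: first_is_6(o) and total_is_8(o)

p_a = prob(first_is_6)           # prior probability P(A)
p_b = prob(total_is_8)           # P(B)
p_b_given_a = prob(both) / p_a   # conditional probability P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

# Direct count of P(A|B) for comparison
direct = prob(both) / p_b

print(p_b_given_a, p_a_given_b, direct)  # 1/6 1/5 1/5
```

Given that the first die shows 6, a total of 8 needs a 2 on the second die, so P(B|A) = 1/6. Inverting with Bayes' theorem gives P(A|B) = 1/5, which matches the direct count: five outcomes total 8 and one of them starts with a 6.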
Probability distribution
• Every random variable has a corresponding probability distribution.
• A probability distribution applies the theory of probability to describe the
behavior of the random variable.
• The term probability distribution or just distribution refers to the way data are
distributed, in order to draw conclusions about a set of data.
• A probability distribution of a random variable can be displayed by a table or a
graph or a mathematical formula.
• With categorical variables, we obtain the frequency distribution of each variable.
• With numeric variables, the aim is to determine whether or not normality may be
assumed.
I. Probability distribution of categorical variables
• The probability distribution of a categorical variable tells us with what
probability the variable will take on the different possible values.
• That is it specifies all possible outcomes of the categorical variable along with
the probability that each will occur.
E.g. Consider the value on the face showing up from tossing a die. The probability
distribution of this variable is
Value on Face 1 2 3 4 5 6
Probability 1/6 1/6 1/6 1/6 1/6 1/6
• Notice that the total probability is 1.
Bernoulli Distribution
•A random experiment with only two possible outcomes, with probabilities p
and q where p + q = 1, is called a Bernoulli trial.
•The outcome of the experiment is either a success (i.e., 1) or a
failure (i.e., 0).
•Pr(X=1) = p, Pr(X=0) = 1-p = q
•E[X] = p, Var(X) = p(1-p)
Binomial distribution
• In general, the binomial distribution involves three assumptions:
• There is a fixed number n of trials, each of which results in one of two mutually exclusive
outcomes.
• The outcomes of the n trials are independent.
• The probability of “success” is constant for each trial.
• Pr (X=success) = Pr (X=1) = p
• Pr (X=failure) = Pr (X=0) = 1-p
$P(k) = \binom{n}{k} p^k (1-p)^{n-k}$
The binomial distribution
A process that has only two possible outcomes is called
a binomial process. In statistics, the two outcomes are
frequently denoted as success and failure. A binomial
random variable is the sum of n independent and identically
distributed Bernoulli trials. The binomial distribution
gives the probability of exactly k successes in n trials:
$P(k) = \binom{n}{k} p^k (1-p)^{n-k}$
Binomial distribution….
• In addition to the probabilities of individual outcomes, we can also compute the
numerical summary measures associated with a probability distribution.
• The mean of a binomial distribution, i.e., the average number
of successes in repeated samples of size n, and its variance are
$\mu = np, \quad V = npq$
• Example 1: In a sample of 1000 people from the US population there are 290 smokers (p = 0.29).
The mean and standard deviation of the number of smokers in samples of this size are:
• Mean = n × p = 1000 × 0.29 = 290
• S.d = √(npq) = √(1000 × 0.29 × 0.71) = 14.3
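The mean and standard deviation in Example 1 follow directly from the formulas μ = np and V = npq. A minimal Python sketch, with variable names chosen only for illustration:

```python
import math

n = 1000   # sample size from Example 1
p = 0.29   # proportion of smokers
q = 1 - p  # proportion of non-smokers

mean = n * p               # mu = np
variance = n * p * q       # V = npq
sd = math.sqrt(variance)   # standard deviation is the square root of the variance

print(round(mean), round(variance, 1), round(sd, 2))
```

The standard deviation is in the same units as the count of smokers, so it describes how much the number of smokers would vary from one sample of 1000 to another.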
Binomial distribution….
Example 2: Suppose that in a certain population 52% of all recorded births are
males. If we randomly select 10 birth records, what is the probability that
•exactly 5 will be males? Given n=10, x=5, p=0.52,
$\Pr(X = x) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}$
So $\Pr(X=5) = \frac{10!}{5!\,(10-5)!} \times 0.52^5 \times (1-0.52)^{10-5} = 0.24$
•3 or more will be females? Here X counts females, so p = 1 − 0.52 = 0.48.
• Pr(X≥3) = 1- Pr (X<3) = 1-[Pr(X=0)+Pr(X=1)+Pr(X=2)]
=1-[0.001+0.013+0.055]= 1-0.069=0.931
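The birth-record example can be reproduced with a short binomial probability function. This is a sketch under the assumption of Python 3.8 or later (for math.comb); the function name binomial_pmf is illustrative.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10
p_male = 0.52
p_female = 1 - p_male

# Probability that exactly 5 of the 10 births are male
p_five_males = binomial_pmf(5, n, p_male)

# Probability that 3 or more are female: 1 - P(X <= 2)
p_three_plus_females = 1 - sum(binomial_pmf(k, n, p_female) for k in range(3))

print(round(p_five_males, 3), round(p_three_plus_females, 3))
```

Computing without intermediate rounding gives about 0.930 for the second probability; the small difference from the hand calculation comes from rounding each term to three decimals before summing.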
Random variable and Probability distributions
• A random variable is a variable that has a single numerical value, determined
by chance, for each outcome of a procedure.
• A discrete random variable has either a finite number of values or a
countable number of values. E.g., the number of eggs that a hen lays in a
day (possible values are 0, 1, or 2).
• A continuous random variable has infinitely many values, and those values
can be associated with measurements on a continuous scale in such a way
that there are no gaps or interruptions.
Eg. Voltage of electricity
Every probability distribution must satisfy each of the
following two requirements
•Since the values of a probability
distribution are probabilities, they
must be numbers in the interval from
0 to 1.
•Since a random variable has to take on one of its
values, the sum of all the values of a probability
distribution must be equal to 1.
Random Variable
•A Random Variable is a set of possible values from a
random experiment
•Example: Tossing a coin: we could get Heads or Tails.
•Let's give them the values Heads=0 and Tails=1 and
we have a Random Variable "X":
Random variable    Possible values    Random events
X                  0                  H (Heads)
                   1                  T (Tails)
• So:
• We have an experiment (like tossing a coin)
• We give values to each event
• The set of values is a Random Variable
• E.g. Toss a coin 3 times. Let X be the number of heads obtained. Find the
probability distribution of X: f(x) = Pr (X = xi), xi = 0, 1, 2, 3.
• Pr (X = 0) = 1/8 …………………………….. TTT
• Pr (X = 1) = 3/8 ……………………………. HTT THT TTH
• Pr (X = 2) = 3/8 ……………………………..HHT THH HTH
• Pr (X = 3) = 1/8 ……………………………. HHH
• Probability distribution of X:
X = xi       0     1     2     3
Pr(X=xi)    1/8   3/8   3/8   1/8
• The required conditions are also satisfied: i) f(x) ≥ 0 ii) Σ f(xi) = 1
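The same distribution can be obtained by listing every possible sequence of three tosses and counting heads. The sketch below assumes a fair coin, as the example does; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 2**3 = 8 equally likely sequences of three tosses
sequences = list(product("HT", repeat=3))

# X = number of heads in each sequence
heads_counts = Counter(seq.count("H") for seq in sequences)

# Probability distribution of X as exact fractions
distribution = {
    x: Fraction(count, len(sequences)) for x, count in sorted(heads_counts.items())
}

for x, pr in distribution.items():
    print(f"Pr(X = {x}) = {pr}")

# Both requirements hold: each probability lies in [0, 1] and they sum to 1
assert all(0 <= pr <= 1 for pr in distribution.values())
assert sum(distribution.values()) == 1
```

Enumeration like this works for any small experiment with equally likely outcomes; for larger n the binomial formula gives the same probabilities without listing all sequences.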
The birth of a son or a daughter
are mutually exclusive events
because the two events will not
happen at the same time.
The birth of a daughter and the
birth of a carrier of the sickle-cell
anemia allele are not mutually
exclusive because the two events
can happen at the same time (they
are independent events).
Example: Sex Ratio in a Family of 3
• Assume that the probability of a boy =
1/2 and the probability of a girl = 1/2.
i. How many possible sex distributions are there for a
family of 3 children?
ii. What is the probability of occurrence
of each event?
iii. What is the chance of 2 boys AND 1
girl?
Child #1   Child #2   Child #3
B          B          B
B          B          G
B          G          B
B          G          G
G          B          B
G          B          G
G          G          B
G          G          G
• Solution:
i. 8 possibilities
ii. The probability of each event is 1/8 (1/2 x 1/2 x 1/2).
iii. There are 3 ways to have 2 boys AND 1 girl: BBG, BGB, and GBB.
• Thus, the chance is 1/8 + 1/8 + 1/8 = 3/8.
The expected value of a discrete random variable
The expected value, denoted by E(X) or μ, represents the “average” value of the random variable. It is
obtained by multiplying each possible value by its respective probability and summing over all the values
that have positive probability.
Definition: The expected value of a discrete random variable is defined as
$E(X) = \mu = \sum_{i=1}^{n} x_i \, P(X = x_i)$
where the xi’s are the values the random variable assumes with positive probability.
Example: Consider the random variable representing the number of episodes of diarrhea in the first 2
years of life. Suppose this random variable has a probability mass function as below
r           0     1     2     3     4     5     6
P(X = r)   .129  .264  .271  .185  .095  .039  .017
What is the expected number of episodes of diarrhea in the first 2 years of life?
E(X) = 0(.129) +1(.264) +2(.271) +3(.185) +4(.095) +5(.039) +6(.017) = 2.038
Thus, on average a child would be expected to have 2 episodes of diarrhea in the first 2 years of life.
The variance of a discrete random variable
The variance represents the spread of all values that have positive probability relative to the expected
value. In particular, the variance is obtained by multiplying the squared distance of each possible value
from the expected value by its respective probability and summing over all the values that have positive
probability.
Definition: The variance of a discrete random variable X, denoted by V(X) or σ², is defined by
$V(X) = \sigma^2 = \sum_{i=1}^{k} (x_i - \mu)^2 \, P(X = x_i) = \sum_{i=1}^{k} x_i^2 \, P(X = x_i) - \mu^2$
where the xi’s are the values for which the random variable takes on positive probability. The SD of a
random variable X, denoted by SD(X) or σ, is defined as the square root of its variance.
Example: Compute the variance and SD for the random variable representing number of episodes
of diarrhea in the first 2 years of life.
E(X) = μ = 2.038 ≈ 2.04
$\sum_{i=1}^{k} x_i^2 \, P(X = x_i)$ = 0²(.129) + 1²(.264) + 2²(.271) + 3²(.185) + 4²(.095) + 5²(.039) + 6²(.017) = 6.12
Thus, V(X) = 6.12 – (2.038)² = 1.967 and the SD of X is σ = √1.967 = 1.402
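The expected value, variance, and SD for the diarrhea example can be verified with a few lines of Python. This is a minimal sketch; the dictionary pmf and the other names are illustrative. It computes the variance both from the definition and from the shortcut formula to show that they agree.

```python
import math

# Probability mass function from the diarrhea example: {episodes: probability}
pmf = {0: 0.129, 1: 0.264, 2: 0.271, 3: 0.185, 4: 0.095, 5: 0.039, 6: 0.017}

# Sanity check: the probabilities must sum to 1
assert abs(sum(pmf.values()) - 1.0) < 1e-9

mean = sum(x * p for x, p in pmf.items())               # E(X)
second_moment = sum(x**2 * p for x, p in pmf.items())   # sum of x^2 P(X = x)
variance = second_moment - mean**2                      # shortcut formula
variance_direct = sum((x - mean) ** 2 * p for x, p in pmf.items())  # definition
sd = math.sqrt(variance)

print(round(mean, 3), round(second_moment, 2), round(variance, 3), round(sd, 3))
```

Using the unrounded mean 2.038 rather than 2.04 in the shortcut formula is what yields a variance of about 1.967.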
Binomial distribution, generally
Note the general pattern emerging: if you have only two
possible outcomes (call them 1/0 or yes/no or success/failure) in n
independent trials, then the probability of exactly X “successes” =
$\binom{n}{X} p^X (1-p)^{n-X}$
where
n = number of trials
X = # successes out of n trials
p = probability of success
1-p = probability of failure
Exercise
1. Each child born to a particular set of
parents has a probability of 0.25 of having
blood type O. If these parents have 5
children, what is the probability that
a. Exactly two of them have blood type O? = 0.2637
b. At most 2 have blood type O? = 0.8965
c. At least 4 have blood type O? = 0.0156
d. 2 do not have blood type O (i.e., exactly 3 have blood type O)? = 0.0879
Exercise….
2. Suppose past experience in a certain malarious area
indicates that the probability that a person with a high
fever will be positive for malaria is 0.7. Consider 3
randomly selected patients (with high fever) in that same
area.
a) What is the probability that no patient will be positive
for malaria? = 0.027
b) What is the probability that exactly one patient will be
positive for malaria? = 0.189
c) What is the probability that exactly two of the patients
will be positive for malaria? = 0.441
d) What is the probability that all patients will be positive
for malaria? = 0.343
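Both exercises are direct applications of the binomial formula, so the answers can be checked by tabulating every term. The sketch below assumes Python 3.8 or later; the function and variable names are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exercise 1: n = 5 children, each with probability 0.25 of blood type O
blood = [binomial_pmf(k, 5, 0.25) for k in range(6)]
exactly_two = blood[2]
at_most_two = sum(blood[:3])      # P(X = 0) + P(X = 1) + P(X = 2)
at_least_four = sum(blood[4:])    # P(X = 4) + P(X = 5)
two_not_o = blood[3]              # 2 children not type O means exactly 3 are type O

# Exercise 2: n = 3 patients, each with probability 0.7 of testing positive
malaria = [binomial_pmf(k, 3, 0.7) for k in range(4)]

print([round(v, 4) for v in (exactly_two, at_most_two, at_least_four, two_not_o)])
print([round(v, 3) for v in malaria])
```

Because the terms for k = 0, 1, ..., n cover every possible outcome, each list sums to 1, which is a quick check on the arithmetic.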
The Poisson distribution
When the probability of “success” is very small, e.g., the
probability of a mutation, then p^k and (1 – p)^(n – k) become too
small to calculate exactly by the binomial distribution. In
such cases, the Poisson distribution becomes useful. Let λ
be the expected number of successes in a process
consisting of n trials, i.e., λ = np. The probability of
observing k successes is
$P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$
The mean and variance of a Poisson distributed variable are
given by μ = λ and V = λ, respectively.
Plots of Poisson Distribution
The Poisson distribution…
•Example 3. Suppose X is a random variable representing
the number of individuals involved in a road accident
each year (in the US, 2.4 per 10,000 population are involved
each year)
•I.e. λ = 2.4 per 10,000
•$\Pr(X=0) = \frac{e^{-2.4}\, 2.4^0}{0!} = 0.091$
•$\Pr(X=1) = \frac{e^{-2.4}\, 2.4^1}{1!} = 0.218$
•$\Pr(X=2) = \frac{e^{-2.4}\, 2.4^2}{2!} = 0.261$
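The Poisson probabilities in Example 3 can be computed with a short function. This is a minimal sketch; the name poisson_pmf and the parameter name lam (for λ) are illustrative choices.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

lam = 2.4  # expected number involved per 10,000 population per year

probabilities = {k: poisson_pmf(k, lam) for k in range(3)}
for k, pr in probabilities.items():
    print(k, round(pr, 3))
```

Only λ is needed to evaluate the probabilities, which is what makes the Poisson form convenient when n is large and p is small.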
II. Probability distribution of Numeric variables
1. Probability distribution of a discrete variable
•Let X be a discrete random variable, such as
number of new AIDS cases reported during
one year period, number of children in a
family
•To construct the probability distribution for
X we list each of the values x the variable
assumes and its associated probability
(relative frequency).
Characteristics of a distribution
•Features commonly used to describe a distribution are
location, dispersion, modality and skewness.
•Location tells us something about the average
value of the variable.
•Dispersion tells us something about how spread
out the values of the variable are.
•Modality refers to the number of peaks in the
distribution.
•Skewness refers to whether or not the
distribution is symmetric.
•A distribution is said to be symmetric if it is
symmetrically distributed about its center.
2.Probability distribution of continuous variables
•Under different circumstances, the outcome of a random
variable may not be limited to categories or counts.
•E.g. Suppose X represents the continuous
variable ‘Height’; rarely is an individual
exactly 170 cm tall.
• X can assume an infinite number of
intermediate values 170.1, 170.2, 170.3 etc.
•Because a continuous random variable X can take on an
uncountably infinite number of values, the probability
associated with any one particular value is
zero.
Continuous Random Variables
• A smooth curve describes the probability distribution of a
continuous random variable.
•The depth or density of the probability, which varies with x,
may be described by a mathematical formula f (x ), called
the probability distribution or probability density function for
the random variable x.
Properties of Continuous Probability Distributions
• The area under the curve is equal to 1.
• P(a ≤ x ≤ b) = area under the curve between a and b.
•There is no probability attached to any
single value of x. That is, P(x = a) = 0.
Continuous Probability Distributions
• There are many different types of continuous
random variables
• We try to pick a model that
• Fits the data well
• Allows us to make the best possible
inferences using the data.
• One important continuous random variable is the
normal random variable.
The Normal(Gaussian) Distribution
•The normal distribution is used extensively in the analyses of
continuous variables and has an especially important role in
statistics.
•It has been found to be a good approximation for many
distributions that arise in practice.
•The normal distribution is unimodal and symmetric.
•The normal distribution is completely described by two
parameters, referred to as the mean μ (read as ‘mu’) and standard
deviation σ (read ‘sigma’).
•The mean μ can be any number (negative, positive or zero).
•The standard deviation σ must be a positive number.
•The mean μ defines the location of the distribution and the SD
(standard deviation) σ defines the dispersion of the distribution
about the mean.
The Normal Distribution
• The formula that generates the
normal probability distribution is:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \quad \text{for } -\infty < x < \infty$
e = 2.7183, π = 3.1416; μ and σ are the population mean and standard deviation.
• The shape and location of the normal curve change
as the mean and standard deviation change.
How the Normal curve shifts when parameters change
[Figure: normal curves with the same location (μ) but different σ (S.D): σ=1, σ=2, σ=3]
[Figure: normal curves with the same σ but different location (mean): μ=0, μ=1, μ=2]
Empirical rule
68%: σ=1 means 68% of the x values lie within 1σ from the mean
95%: σ=2 means 95% of the x values lie within 2σ from the mean
99.7%: σ=3 means 99.7% of the x values lie within 3σ from the mean
Biostatistics course by Girma Taye (PhD), AAU
The standard normal distribution
• Since a normal distribution can have any of an infinite number of possible values for its
mean and SD, it is impossible to tabulate the area associated with each and every
normal curve.
• Instead only a single curve for which μ = 0 and σ = 1 is tabulated.
• The curve is called the standard normal distribution (SND).
The Standard Normal Distribution
•To find P(a < x < b), we need to find the area under the
appropriate normal curve.
•To simplify the tabulation of these areas, we
standardize each value of x by expressing it as a z-score,
the number of standard deviations σ it lies from
the mean μ.
$z = \frac{x - \mu}{\sigma}$
The Standard Normal
(z) Distribution
• Mean = 0; Standard deviation = 1
• When x = μ, z = 0
• Symmetric about z = 0
• Values of z to the left of center are negative
• Values of z to the right of center are positive
• Total area under the curve is 1.
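Using normal table
The four-digit probability in a particular row and column
of Table 1 gives the area under the z curve to the left of
that particular value of z.
(The normal table reproduced at the end of these slides lists the area between 0 and z;
for z > 0, add .5 to obtain the area to the left, e.g. .5 + .4131 = .9131 for z = 1.36.)
Area for z = 1.36
P(z ≤ 1.36) = .9131
P(z > 1.36) = 1 - .9131 = .0869
P(-1.20 ≤ z ≤ 1.36) = .9131 - .1151 = .7980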
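The table look-ups for z = 1.36 and z = -1.20 can be reproduced with Python's standard library instead of a printed table. This sketch assumes Python 3.8 or later, where statistics.NormalDist is available; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

left_of_136 = z.cdf(1.36)                # P(z <= 1.36), area to the left
right_of_136 = 1 - left_of_136           # P(z > 1.36), area to the right
between = z.cdf(1.36) - z.cdf(-1.20)     # P(-1.20 <= z <= 1.36)

print(round(left_of_136, 4), round(right_of_136, 4), round(between, 4))
```

The cdf method returns the area to the left of a value, so areas to the right and between two values are obtained by subtraction, exactly as with the table.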
Example
Use Table 1 to calculate these probabilities:
Example
The weights of packages of ground beef are
normally distributed with mean 1 pound and
standard deviation .10 pound. What is the probability
that a randomly selected package weighs between
0.80 and 0.85 pounds?
$P(.80 < x < .85) = P(-2 < z < -1.5) = .0668 - .0228 = .0440$
Example
What is the weight of a package
such that only 1% of all packages
exceed this weight?
$P(x > ?) = .01$
$P\left(z > \frac{? - 1}{.1}\right) = .01$
From Table 1, $\frac{? - 1}{.1} = 2.33$
$? = 2.33(.1) + 1 = 1.233$
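Both ground beef questions can be answered without standardizing by hand. The sketch below assumes Python 3.8 or later (statistics.NormalDist); the variable names are illustrative. The first question needs an area between two values, the second an inverse look-up (a percentile).

```python
from statistics import NormalDist

weights = NormalDist(mu=1.0, sigma=0.10)  # package weight in pounds

# P(0.80 < x < 0.85): difference of two cumulative probabilities
p_between = weights.cdf(0.85) - weights.cdf(0.80)

# Weight exceeded by only 1% of packages: the 99th percentile
cutoff = weights.inv_cdf(0.99)

print(round(p_between, 3), round(cutoff, 3))
```

The results agree with the table-based answers of about .044 and 1.233 pounds; tiny differences in the last digit arise because the printed table rounds z and the areas.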
Approximating the Binomial
Make sure to include the entire rectangle for the
values of x in the interval of interest. This is called
the continuity correction.
Standardize the values of x using
$z = \frac{x - np}{\sqrt{npq}}$
Make sure that np and nq are both greater
than 5 to avoid inaccurate approximations!
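To see how the approximation and the continuity correction work in practice, the sketch below compares an exact binomial probability with its normal approximation. The values n = 1000 and p = 0.29 are borrowed from the smokers example earlier in the slides, while the cutoff x = 300 is a hypothetical value chosen only for illustration. Python 3.8 or later is assumed.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 1000, 0.29   # borrowed from the smokers example
q = 1 - p
x = 300             # hypothetical cutoff chosen for illustration

# Rule of thumb: np and nq should both exceed 5
assert n * p > 5 and n * q > 5

# Exact binomial probability P(X <= x)
exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(x + 1))

# Normal approximation with continuity correction: the rectangle for
# x = 300 extends to 300.5, so standardize x + 0.5
z = (x + 0.5 - n * p) / sqrt(n * p * q)
approx = NormalDist().cdf(z)

print(round(exact, 4), round(approx, 4))
```

Using x + 0.5 rather than x includes the whole rectangle for the value 300, which is the continuity correction described above.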
Exercise
Data collected on systolic blood pressure in normal
healthy individuals are normally distributed with μ= 120
and σ= 10 mm Hg.
1)What proportion of normal healthy individuals have a
systolic blood pressure above 130 mm Hg? = 0.1587
2)What proportion of normal healthy individuals have a
systolic blood pressure between 100 and 140 mm
Hg? = 0.9544
3)What level of systolic blood pressure cuts off the lower
95% of normal healthy individuals? = 136.45 mm Hg
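The blood pressure exercise can be checked with the standard library. This is a sketch assuming Python 3.8 or later; the variable names are illustrative.

```python
from statistics import NormalDist

sbp = NormalDist(mu=120, sigma=10)  # systolic blood pressure in mm Hg

above_130 = 1 - sbp.cdf(130)                    # P(X > 130), i.e. z > 1
between_100_140 = sbp.cdf(140) - sbp.cdf(100)   # P(100 < X < 140), i.e. -2 < z < 2
lower_95_cutoff = sbp.inv_cdf(0.95)             # value with 95% of people below it

print(round(above_130, 4), round(between_100_140, 4), round(lower_95_cutoff, 2))
```

The third question asks for a blood pressure level rather than a probability, so it uses the inverse of the cumulative distribution (z ≈ 1.645, giving 120 + 1.645 × 10).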
[Figure: normal curve with the horizontal axis marked at μ-3σ, μ-2σ, μ-σ, μ, μ+σ, μ+2σ, μ+3σ]
Fig.3. Percentage of area under a normal distribution with mean μ and
standard deviation σ
Empirical rule
For any normal distribution,
• about 68% (most) of the observations are contained within one SD of
the mean.
• about 95% (majority) of the probability is contained within two SDs
and 99.7% (almost all) within three SDs of the mean.
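The percentages in the empirical rule can be recovered from the standard normal distribution. A minimal sketch, assuming Python 3.8 or later; the dictionary name coverage is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal; the areas are the same for any mu and sigma

# Area within k standard deviations of the mean, for k = 1, 2, 3
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in coverage.items():
    print(f"within {k} SD: {area:.4f}")
```

Because standardizing maps every normal distribution onto the same z scale, these three areas apply regardless of the particular mean and standard deviation.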
Exercises
• Find the probability of the following under the SND
•Above 1.96? P(z > 1.96) = 0.5 − 0.4750 = 0.025
•Below –1.96? P(z < −1.96) = 0.5 − 0.4750 = 0.025
•Between –1.28 and 1.28? P(−1.28 < z < 1.28) = 2(0.3997) = 0.7994
•Between –1.65 and 1.08? 0.4505 + 0.3599 = 0.8104
•What level cuts the upper 25%? Area to the left = 1 − 0.25 = 0.75, so z ≈ 0.67
•What level cuts the middle 99%? 1 − 0.99 = 0.01, 0.01/2 = 0.005 in each tail, so z ≈ ±2.58
Area between 0 and z
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
Table 1: Normal distribution
t table with right tail probabilities
df\p 0.40 0.25 0.10 0.05 0.025 0.01 0.005 0.0005
1 0.324920 1.000000 3.077684 6.313752 12.70620 31.82052 63.65674 636.6192
2 0.288675 0.816497 1.885618 2.919986 4.30265 6.96456 9.92484 31.5991
3 0.276671 0.764892 1.637744 2.353363 3.18245 4.54070 5.84091 12.9240
4 0.270722 0.740697 1.533206 2.131847 2.77645 3.74695 4.60409 8.6103
5 0.267181 0.726687 1.475884 2.015048 2.57058 3.36493 4.03214 6.8688
6 0.264835 0.717558 1.439756 1.943180 2.44691 3.14267 3.70743 5.9588
7 0.263167 0.711142 1.414924 1.894579 2.36462 2.99795 3.49948 5.4079
8 0.261921 0.706387 1.396815 1.859548 2.30600 2.89646 3.35539 5.0413
9 0.260955 0.702722 1.383029 1.833113 2.26216 2.82144 3.24984 4.7809
10 0.260185 0.699812 1.372184 1.812461 2.22814 2.76377 3.16927 4.5869
Table 2: Student’s t-distribution
Thank you!


Recently uploaded (20)

VIP Call Girls Pune Vani 9907093804 Short 1500 Night 6000 Best call girls Ser...
VIP Call Girls Pune Vani 9907093804 Short 1500 Night 6000 Best call girls Ser...VIP Call Girls Pune Vani 9907093804 Short 1500 Night 6000 Best call girls Ser...
VIP Call Girls Pune Vani 9907093804 Short 1500 Night 6000 Best call girls Ser...
 
Call Girls In Andheri East Call 9920874524 Book Hot And Sexy Girls
Call Girls In Andheri East Call 9920874524 Book Hot And Sexy GirlsCall Girls In Andheri East Call 9920874524 Book Hot And Sexy Girls
Call Girls In Andheri East Call 9920874524 Book Hot And Sexy Girls
 
VIP Call Girls Pune Sanjana 9907093804 Short 1500 Night 6000 Best call girls ...
VIP Call Girls Pune Sanjana 9907093804 Short 1500 Night 6000 Best call girls ...VIP Call Girls Pune Sanjana 9907093804 Short 1500 Night 6000 Best call girls ...
VIP Call Girls Pune Sanjana 9907093804 Short 1500 Night 6000 Best call girls ...
 
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
 
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
 
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
 
Bangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% Safe
Bangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% SafeBangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% Safe
Bangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% Safe
 
Call Girls Service In Shyam Nagar Whatsapp 8445551418 Independent Escort Service
Call Girls Service In Shyam Nagar Whatsapp 8445551418 Independent Escort ServiceCall Girls Service In Shyam Nagar Whatsapp 8445551418 Independent Escort Service
Call Girls Service In Shyam Nagar Whatsapp 8445551418 Independent Escort Service
 
VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...
VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...
VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...
 
High Profile Call Girls Coimbatore Saanvi☎️ 8250192130 Independent Escort Se...
High Profile Call Girls Coimbatore Saanvi☎️  8250192130 Independent Escort Se...High Profile Call Girls Coimbatore Saanvi☎️  8250192130 Independent Escort Se...
High Profile Call Girls Coimbatore Saanvi☎️ 8250192130 Independent Escort Se...
 
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
 
Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls JaipurCall Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
 
CALL ON ➥9907093804 🔝 Call Girls Baramati ( Pune) Girls Service
CALL ON ➥9907093804 🔝 Call Girls Baramati ( Pune)  Girls ServiceCALL ON ➥9907093804 🔝 Call Girls Baramati ( Pune)  Girls Service
CALL ON ➥9907093804 🔝 Call Girls Baramati ( Pune) Girls Service
 
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls ServiceKesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
 
Bangalore Call Girls Nelamangala Number 7001035870 Meetin With Bangalore Esc...
Bangalore Call Girls Nelamangala Number 7001035870  Meetin With Bangalore Esc...Bangalore Call Girls Nelamangala Number 7001035870  Meetin With Bangalore Esc...
Bangalore Call Girls Nelamangala Number 7001035870 Meetin With Bangalore Esc...
 
Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...
Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...
Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...
 
College Call Girls Pune Mira 9907093804 Short 1500 Night 6000 Best call girls...
College Call Girls Pune Mira 9907093804 Short 1500 Night 6000 Best call girls...College Call Girls Pune Mira 9907093804 Short 1500 Night 6000 Best call girls...
College Call Girls Pune Mira 9907093804 Short 1500 Night 6000 Best call girls...
 
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore EscortsCall Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
 
Call Girls Service Bellary Road Just Call 7001305949 Enjoy College Girls Service
Call Girls Service Bellary Road Just Call 7001305949 Enjoy College Girls ServiceCall Girls Service Bellary Road Just Call 7001305949 Enjoy College Girls Service
Call Girls Service Bellary Road Just Call 7001305949 Enjoy College Girls Service
 
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort ServicePremium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
 

1 Introduction to Biostatistics.pptx

  • 1. Biostatistics Mengistu Y. (BSC, MPH-HI, PhD fellow, Assi. Prof. PH) 2022
  • 2. Learning Objectives General Objective ♦ To provide the statistical methods and numerical descriptions that are useful to generate information about certain situations and present them in such a way that valid interpretations are possible Specific Objectives ♦ design, organize, present and summarize data ♦ understand the process involved in data collection and processing ♦ distinguish between categorical and numeric data ♦ understand probabilities and their applications ♦ interpret summary statistics, graphical displays and contingency tables commonly presented in the health literature ♦ carry out exploratory data analysis ♦ understand the process involved in estimation and hypothesis testing ♦ interpret the functions of confidence intervals and p-values ♦ give an interpretation or reach a conclusion about a population on the basis of information contained in a sample drawn from that population.
  • 3. Course content ♦ Introduction to the course ♦ Data and Scales of measurement ♦ Methods of data organization and presentation ♦ Frequency distribution ♦ Measures of central tendency and dispersion ♦ Basic principles of probability ♦ Rules of probability and applications (additive, multiplicative, Bayes')
  • 4. References: (available in the Library) 1. Gordis, L. (2009). Epidemiology (4th ed.). USA. Elsevier Inc. 2. Koepsell & Weiss. Epidemiologic Methods. Oxford University Press, 2003. 3. Last (ed.). Dictionary of Epidemiology, 1995. 4. Rothman, Kenneth J.; Greenland, Sander; Lash, Timothy L. Modern Epidemiology, 3rd Ed. Lippincott Williams & Wilkins, 2008. 5. Martin Bland. An Introduction to Medical Statistics. 6. Colton T. Statistics in Medicine. 7. Daniel W. Biostatistics: A Foundation for Analysis in the Health Sciences. 8. Kirkwood BR. Essentials of Medical Statistics. 9. Knapp RG, Miller MC. Clinical Epidemiology and Biostatistics. Baltimore: Williams and Wilkins, 1992. 10. Pagano & Gauvreau. Principles of Biostatistics. 11. Schlesselman, J.J. Case-Control Studies: Design, Conduct, Analysis. Oxford University Press, New York, 1982. 12. Breslow, N.E. Statistical Methods in Cancer Research, Volume I: The Analysis of Case-Control Studies.
  • 5. Introduction Statistics is the science of gaining information from data through: • collecting data • organizing data • summarizing data • presenting data • analysing and drawing conclusions (inferences) from data. • It is helpful to think of the process of data analysis as consisting of three stages: management, descriptive and inferential.
  • 6. Definitions • Statistics: is used to mean either statistical data or statistical methods. • Statistical data: • When it means statistical data, it refers to numerical descriptions of things. • These descriptions may take the form of counts or measurements. NB: Even though statistical data always denote figures (numerical descriptions), it must be remembered that not all 'numerical descriptions' are statistical data. • Statistical methods: • It refers to a body of methods that are used for collecting, organising, summarizing, analysing and interpreting numerical data for understanding a phenomenon or making wise decisions. 5/12/2023
  • 7. Definitions… • Biostatistics is the application of different statistical methods to biological, medical and public health data. • A population is any specific collection of objects of interest. • A sample is any subset or sub-collection of the population. • A census is the special case in which the sample consists of the whole population.
  • 8. Definitions ... • A measurement is a number or attribute computed for each member of a population or of a sample. • A parameter is a characteristic of the population as a whole. • A statistic is a characteristic of the sample data. • Descriptive statistics is the study of data: it involves organizing, displaying, and describing properties of the data. • Inferential statistics is drawing conclusions about a population of interest based on information contained in the sample taken from the population.
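The difference between a parameter and a statistic can be illustrated with a small sketch. The population below is simulated and entirely hypothetical (the variable, its mean of 120 and spread of 15, and the sample size of 50 are assumptions made only for illustration); the point is that the same summary, here the mean, is a parameter when computed on the whole population and a statistic when computed on a sample.

```python
import random
from statistics import mean

# Hypothetical population: systolic blood pressure (mmHg) of 1,000 people.
# The values are simulated purely for illustration.
rng = random.Random(42)
population = [round(rng.gauss(120, 15)) for _ in range(1000)]

# Parameter: a characteristic of the population as a whole.
population_mean = mean(population)

# Sample: a subset of the population, drawn at random without replacement.
sample = rng.sample(population, 50)

# Statistic: the same characteristic computed from the sample data only.
sample_mean = mean(sample)

print(f"parameter (population mean): {population_mean:.1f}")
print(f"statistic (sample mean):     {sample_mean:.1f}")
```

Inferential statistics is concerned with how far a statistic such as `sample_mean` can be trusted as an estimate of the corresponding parameter.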
  • 9. Definitions … • The distinction between a population together with its parameters and a sample together with its statistics is a fundamental concept in inferential statistics. [Figure: population (parameters), sample (statistics), inference]
  • 10. Definition … • A variable is a characteristic which takes different values in different persons, places, or things. • Variables are things that we measure, control, or manipulate in research. ♦ Data are measurements or observations (values) recorded for each element. For example, data include records of weight, length, breaking strength, age, sex, religion, marital status, income, etc. Based on the nature of the variables, we can have qualitative and quantitative data.
  • 11. Dependent vs. Independent Independent variable: ♦ A variable that you believe might influence your outcome measure. ♦ This might be a variable that you control, like a treatment, or a variable not under your control, like an exposure. It also might represent a demographic factor like age or gender. ♦ An independent variable is a hypothesized cause of the dependent variable • Any variable that you are using to make those predictions is an independent variable. • Example: The relationship of dietary fat consumption and the development of ischemic stroke. In this study, the independent variables were: Percentage of total fat in the diet,
  • 12. Dependent variable • In research, the dependent variable is the variable that you believe might be influenced or modified by some treatment or exposure. • It may also represent the variable you are trying to predict. • The dependent variable is also called the outcome variable. This definition depends on the context of the study. • Example: A study examined the relationship of dietary fat consumption and the development of ischemic stroke. • In this study, the dependent variable was the incidence of ischemic stroke.
  • 13. Characteristics of statistical data i) They must be in aggregates: statistics are 'numbers of facts'. A single fact, even though numerically stated, cannot be called statistics. ii) They must be affected to a marked extent by a multiplicity of causes. This means that statistics are aggregates of such facts only as grow out of a 'variety of circumstances'. Thus the explosion of an outbreak is attributable to a number of factors, e.g. human factors, parasite factors and environmental factors. iii) They must be enumerated or estimated according to a reasonable standard of accuracy. If statistical data are incorrect, the results are bound to be misleading.
  • 14. Characteristics… iv) They must have been collected in a systematic manner for a predetermined purpose. Numerical data can be called statistics only if they have been compiled in a properly planned manner and for a purpose about which the enumerator had a definite idea. v) They must be placed in relation to each other. That is, they must be comparable. Numerical facts may be placed in relation to each other either in point of time, space or condition.
  • 15. Source of data • Routine data collection • Routine health unit and community data • Activity data about patients seen and programmes run, routine services and epidemiological surveillance • Semi-permanent data about the population served, the facility itself and the staff that run it • Vital registration • Non-routine data collection • Surveys • Population census (headcounts proportion/facility catchment area) • Quantitative or qualitative rapid assessments.
  • 16. Techniques of data collection Data collection is a crucial stage in the planning and implementation of a study. If the data collection has been superficial, biased or incomplete, data analysis becomes difficult, and the research report will be of poor quality. Therefore, we should concentrate all possible efforts on developing appropriate tools, and should test them several times.
  • 17. Techniques of collecting data, cont'd • Observation is a technique that involves systematically selecting, watching and recording behavior and characteristics of living things, objects or phenomena. • Observation of human behavior is a much-used data collection technique. It can be undertaken in different ways: • Participant observation: The observer takes part in the situation he or she observes. • Non-participant observation: The observer watches the situation, openly or concealed, but does not participate.
  • 18. Data collection techniques, cont'd • Observations can give additional, more accurate information on the behavior of people than interviews or questionnaires. • Observations can also be made on objects. • For example, the presence or absence of a latrine and its state of cleanliness may be observed. • Here observation would be the major research technique.
  • 19. Data collection techniques, cont'd • Interview (face-to-face): a data-collection technique that involves oral questioning of respondents, either individually or as a group. • Answers to the questions posed during an interview can be recorded by writing them down (either during the interview itself or immediately after the interview), by tape-recording the responses, or by a combination of both.
  • 20. Data collection techniques, cont'd • Administering a written questionnaire: a written questionnaire is a data collection tool in which written questions are presented that are to be answered by the respondents in written form. • A written questionnaire can be administered in different ways, such as by: • sending questionnaires by mail with clear instructions on how to answer the questions and asking for mailed responses; • gathering all or part of the respondents in one place at one time, giving oral or written instructions, and letting the respondents fill out the questionnaires; • hand-delivering questionnaires to respondents and collecting them later.
  • 21. Types of questions • Depending on how questions are asked and recorded, we can distinguish two major possibilities. 1. Open-ended questions (allowing for completely open as well as partially categorized answers): they permit free responses, which should be recorded in the respondents' own words.
  • 22. Types of questions Such questions are useful for obtaining in-depth information on: • facts with which the researcher is not very familiar, • opinions, attitudes and suggestions of informants, or • sensitive issues.
  • 23. Types of questions • Examples: 1. 'What is your opinion on the services provided in the ANC?' (Explain why.) 2. 'What do you think are the reasons some adolescents in this area start using drugs?' 3. 'What would you do if you noticed that your daughter (school girl) had a relationship with someone?'
  • 24. Types of questions • Advantages of open-ended questions • They allow you to probe more deeply into issues of interest being raised. • Information provided in the respondents' own words might be useful. • Risks of completely open-ended questions • A big risk is incomplete recording of all relevant issues covered in the discussion. • Analysis is time-consuming and requires experience; otherwise important data may be lost.
  • 25. Types of questions 2. Closed questions: have a list of possible options or answers from which the respondents must choose. Closed questions are most commonly used for background variables such as age, marital status or education, although in the case of age and education you may also take the exact values and categorise them during data analysis.
  • 26. Types of questions 1. 'Women who have induced abortion should be severely punished.'
  • 27. Types of questions 2. 'Did you eat any of the following foods yesterday?' (Circle yes if at least one item in each set of items is eaten.)
  • 28. Types of questions • Advantages of closed questions • They save time. • Comparing responses of different groups, or of the same group over time, becomes easier. • Risks of closed questions • In the case of illiterate respondents, bias will be introduced.
  • 29. Steps in designing questionnaire 1. Content: Take your objectives and variables as a starting point. 2. Formulating questions: Formulate one or more questions that will provide the information needed for each variable. • Check whether each question measures one thing at a time. • Avoid leading questions. • Ask sensitive questions in a socially acceptable way.
  • 30. Steps in designing questionnaire 3. Sequencing the questions: Design your interview schedule or questionnaire to be 'informant friendly'. 4. Formatting the questionnaire: When you finalize your questionnaire, be sure that: • A separate, introductory page is attached to each questionnaire
  • 31. Steps in designing questionnaire • explaining the purpose of the study • requesting the informant's consent to be interviewed • assuring confidentiality of the data obtained. • Each questionnaire has a heading and space to insert the number, date and location of the interview. • You may add the name of the interviewer, to facilitate quality control.
  • 33. For qualitative study • Focus group discussions: allow a group of 8 - 12 informants to freely discuss a certain subject with the guidance of a facilitator or reporter • In-depth interview • Key informant interview
  • 34. Rationale of studying statistics • Why do we need to use statistics? The reason is the presence of variability. • Statistics provides a way of organizing information on a wider and more formal basis. • More and more things are now measured quantitatively in medicine and public health. • There is a great deal of intrinsic (inherent) variation in most biological processes. • Public health and medicine are becoming increasingly quantitative. As technology progresses, the physician encounters more and more quantitative rather than descriptive information.
  • 35. Rationale…. • The planning, conduct, and interpretation of much of medical research is becoming increasingly reliant on statistical technology. Is this new drug or procedure better than the one commonly in use? How much better? What, if any, are the risks of side effects associated with its use? • Statistics pervades the medical literature.
  • 36. Limitations of statistics 1. It deals with aggregates of facts, and no importance is attached to individual items; it is suited only when group characteristics are to be studied. 2. Statistical data are only approximately, not mathematically, correct.
  • 37. Data and types of data • Qualitative (or categorical) data consist of values that can be separated into different categories that are distinguished by some nonnumeric characteristic. • They cannot be measured in quantitative form but can only be identified by name or category. • Quantitative data consist of values representing counts or measurements. They are expressed numerically and can be of two types (discrete or continuous).
  • 38. Types of Quantitative Data • Continuous data can take on any value in a given interval. Continuous data values result from some continuous scale that covers a range of values without gaps, interruptions, or jumps. • Discrete data can take on only particular distinct values and not other values in between. The number of possible values in discrete data is either finite or countable.
  • 39. Scale of measurement • Nominal • Ordinal • Interval • Ratio • Nominal and ordinal are qualitative (categorical) levels of measurement. • Interval and ratio are quantitative levels of measurement.
  • 40. Types of Variables • Variable types can be distinguished based on their scale. Typically, different statistical methods are appropriate for variables of different scales.
      Scale    | Characteristic question              | Examples
      Nominal  | Is A different than B?               | Marital status, eye color, gender, religious affiliation, race
      Ordinal  | Is A bigger than B?                  | Stage of disease, severity of pain, level of satisfaction
      Interval | By how many units do A and B differ? | Temperature
      Ratio    | How many times bigger than B is A?   | Distance, length, time until death, weight
  • 41. Operations that make sense for variables of different scales
      Scale    | Counting | Ranking | Addition/subtraction | Multiplication/division
      Nominal  | yes      | no      | no                   | no
      Ordinal  | yes      | yes     | no                   | no
      Interval | yes      | yes     | yes                  | no
      Ratio    | yes      | yes     | yes                  | yes
  • 42. TYPES OF QUALITATIVE MEASUREMENTS • Nominal level of measurement: classifies data into names, labels or categories in which no order or ranking can be imposed. Examples: Sex (M, F); Exam result (P, F); Blood group (A, B, O or AB); Color of eyes (blue, green, brown, black).
  • 43. • Ordinal level of measurement: classifies data into categories that can be ordered or ranked, but precise differences between the ranks do not exist. Generally it does not make sense to do calculations with data at the ordinal level. Examples: Response to treatment (poor, fair, good); Severity of disease (mild, moderate, severe); Income status (low, middle, high).
  • 44. TYPES OF QUANTITATIVE MEASUREMENTS • Interval level of measurement: ranks data, and precise differences between units of measure exist, but there is no meaningful zero. If a zero exists, it is an arbitrary point. Example: IQ scores. It makes sense to talk about someone having an IQ 20 points higher than another person, but an IQ of zero has no meaning. • Ratio level of measurement: has all the characteristics of the interval level, but a true zero exists. Also, true ratios exist when the same variable is measured on two different members of the population. Example: weight of an individual. It makes sense to say that a 150 lb adult weighs twice as much as a 75 lb child.
  • 45. Figure 1 summarizes the possible data types and levels of measurement. [Figure 1: Data types and levels of measurement. Copyright © 2009 Pearson Education, Inc.]
  • 46. Data organization and presentation • Statistics is used to organize and interpret research observations and findings. • Before interpretation and communication of the findings, the raw data must be organized and presented in a clear and understandable way. Techniques are used to organize and summarize a set of data in a concise way: • Organization of data • Summarization of data • Presentation of data
  • 47. Cont... • Numbers that have not been summarized and organized are called raw data. • Descriptive statistics include tables, graphical/chart displays and the calculation of summary measures such as means, proportions, averages, etc. • The methods of describing variables differ depending on the type of data (numerical or categorical).
  • 48. Organizing data Categorical data • Table of frequency distributions • Frequency • Relative frequency • Cumulative frequencies • Graphs • Bar charts • Pie charts Continuous or discrete data • Frequency distribution tables • Summary measures • Graphs • Histograms • Frequency polygons • Cumulative frequency polygons • Stem-and-leaf plots • Box-and-whisker plots • Scatter plots
  • 49. Frequency distributions • A frequency distribution is a presentation of the number of times (or the frequency) that each value (or group of values) occurs in the study population. • Ordered array: A simple arrangement of individual observations in order of magnitude. • A simple and effective way of summarizing categorical data is to construct a frequency distribution table. • This is done by counting the number of observations falling into each of the categories, or levels, of the variable. • Consider, for example, the variable birth weight with levels 'Very low', 'Low', 'Normal' and 'Big'.
  • 50. Relative Frequency • Sometimes it is useful to compute the proportion, or percentages of observations in each category. • The distribution of proportions is called the relative frequency distribution of the variable. • Given a total number of observations, the relative frequency distribution is easily derived from the frequency distribution.
  • 51. Cumulative frequency • Two other distributions are useful for describing data, particularly ordinal data. • They tell nothing for nominal data. E.g. you would never say that 70% are below the color blue. • The cumulative frequency is the number of observations in the category plus the observations in all categories smaller than it. • The cumulative relative frequency is the proportion of observations in the category plus observations in all categories smaller than it, and is obtained by dividing the cumulative frequency by the total number of observations.
  • 52. Table 2. Distribution of birth weight of newborns between 1976-1996 at TAH.
      BWT      | Freq. | Rel. freq. (%) | Cum. freq. | Cum. rel. freq. (%)
      Very low | 43    | 0.4            | 43         | 0.4
      Low      | 793   | 8.0            | 836        | 8.4
      Normal   | 8870  | 88.9           | 9706       | 97.3
      Big      | 268   | 2.7            | 9974       | 100
      Total    | 9974  | 100            |            |
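As a rough sketch of how the relative and cumulative columns of Table 2 follow from the raw frequencies, the snippet below recomputes them. The function name `frequency_table` and the dictionary layout are illustrative choices, not part of the slides; only the four frequencies come from Table 2.

```python
# Frequencies from Table 2, listed in the natural order of the categories.
frequencies = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}

def frequency_table(freqs):
    """Return rows of (category, freq, rel. freq. %, cum. freq., cum. rel. freq. %)."""
    total = sum(freqs.values())
    rows = []
    running = 0  # cumulative frequency so far
    for category, count in freqs.items():
        running += count
        rows.append((
            category,
            count,
            round(100 * count / total, 1),    # relative frequency (%)
            running,                          # cumulative frequency
            round(100 * running / total, 1),  # cumulative relative frequency (%)
        ))
    return rows

table = frequency_table(frequencies)
for row in table:
    print(row)
```

The cumulative columns are only meaningful because the birth weight categories are ordered; for a nominal variable the order of the dictionary keys would be arbitrary.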
  • 53. Frequency distribution for numerical data • Beyond the ordered array, further useful summarization may be achieved by grouping the data. • To group a set of observations, we select a set of contiguous, non-overlapping intervals such that each value in the set of observations can be placed in one, and only one, of the intervals. • These intervals are usually referred to as class intervals.
  • 54. • One of the first considerations when data are to be grouped is how many intervals to include. • The question is how best we can organize such data. Imagine a huge data set which may not be manageable by eye.
  • 55. Table 3. Frequencies of serum cholesterol levels for 1067 US males of ages 25-34 (1976-1980).
      Cholesterol level (mg/100 ml) | Freq. | Rel. freq. (%) | Cum. freq. | Cum. rel. freq. (%)
      80-119                        | 13    | 1.2            | 13         | 1.2
      120-159                       | 150   | 14.1           | 163        | 15.3
      160-199                       | 442   | 41.4           | 605        | 56.7
      200-239                       | 299   | 28.0           | 904        | 84.7
      240-279                       | 115   | 10.8           | 1019       | 95.5
      280-319                       | 34    | 3.2            | 1053       | 98.7
      320-359                       | 9     | 0.8            | 1062       | 99.5
      360-399                       | 5     | 0.5            | 1067       | 100
      Total                         | 1067  | 100            |            |
  • 56. For both discrete and continuous data the values are grouped into non-overlapping intervals, usually of equal width.
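Grouping values into non-overlapping, equal-width class intervals can be sketched as below. The ages, the starting limit of 20 and the width of 10 are hypothetical values chosen for illustration, and `group_into_classes` is an illustrative helper rather than anything from the slides. Each value is placed in one, and only one, interval.

```python
def group_into_classes(values, lower, width, n_classes):
    """Count how many integer values fall in each equal-width class.

    Class i covers lower + i*width up to lower + (i+1)*width - 1,
    so the classes do not overlap and leave no gaps.
    """
    counts = [0] * n_classes
    for v in values:
        index = (v - lower) // width
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} does not fall in any class interval")
        counts[index] += 1
    return counts

# Hypothetical ages (years) of 12 patients.
ages = [23, 31, 45, 27, 38, 52, 41, 36, 29, 47, 33, 58]

counts = group_into_classes(ages, lower=20, width=10, n_classes=4)
for i, count in enumerate(counts):
    low = 20 + 10 * i
    print(f"{low}-{low + 9}: {count}")
```

Raising an error for values outside every class is one way to enforce that the intervals are exhaustive; widening the first or last class would be another.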
  • 57. Example of raw data of age
  • 58. Example of categorized data of age
  • 59. How to calculate class interval? To determine the number of class intervals and the corresponding width, we use Sturges' rule: $K = 1 + 3.322 \log_{10} n$ and $W = \dfrac{L - S}{K}$, where K = number of class intervals, n = number of observations, W = width of the class interval, L = the largest value, S = the smallest value.
  • 60. Example • Construct a grouped frequency distribution of the following data on the amount of time (in hours) that 80 college students devoted to leisure activities during a typical school week:
  • 62. The amount of time (in hours) that 80 college students devoted to leisure activities during a typical school week • Using the above formula, $K = 1 + 3.322 \times \log_{10}(80) = 7.32 \approx 7$ classes • Maximum value = 38 and minimum value = 10 • $W = \text{Range}/K = (38 - 10)/7 = 28/7 = 4$ • Using a width of 5 (a common rule of thumb), we can construct a grouped frequency distribution for the above data as:
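The Sturges' rule arithmetic above can be checked with a few lines of code. This is a minimal sketch: the function name `sturges` is an illustrative choice, and rounding K to the nearest whole number is an assumption that matches the worked example (7.32 becomes 7).

```python
import math

def sturges(n, largest, smallest):
    """Return (K before rounding, K rounded, class width W) by Sturges' rule."""
    k_exact = 1 + 3.322 * math.log10(n)
    k = round(k_exact)                # number of class intervals
    width = (largest - smallest) / k  # W = (L - S) / K
    return k_exact, k, width

# Values from the leisure-time example: n = 80, L = 38, S = 10.
k_exact, k, width = sturges(n=80, largest=38, smallest=10)
print(f"K = {k_exact:.2f}, rounded to {k} classes; W = {width:g}")
```

The computed width of 4 is only a starting point; the example then adopts a width of 5 as a more convenient value.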
  • 64. Mid-point and True-limits • Mid-point (class mark): the value of the interval which lies midway between the lower and the upper limits of a class. • True limits (class boundaries): those limits that make an interval of a continuous variable continuous in both directions. They are used for smoothing the class intervals. To obtain them, subtract 0.5 from the lower limit and add 0.5 to the upper limit.
  • 65. Contd… • Note. In the construction of cumulative frequency distribution, if we start the cumulation from the lowest size of the variable to the highest size, the resulting frequency distribution is called `Less than cumulative frequency distribution' and if the cumulation is from the highest to the lowest value the resulting frequency distribution is called `more than cumulative frequency distribution.' The most common cumulative frequency is the less than cumulative frequency 5/12/2023 65
  • 66. Example
      Time (hours)   True limits    Mid-point   Frequency
      10-14          9.5-14.5       12           8
      15-19          14.5-19.5      17          28
      20-24          19.5-24.5      22          27
      25-29          24.5-29.5      27          12
      30-34          29.5-34.5      32           4
      35-39          34.5-39.5      37           1
      Total                                     80
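The true limits, mid-points and both kinds of cumulative frequency can be derived mechanically from the class limits. This is a small sketch using the leisure-time classes; the variable names are illustrative, and the 0.5 adjustment assumes data recorded to the nearest whole hour.

```python
from itertools import accumulate

# Class limits and frequencies from the leisure-time example.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
freqs = [8, 28, 27, 12, 4, 1]

# True limits: subtract 0.5 from the lower limit, add 0.5 to the upper limit.
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in classes]

# Mid-point (class mark): midway between the lower and upper limits.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# "Less than" cumulation runs from the lowest class upward,
# "more than" cumulation runs from the highest class downward.
less_than_cum = list(accumulate(freqs))
more_than_cum = list(accumulate(reversed(freqs)))[::-1]

print(boundaries, midpoints, less_than_cum, more_than_cum, sep="\n")
```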
  • 67. • Class interval (width): the length of a class, given by the difference between its class boundaries; for the first class, 14.5 - 9.5 = 5. • Note: As the sample size increases and the class width is reduced, the sample distribution resembles the population distribution more closely
  • 68. • Class intervals should be continuous, non-overlapping, mutually exclusive and exhaustive • Too few intervals result in loss of information • Too many intervals defeat the objective of summarization • Class intervals should generally be of the same width (sometimes this is impossible) • Open-ended class intervals should be avoided
  • 69. Exercise • Construct a grouped frequency distribution and complete the following table for the Age of patients (years) in a diabetic clinic in Addis Ababa, 2010 5/12/2023 69
  • 70. Age of patients (years) in a diabetic clinic in Addis Ababa, 2010 Age group (Years) Class limit Class Boundary Class Mid Point Tally Fr. (fi) Relative Frequency , Fraction (%) Cumulative freq Relative Cum freq <Method >Method <Method >Method Total 5/12/2023 70
  • 71. METHOD OF DATA PRESENTATION 5/12/2023 71
  • 72. Data table Guidelines for constructing tables •Keep them simple •Limit the number of variables •All tables should be self-explanatory •Include clear title telling what, where and when •Clearly label the rows and columns 72 5/12/2023
  • 73. Contd… • State clearly the unit of measurement used • Explain codes and abbreviations in a footnote • Show totals • If the data are not original, indicate the source in a footnote
  • 74. Graphical presentation of data • Variety of graph styles can be used to present data. • The most commonly used types of graph are pie charts, bar diagrams, histograms, frequency polygon and scatter diagrams. • The purpose of using a graph is to tell others about a set of data quickly, allowing them to grasp the important characteristics of the data. • In other words, graphs are visual aids to rapid understanding. 74 5/12/2023
  • 75. Importance of graphs • Diagrams have greater attraction than mere figures. • They give delight to the eye, add a spark of interest and as such catch the attention • They help in deriving the required information in less time and without any mental strain. • They have great memorizing value than mere figures. • They facilitate comparison 5/12/2023 75
  • 76. Bar charts • Bar chart: Display the frequency distribution for nominal or ordinal data. • In a bar chart the various categories into which the observation fall are represented along horizontal axis and • A vertical bar is drawn above each category such that the height of the bar represents either the frequency or the relative frequency of observation within the class. • The vertical axis should always start from 0 but the horizontal can start from any where. • The bars should be of equal width and should be separated from one another so as not to imply continuity 76 5/12/2023
  • 77. Figure 1. Bar charts showing the frequency distribution of the variable ‘BWT’. [Two panels: frequency (0-6000) and relative frequency (0-100) by BWT category: Very low, Low, Normal, Big.]
  • 78. Bar charts for comparison • Multiple bar chart: In order to compare the distribution of a variable for two or more groups, bars are often drawn alongside each other for the groups being compared in a single bar chart. • Subdivided bar chart: If there are different quantities forming the subdivisions of the totals, simple bars may be subdivided in the ratio of the various subdivisions to exhibit the relationship of the parts to the whole.
  • 79. Fig 2. Bar chart indicating categories of birth weight of 9975 newborns grouped by antenatal follow-up of the mothers. [Multiple bar chart: x-axis BWT (Low, Normal, Big); y-axis percent (0-100); legend Yes, No; data labels 9, 88.9, 2.1 and 7.9, 89, 3.1.]
  • 80. Example: Plasmodium species distribution for confirmed malaria cases, Zeway, 2003 80 5/12/2023
  • 81. Pie chart Pie Chart: Displays the frequency distribution for nominal or ordinal data. • In a pie chart the various categories into which the observation fall are represented along sectors of a circle • Each sector represents either the frequency or the relative frequency of observation within the class the angles of which are proportional to frequency or the relative frequency. 81 5/12/2023
  • 82. Figure 3. Pie charts showing the frequency distribution of the variable ‘BWT’. Fig 3(a) Pie chart indicating the frequency of categories of birth weight: Very low 43, Low 793, Normal 8870, Big 268. Fig 3(b) Pie chart indicating the relative frequency of categories of birth weight: Very low 0.4, Low 8, Normal 88.9, Big 2.7.
  • 83. Histogram • A histogram is a graph of a frequency distribution with continuous class intervals. • Given a set of numerical data, we can obtain an impression of the shape of its distribution by constructing a histogram. • A histogram is constructed by choosing a set of non-overlapping intervals (class intervals) and counting the number of observations that fall in each class.
  • 84. Histograms cont… •The number of observations in each class is called the frequency. Hence histograms are also called frequency distributions •It is necessary that the class intervals be non-overlapping so that each observation falls in one and only one interval. 5/12/2023 84
  • 85. Histograms cont… • Except for the two boundaries, class intervals are usually chosen to be of equal width. If this is not the case, the histogram could give a misleading impression of the shape of the data • In drawing the histogram, smoothing of the class intervals is an important step: we subtract 0.5 from the lower limit and add 0.5 to the upper limit of each interval.
  • 86. Example: Distribution of the age of women at the time of marriage
      Age group   No. of women
      15-19       11
      20-24       36
      25-29       28
      30-34       13
      35-39        7
      40-44        3
      45-49        2
  • 87. Age of women at the time of marriage [Histogram: x-axis age group with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5; y-axis number of women (0-40).]
  • 88. Fig 5. A histogram displaying the frequency distribution of birth weight of newborns at Tikur Anbessa Hospital [x-axis: birth weight (800-5200); y-axis: frequency (0-2000); Std. Dev = 502.34, Mean = 3126, N = 9975.]
  • 89. Frequency polygons • Instead of drawing bars for each class interval, sometimes a single point is drawn at the mid point of each class interval and consecutive points joined by straight line. • Graphs drawn in this way are called frequency polygons . • Frequency polygons are superior to histograms for comparing two or more sets of data. 89 5/12/2023
  • 90. Fig. 6. Frequency polygon of birth weight of 9975 newborns at Tikur Anbessa Hospital for males and females [x-axis: birth weight (500-5000); y-axis: % (0-50); legend: sex (Males, Females).]
  • 91. Box and Whisker Plot It is another way to display information when the objective is to illustrate certain locations (skewness) in the distribution Can be used to display a set of discrete or continuous observations using a single vertical axis – only certain summaries of the data are shown 91 5/12/2023
  • 92. Box plot cont... A box is drawn with the top of the box at the third quartile (75%) and the bottom at the first quartile (25%). The location of the mid-point (50%) of the distribution is indicated with a horizontal line in the box. Finally, straight lines, or whiskers, are drawn from the centre of the top of the box to the largest observation and from the centre of the bottom of the box to the smallest observation. 92 5/12/2023
  • 93. Box cont.... The box plot is then completed: draw a vertical bar from the upper quartile to the largest non-outlying value in the sample, and a vertical bar from the lower quartile to the smallest non-outlying value in the sample. Values that lie outside the box (the IQR) but are not outliers are covered by the whiskers (IQR = P75 - P25).
  • 94. Box plots are useful for comparing two or more groups of observations 94 5/12/2023
  • 95. Drawing a box-and-whisker plot. Raw data: 35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58. Ordered data: 29, 34, 35, 39, 41, 44, 50, 54, 58, 64, 72, 104. Median = (44 + 50)/2 = 47 = Q2; Q1 = (35 + 39)/2 = 37; Q3 = (58 + 64)/2 = 61; Min = 29; Max = 104
  • 96. Box plot example [Box plot on an axis from 0 to 110: Min = 29, Q1 = 37, Q2 = 47, Q3 = 61, Max = 104.]
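The five numbers behind the box plot can be computed directly. This sketch assumes the quartiles are taken as the medians of the lower and upper halves of the ordered data, which reproduces the values on the slide; other quartile conventions exist and can give slightly different results. It also assumes at least two observations.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]      # observations below the median position
    upper = data[-half:]     # observations above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

raw = [35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58]
summary = five_number_summary(raw)
iqr = summary[3] - summary[1]    # interquartile range Q3 - Q1
print(summary, iqr)
```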
  • 97. Scatter plot Most studies in medicine involve measuring more than one characteristic, and graphs displaying the relationship between two characteristics are common in literature. When both the variables are qualitative then we can use a multiple bar graph. When one of the characteristics is qualitative and the other is quantitative, the data can be displayed in box and whisker plots 97 5/12/2023
  • 98. Scatter plot …. For two quantitative variables we use bivariate plots (also called scatter plots or scatter diagrams). They are used to see whether a relationship exists between the two measures. A scatter diagram is constructed by drawing X- and Y-axes. Each dot represents a pair of values measured for a single study subject. [Figure: scatter diagram showing a positive relation.]
  • 99. Scatter plot • Scatter plot helps us to understand the association between two variables using: 1. The trend 2. The shape and 3. The strength Measure of association • Identifying very strong and very weak association is easy by observing the graph, but how we can classify everything in between? 5/12/2023 99
  • 100. Summary of data presentation (insertion)
      Study variables                          Display method         Remarks
      Both variables qualitative               Bar graph
      One qualitative and one quantitative     Box and whisker plot   Used to see whether the data are skewed or not
      Both variables quantitative              Scatter plot           Used to see whether a relationship exists between the two measures
      Both variables quantitative              Line graph             Useful for assessing the trend of a particular situation over time, e.g. an epidemic
  • 101. Scatter plot • The linear correlation coefficient (R) measures the strength of the linear association between 2 variables. • R values always range from -1 to 1 • R close to 1 shows a strong positive linear association • R close to -1 shows a strong negative linear association • R close to 0 shows a weak or no linear association • Note: the interpretation of values in between is somewhat subjective
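The slides do not give a formula for R. The sketch below assumes R is the usual Pearson product-moment coefficient, and the data are hypothetical values chosen only to show the two extreme cases.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sum of cross-products over the root of the product of sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, not taken from the slides.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # y increases exactly linearly with x
falling = [10, 8, 6, 4, 2]    # y decreases exactly linearly with x

r_pos = pearson_r(x, rising)
r_neg = pearson_r(x, falling)
print(r_pos, r_neg)
```

With perfectly linear data the coefficient reaches the ends of its range; real data fall somewhere in between.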
  • 102. Scatter Plots and Types of Correlation: negative correlation (as x increases, y decreases). [Scatter plot: x = hours of training (0-20), y = number of accidents (0-60).]
  • 103. Scatter Plots and Types of Correlation: positive correlation (as x increases, y increases). [Scatter plot: x = Math SAT score (300-800), y = GPA (1.50-4.00).]
  • 104. Scatter Plots and Types of Correlation: no linear correlation. [Scatter plot: x = height (60-80), y = IQ (80-160).]
  • 105. Scatter Diagram… 1. Direction of relationship: positive or negative [two panels, Y against X]
  • 106. 2. Form of relationship: linear or curvilinear [two panels, Y against X]
  • 107. 3. Degree of relationship: strong or weak [two panels, Y against X]
  • 109. Line graph. Useful for assessing the trend of a particular situation over time, e.g. monitoring the trend of epidemics. The time, in weeks, months or years, is marked along the horizontal axis. Values of the quantity being studied are marked on the vertical axis. Values for each category are connected by a continuous line. Sometimes two or more lines are drawn on the same graph using the same scale so that the plotted lines are comparable.
  • 110. No. of microscopically confirmed malaria cases by species and month at Zeway malaria control unit, 2003 [Line graph: x-axis months (Jan-Dec); y-axis no. of confirmed malaria cases (0-2100); lines: Positive, P. falciparum, P. vivax.]
  • 111. Line graph cont.. The following graph shows the level of zidovudine (AZT) in the blood of HIV/AIDS patients at several times after administration of the drug, for patients with normal fat absorption and with fat malabsorption. A line graph can also be used to depict the relationship between two continuous variables, like a scatter diagram.
  • 112. Line graph cont….. Response to administration of zidovudine in two groups of AIDS patients in hospital X, 1999 [Line graph: x-axis time since administration (min., 10-360); y-axis blood zidovudine concentration (0-8); lines: fat malabsorption, normal fat absorption.]
  • 113. Choosing graphs
      Type of data / purpose   Appropriate graphs
      Metric/Numerical         Histogram (one continuous variable)
                               Frequency polygon (one or more continuous variables)
                               Cumulative frequency polygon (ogive curve)
                               Box and whisker (one continuous and one categorical variable)
                               Stem and leaf (one continuous variable)
                               Scatter (two continuous variables)
      Categorical              Bar (one or more categorical variables; simple or multiple)
                               Pie (one categorical variable)
      Trend                    Line (one continuous and one categorical variable, or two continuous variables)
  • 115. Summary Measures: Describing Data Numerically. Central tendency: arithmetic mean, median, mode, geometric mean. Quartiles. Variation: range, interquartile range, variance, standard deviation, coefficient of variation. Shape: skewness.
  • 116. MEASURES OF CENTRAL TENDENCY • The tendency of statistical data to get concentrated at certain values is called the “Central Tendency or average” • Mean • Median • Mode 5/12/2023
  • 117. The Arithmetic Mean or simple Mean • The mean is the average of the numbers: add up all the numbers, then divide by how many numbers there are. • It is written in statistical terms as $\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$
  • 118. Insertion. Weighted mean: $\bar{X} = \frac{x_1 w_1 + x_2 w_2 + \dots + x_n w_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i}$, where x = variable of interest and w = weighting factor. Mean of grouped data: $\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$, where $m_i$ is the mid-point and $f_i$ the frequency of class i, and k is the number of classes.
  • 119. • Example 1: What is the Mean of these numbers? 6, 11, 7 • Add the numbers: 6 + 11 + 7 = 24 • Divide by how many numbers (there are 3 numbers): 24 / 3 = 8 • The Mean is 8 Why Does This Work? • It is because 6, 11 and 7 added together is the same as 3 lots of 8: • It is like you are "flattening out" the numbers. 5/12/2023
  • 120. Example 2 Birth weights(gm) of all live born infant born at a private hospital in a city, during a 1- week period. What is the arithmetic mean for the sample birth weights? 5/12/2023
  • 121. Weighted Mean • When averaging quantities, it is often necessary to account for the fact that not all of them are equally important in the phenomenon being described. • In order to give the quantities being averaged their proper degree of importance, it is necessary to assign them relative importance called weights, and then calculate a weighted mean.
  • 122. • The weighted mean of a set of numbers X1, X2, … and Xn, whose relative importance is expressed numerically by a corresponding set of numbers w1, w2, … and wn, is given by $\bar{X} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$
  • 123. • Example: In a given drug shop four different drugs were sold for unit prices of 60, 85, 95 and 50 birr, and the total numbers of drugs sold were 10, 10, 5 and 20 respectively. What is the average price of the four drugs in this drug shop? • Solution: for this example we have to use the weighted mean, using the number of drugs sold as the respective weight for each drug's price. Weighted mean = (60×10 + 85×10 + 95×5 + 50×20)/(10 + 10 + 5 + 20) = 2925/45 = 65 birr • If we don't consider the weights, the average price will be (60 + 85 + 95 + 50)/4 = 72.5 birr
  • 124. Weighted Mean • We can also calculate a weighted mean using some weighting factor, e.g. what is the average income of all people in cities A, B, and C?
      City   Avg. income   Population
      A      $23,000       100,000
      B      $20,000        50,000
      C      $25,000       150,000
      Here, population is the weighting factor and the average income is the variable of interest: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  • 125. Self insertion. Note • Here we first have to identify the variable of interest and the weighting factor. In this case income is the variable of interest and population is the weighting factor. $\bar{X} = \frac{\sum X_i W_i}{\sum W_i}$ = (23,000 × 100,000 + 20,000 × 50,000 + 25,000 × 150,000)/(100,000 + 50,000 + 150,000) = 7,050,000,000/300,000 = 23,500
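Both weighted-mean examples (the drug shop and the three cities) follow the same pattern, which can be sketched as one small function. The function name is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# Drug shop: unit prices (birr) weighted by the number of drugs sold.
prices = [60, 85, 95, 50]
sold = [10, 10, 5, 20]
drug_avg = weighted_mean(prices, sold)
simple_avg = sum(prices) / len(prices)     # ignores the weights

# Cities A, B, C: average income weighted by population.
city_avg = weighted_mean([23000, 20000, 25000], [100000, 50000, 150000])

print(drug_avg, simple_avg, city_avg)
```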
  • 126. Geometric Mean • The Geometric Mean is a special type of average where we multiply the numbers together and then take a square root (for two numbers), cube root (for three numbers) etc. Example: What is the Geometric Mean of 2 and 18? • First we multiply them: 2 × 18 = 36 • Then (as there are two numbers) take the square root: √36 = 6 • Geometric Mean of 2 and 18 = √(2 × 18) = 6 • It is like the area is the same! 5/12/2023
  • 127. Self insertion. Method for calculating the geometric mean: there are two methods. Method A • Step 1. Take the logarithm of each value. • Step 2. Calculate the mean of the log values by summing the log values, then dividing by the number of observations. • Step 3. Take the antilog of the mean of the log values to get the geometric mean • $GM = 10^{\left(\frac{\sum_{i=1}^{n} \log_{10} x_i}{n}\right)}$
  • 128. Method B • Step 1. Calculate the product of the values by multiplying all of the values together. • Step 2. Take the nth root of the product (where n is the number of observations) to get the geometric mean. $GM = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n}$, where • GM = geometric mean • n = number of observations • $\sqrt[n]{\ }$ = nth root
  • 129. Example: What is the Geometric Mean of 10, 51.2 and 8? • First we multiply them: 10 × 51.2 × 8 = 4096 • Then (as there are three numbers) take the cube root: $\sqrt[3]{4096} = 16$ • For n numbers: multiply them all together and then take the nth root (written $\sqrt[n]{\ }$) • Geometric Mean = $\sqrt[3]{10 \times 51.2 \times 8} = 16$ • It is like the volume is the same:
  • 130. Estimating the Mean from Grouped Data. Someone timed 21 people in the race, to the nearest second:
      Seconds   Frequency
      51-55     2
      56-60     7
      61-65     8
      66-70     4
      • The groups (51-55, 56-60, etc.), also called class intervals, are of width 5 • The midpoints are in the middle of each class: 53, 58, 63 and 68
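Methods A and B should agree up to floating-point rounding. A minimal sketch of both, with hypothetical function names, applied to the two worked examples:

```python
import math

def geometric_mean_logs(values):
    """Method A: antilog (base 10) of the mean of the log10 values."""
    mean_log = sum(math.log10(x) for x in values) / len(values)
    return 10 ** mean_log

def geometric_mean_root(values):
    """Method B: nth root of the product of the n values."""
    return math.prod(values) ** (1 / len(values))

gm_pair_a = geometric_mean_logs([2, 18])
gm_pair_b = geometric_mean_root([2, 18])
gm_triple_a = geometric_mean_logs([10, 51.2, 8])
gm_triple_b = geometric_mean_root([10, 51.2, 8])
print(gm_pair_a, gm_pair_b, gm_triple_a, gm_triple_b)
```

Method A is usually the safer choice for long series, because the running product in Method B can overflow or underflow.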
  • 131. Cntd… We can estimate the Mean by using the midpoints So, how does this work? Think about the 7 runners in the group 56 - 60: all we know is that they ran somewhere between 56 and 60 seconds: •Maybe all seven of them did 56 seconds, •Maybe all seven of them did 60 seconds, •But it is more likely that there is a spread of numbers: some at 56, some at 57, etc So we take an average and assume that all seven of them took 58 seconds. 5/12/2023
  • 132. Cntd… • Our thinking is: "2 people took 53 sec, 7 people took 58 sec, 8 people took 63 sec and 4 took 68 sec". In other words we imagine the data looks like this: • 53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68 • Then we add them all up and divide by 21. The quick way to do it is to multiply each midpoint by its frequency: 2×53 + 7×58 + 8×63 + 4×68 = 1288 • And then our estimate of the mean time to complete the race is: • Estimated Mean = $\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i} = \frac{1288}{21} = 61.333…$
  • 133. Correct mean • If a wrong figure has been used when calculating the mean, the correct mean can be obtained without repeating the whole process using: Corrected mean = (n × wrong mean - wrong value + correct value)/n • Example: The average weight of 10 patients was calculated to be 65 kg. Later it was discovered that one weight had been misread as 40 kg instead of 80 kg. Calculate the correct average weight. • Solution: Corrected mean = (10 × 65 - 40 + 80)/10 = 690/10 = 69 kg
  • 134. The effect of transforming the original series on the mean. • If a constant k is added to (or subtracted from) every observation, then the new mean will be the old mean plus (or minus) k. • If every observation is multiplied by a constant k, then the new mean will be k × old mean
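The "quick way" of multiplying each midpoint by its frequency can be written out as follows. This is a sketch with illustrative variable names.

```python
# Midpoints and frequencies of the race-time classes 51-55, 56-60, 61-65, 66-70.
midpoints = [53, 58, 63, 68]
freqs = [2, 7, 8, 4]

weighted_total = sum(m * f for m, f in zip(midpoints, freqs))   # sum of m_i * f_i
n = sum(freqs)                                                  # sum of f_i
estimated_mean = weighted_total / n

print(weighted_total, n, round(estimated_mean, 3))
```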
  • 135. Characteristics of mean • The value of the arithmetic mean is determined by every item in the series. • It is greatly affected by extreme values. Advantages • It is based on all values given in the distribution. • It is most easily understood. • It is most amenable to algebraic treatment. 5/12/2023
  • 136. Disadvantages • It may be greatly affected by extreme items and its usefulness as a “Summary of the whole” may be considerably reduced. • When the distribution has open-ended classes, its computation would be based on assumption, and therefore may not be valid.
  • 137. Median • Suppose there are n observations in a sample. If these observations are ordered from smallest to largest, then the sample median is: • the ((n + 1)/2)th observation if n is odd • the average of the (n/2)th and (n/2 + 1)th observations if n is even
  • 138. Example 2 2.1. Compute the sample median for the birth weight data in example 1. 2.2. Consider the following data, which consist of white blood counts taken on admission of all patients entering a small hospital on a given day. Compute the median white-blood count (×10³). 7, 35, 5, 9, 8, 3, 10, 12, 8
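For the white-blood-count data in 2.2, the median can be checked by ordering the values and picking the middle one. The sketch below uses Python's standard `statistics.median`, which averages the two middle values when n is even.

```python
from statistics import median

# White blood counts (x 10^3) from exercise 2.2.
wbc = [7, 35, 5, 9, 8, 3, 10, 12, 8]

ordered = sorted(wbc)
n = len(ordered)
middle_position = (n + 1) // 2     # n is odd here, so the median is the 5th value
wbc_median = median(wbc)

print(ordered, middle_position, wbc_median)
```

The extreme value 35 does not move the median, although it pulls the arithmetic mean upward.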
  • 139. Estimating the Median from Grouped Data • Let's look at our data again: The median is the middle value, which in our case is the 11th one, which is in the 61 - 65 group: We can say "the median group is 61 - 65" 5/12/2023
  • 140. Cntd… • We call it "61 - 65", but it really includes values from 60.5 up to (but not including) 65.5. • Why? The values are in whole seconds, so a real time of 60.5 is measured as 61. Likewise a real time of 65.4 is measured as 65. • At 60.5 we already have 9 runners, and by the next boundary at 65.5 we have 17 runners. By drawing a straight line in between we can pick out where the median frequency of n/2 runners is:
  • 141.
      Seconds   Frequency   Cum. frequency
      51-55     2            2
      56-60     7            9
      61-65     8           17
      66-70     4           21
  • 142. Cntd.. Estimated Median = $L + \frac{(n/2) - cf}{f_{mg}} \times w$, where • L is the lower class boundary of the group containing the median • n is the total number of values • cf is the cumulative frequency of the groups before the median group • fmg is the frequency of the median group • w is the group width • For our example: L = 60.5, n = 21, cf = 2 + 7 = 9, fmg = 8, w = 5 • Estimated Median = $60.5 + \frac{(21/2) - 9}{8} \times 5 = 61.4375$
  • 143. i) Characteristics of Median • It is an average of position/location. • It is affected more by the number of items than by extreme values. ii) Advantages • It is easily calculated and is not much disturbed by extreme values • It is more typical of the series • The median may be located even when the data are incomplete, e.g. when the class intervals are irregular and the final classes have open ends.
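The interpolation formula for the grouped median translates directly into code. This is a sketch with hypothetical parameter names matching the symbols L, n, cf, fmg and w.

```python
def grouped_median(lower_boundary, n, cum_before, freq_median_group, width):
    """Estimated median = L + ((n/2 - cf) / fmg) * w."""
    return lower_boundary + ((n / 2 - cum_before) / freq_median_group) * width

# Race-time example: median group 61-65, lower boundary 60.5.
estimated_median = grouped_median(60.5, 21, 9, 8, 5)
print(estimated_median)
```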
  • 144. iii) Disadvantages • it is determined mainly by the middle points in a sample and is less sensitive to the actual numerical values of the remaining data points. • It is not so generally familiar as the arithmetic mean 5/12/2023
  • 145. Mode • It is the value of the observation that occurs with the greatest frequency. • A particular disadvantage is that, with a small number of observations, there may be no mode. • In addition, sometimes, there may be more than one mode such as when dealing with a bimodal (two-peak) distribution. • Find the modal values for the following data a) 22, 66, 69, 70, 73. (No modal value) b) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg) 5/12/2023
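The two cases on this slide (no mode, one mode) and the bimodal case can be handled by counting occurrences. In this sketch, treating a data set in which every value occurs once as having no mode is a convention taken from the slide; the function name is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the list of modal values, or an empty list when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                      # every value occurs once: no mode
    return [v for v, c in counts.items() if c == top]

no_mode = modes([22, 66, 69, 70, 73])
one_mode = modes([1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5])
two_modes = modes([1, 1, 2, 2, 3])     # hypothetical bimodal data
print(no_mode, one_mode, two_modes)
```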
  • 146. Estimating the Mode from Grouped Data • We can easily find the modal group (the group with the highest frequency), which is 61 - 65 • We can say "the modal group is 61 - 65" • Estimated Mode = $L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$
  • 147. Cntd… • where: • L is the lower class boundary of the modal group • fm-1 is the frequency of the group before the modal group • fm is the frequency of the modal group • fm+1 is the frequency of the group after the modal group • w is the group width • In this example: L = 60.5, fm-1 = 7, fm = 8, fm+1 = 4, w = 5 • Estimated Mode = $60.5 + \frac{8 - 7}{(8 - 7) + (8 - 4)} \times 5 = 60.5 + \frac{1}{5} \times 5 = 61.5$
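The grouped-mode formula can be sketched the same way as the grouped median. The parameter names are hypothetical stand-ins for L, fm-1, fm, fm+1 and w.

```python
def grouped_mode(lower_boundary, f_before, f_modal, f_after, width):
    """Estimated mode = L + (fm - fm-1) / ((fm - fm-1) + (fm - fm+1)) * w."""
    d_before = f_modal - f_before
    d_after = f_modal - f_after
    return lower_boundary + d_before / (d_before + d_after) * width

# Race-time example: modal group 61-65 with neighbouring frequencies 7 and 4.
estimated_mode = grouped_mode(60.5, 7, 8, 4, 5)
print(estimated_mode)
```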
  • 148. Mode Characteristics • It is an average of position • It is not affected by extreme values • It is the most typical value of the distribution Advantages • Since it is the most typical value it is the most descriptive average • Since the mode is usually an “actual value”, it indicates the precise value of an important part of the series. 5/12/2023
  • 149. Disadvantages:- • Unless the number of items is fairly large and the distribution reveals a distinct central tendency, the mode has no significance • It is not capable of mathematical treatment • In a small number of items the mode may not exist. 5/12/2023
  • 150. Skewness: • If extremely low or extremely high observations are present in a distribution, then the mean tends to shift towards those scores. Based on the type of skewness, distributions can be: • Negatively skewed distribution: occurs when majority of scores are at the right end of the curve and a few small scores are scattered at the left end. • Positively skewed distribution: Occurs when the majority of scores are at the left end of the curve and a few extreme large scores are scattered at the right end. • Symmetrical distribution: It is neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half. 5/12/2023
  • 151. Skewness… • Data can be "skewed", meaning it tends to have a long tail on one side or the other: • Negative Skew? • Why is it called negative skew? Because the long "tail" is on the negative side of the peak. • The mean is also on the left of the peak. 5/12/2023
  • 152. Skewness… The Normal Distribution has No Skew A Normal Distribution is not skewed. It is perfectly symmetrical. And the Mean is exactly at the peak. 5/12/2023
  • 153. Skewness… Positive Skew And positive skew is when the long tail is on the positive side of the peak, and some people say it is "skewed to the right". The mean is on the right of the peak value. 5/12/2023
  • 155. Measures of Dispersion • Which of the distributions of scores has the larger dispersion? [Figure: two bar charts of scores 1-10, vertical axis 0-125.] The upper distribution has more dispersion because the scores are more spread out
  • 156. Measures of Dispersion • How “spread out” are the numbers about the centre? • Consider the following data sets:
      Set 1: 60 40 30 50 60 40 70 (mean = 50)
      Set 2: 50 49 49 51 48 50 53 (mean = 50)
      • The two data sets given above have the same mean of 50, but set 1 is obviously more “spread out” than set 2. How do we express this numerically? • Some of the commonly used measures of dispersion (variation) are: range, interquartile range, quartiles, percentiles, variance, standard deviation and coefficient of variation.
  • 157. Range and Interquartile Range • Range • Simplest and crudest measure of variation • Difference between the largest and the smallest observations: Range = Xlargest - Xsmallest • Ignores the way in which data are distributed • It wastes information, for it takes no account of the entire data • Sensitive to outliers • Interquartile Range • Eliminates some high- and low-valued observations and calculates the range from the remaining values • Interquartile range = 3rd quartile - 1st quartile = Q3 - Q1
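To put numbers on the two data sets from the dispersion slide (Set 1 and Set 2, both with mean 50), the range and interquartile range can be computed as below. The quartile rule used here, medians of the lower and upper halves, is an assumption; the slides do not fix a quartile formula and other conventions give slightly different values.

```python
from statistics import median

def value_range(values):
    """Range = largest observation - smallest observation."""
    return max(values) - min(values)

def interquartile_range(values):
    """IQR = Q3 - Q1, with quartiles taken as medians of the upper and lower halves."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[-half:]) - median(data[:half])

set_1 = [60, 40, 30, 50, 60, 40, 70]
set_2 = [50, 49, 49, 51, 48, 50, 53]

print(value_range(set_1), interquartile_range(set_1))
print(value_range(set_2), interquartile_range(set_2))
```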
  • 158. Quartiles and Percentiles • The quartiles divide the distribution into four equal parts. • Deciles: If data is ordered and divided into 10 parts, then cut points are called Deciles • Percentiles: If data is ordered and divided into 100 parts, then cut points are called Percentiles 5/12/2023
  • 159. Quartiles • The 25th percentile is often referred to as the first quartile and denoted Q1. • The 50th percentile (the median) is referred to as the second or middle quartile and written Q2’ and • the 75th percentile is referred to as the third quartile, Q3. When we wish to find the quartiles for a set of data, the following formulas are used 5/12/2023
  • 160. Using the Five-Number Summary to Explore the Shape • Box-and-Whisker Plot: a graphical display of data using the 5-number summary: Minimum, Q1, Median, Q3, Maximum • The box and central line are centered between the endpoints if the data are symmetric around the median [Figure: box-and-whisker plot marking Min, Q1, Median, Q3, Max.]
  • 161. Distribution Shape and Box-and-Whisker Plot [Figure: three panels (Right-Skewed, Left-Skewed, Symmetric), each with a box-and-whisker plot marking Q1, Q2, Q3.]
  • 162. Standard Deviation and Variance • They show the scatter of the individual measurements around the mean of all the measurements in a given distribution. • The variance represents squared units and, therefore, is not an appropriate measure of dispersion when we wish to express this concept in terms of the original units. • To obtain a measure of dispersion in original units, we merely take the square root of the variance. The result is called the standard deviation. • Variance is the average of the squared differences from the mean • Standard deviation is the square root of the variance
  • 163. Variance and Standard Deviation. Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$. Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$. In both cases, SD = $\sqrt{\text{variance}}$.
  • 164. To calculate standard deviation
      1. Calculate the mean $\bar{x}$
      2. Calculate the residual for each x: $x - \bar{x}$
      3. Square the residuals: $(x - \bar{x})^2$
      4. Calculate the sum of the squares: $\sum (x - \bar{x})^2$
      5. Divide the sum in Step 4 by (n - 1): $\frac{\sum (x - \bar{x})^2}{n - 1}$
      6. Take the square root of the quantity in Step 5: $\sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
  • 165. Example: Find the Standard Deviation of Ungrouped Data
      Family No.   1  2  3  4  5  6  7  8  9  10
      Size (xi)    3  3  4  4  5  5  6  6  7  7
  • 166.
      Family No.   xi    xi - x̄   (xi - x̄)²
      1             3    -2        4
      2             3    -2        4
      3             4    -1        1
      4             4    -1        1
      5             5     0        0
      6             5     0        0
      7             6     1        1
      8             6     1        1
      9             7     2        4
      10            7     2        4
      Total        50     0       20
      Here, $\bar{x} = \frac{\sum x_i}{n} = \frac{50}{10} = 5$, $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{20}{9} \approx 2.22$, and $s = \sqrt{2.22} \approx 1.49$
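The six steps for the sample standard deviation can be followed literally in code, using the family-size data. This is a sketch; the variable names are illustrative.

```python
import math

sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]     # family sizes for families 1-10
n = len(sizes)

mean = sum(sizes) / n                       # Step 1: mean
residuals = [x - mean for x in sizes]       # Step 2: residual for each x
squares = [r ** 2 for r in residuals]       # Step 3: squared residuals
sum_sq = sum(squares)                       # Step 4: sum of squares
variance = sum_sq / (n - 1)                 # Step 5: divide by (n - 1)
sd = math.sqrt(variance)                    # Step 6: square root

print(mean, sum(residuals), sum_sq, round(variance, 2), round(sd, 2))
```

Note that the residuals always sum to zero, which is why they are squared before being added.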
  • 167. Example • The lengths of five newborn babies are: 600mm, 470mm, 170mm, 430mm and 300mm. • Find the Mean, the Variance, and the Standard Deviation. • Your first step is to find the Mean: • Answer: • Mean = (600 + 470 + 170 + 430 + 300)/5 = 1970/5 = 394 • so the mean (average) length is 394 mm. • Let's plot this on the chart:
  • 169. To calculate the Variance, take each difference from the mean, square it, and then average the result: $\sigma^2 = \frac{206^2 + 76^2 + (-224)^2 + 36^2 + (-94)^2}{5} = \frac{108{,}520}{5} = 21{,}704$. Standard Deviation $\sigma = \sqrt{21{,}704} = 147.32... \approx 147$ (to the nearest mm)
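This example divides by the number of values (5), so it is the population form of the variance rather than the sample form with n - 1. A short sketch using the standard library's population functions:

```python
from statistics import mean, pstdev, pvariance

lengths = [600, 470, 170, 430, 300]     # lengths in mm

avg = mean(lengths)
differences = [x - avg for x in lengths]
pop_variance = pvariance(lengths)       # average of the squared differences (divide by N)
pop_sd = pstdev(lengths)                # square root of the population variance

print(avg, differences, pop_variance, round(pop_sd, 2))
```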
  • 171. • And the good thing about the Standard Deviation is that it is useful. Now we can show which lengths are within one Standard Deviation (147mm) of the Mean: • So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra long or extra short. 5/12/2023
  • 172. Why square the differences? • If we just add up the differences from the mean, the negatives cancel the positives: (4 + 4 - 4 - 4)/4 = 0. So that won't work. • How about we use absolute values? (|7| + |1| + |-6| + |-2|)/4 = 16/4 = 4, the same result that differences of 4, 4, -4 and -4 would give, even though this set is more spread out. • But if we square the differences, average them and take the square root: $\sqrt{(7^2 + 1^2 + 6^2 + 2^2)/4} = \sqrt{90/4} = 4.74...$
  • 173. Coefficient of Variation • Measures relative variation • Always in percentage (%) • Shows variation relative to the mean • Can be used to compare two or more sets of data measured in different units • Computed from the SD and the mean (dividing SD by mean): $CV = \left(\frac{SD}{\bar{X}}\right) \times 100\%$
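The argument for squaring can be checked numerically by comparing the mean absolute difference with the root of the mean squared difference for two sets of differences. The helper names are illustrative.

```python
import math

def mean_absolute(differences):
    """Average of the absolute differences."""
    return sum(abs(d) for d in differences) / len(differences)

def root_mean_square(differences):
    """Square root of the average squared difference."""
    return math.sqrt(sum(d * d for d in differences) / len(differences))

even_spread = [4, 4, -4, -4]
uneven_spread = [7, 1, -6, -2]

print(mean_absolute(even_spread), mean_absolute(uneven_spread))
print(root_mean_square(even_spread), root_mean_square(uneven_spread))
```

The absolute-value measure cannot tell the two sets apart, while the squared version gives the more spread-out set a larger value.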
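As an illustration, the coefficient of variation can be computed for the family-size data used in the standard deviation example. Using the sample standard deviation (n - 1) here is an assumption; the slide only says "SD".

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV (%) = sample standard deviation / mean * 100."""
    return stdev(values) / mean(values) * 100

sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]     # family sizes from the SD example
cv = coefficient_of_variation(sizes)
print(round(cv, 1))
```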
  • 174. Basic principles of probability, rules and their applications
  • 175. Probability • Probability is the language of chance. • The deliberate use of chance is the central idea of statistical designs for producing data. • Probability is as important for working with data as maps are for a journey. • Probabilities are used in everyday communication. • Probability theory grew out of attempts to solve problems related to games of chance, such as tossing a coin or rolling a die, i.e. attempts to quantify personal beliefs regarding degrees of uncertainty.
  • 176. Questions on Simple Probabilities 1. What is the probability that a card drawn at random from a deck of cards will be an ace? 4/52 = 1/13 ≈ 0.0769. 2. A book contains 32 pages numbered 1, 2, ..., 32. If a student randomly opens the book, what is the probability that the page number contains the digit 1? The pages are 1, 10 to 19, 21 and 31 (13 pages), therefore 13/32 ≈ 0.406. 3. A mother is in the delivery room and the health worker has informed her that she will deliver at 9:30 pm. She is eager to give birth to a daughter. What is the probability that she will get what she wants? 1/2 = 0.5.
  • 177. Chance • When a meteorologist states that the chance of rain is 50%, the meteorologist is saying that it is equally likely to rain or not to rain. If the chance of rain rises to 80%, it is more likely to rain. If the chance drops to 20%, then it may rain, but it probably will not. • In each of these examples, "chance" expresses how likely some event is to occur.
  • 178. Basic terms • Experiment: any activity from which a result can be obtained. Examples: 1. flipping a coin 2. rolling a die 3. drawing 30 individuals from the population • Sample space: the set of possible outcomes of the experiment. Examples: 1. coin toss {H, T} 2. rolling a die {1, 2, 3, 4, 5, 6} • Event: a collection of outcomes
  • 179. • The Sample Space is all possible outcomes. • A Sample Point is just one possible outcome. • And an Event can be one or more of the possible outcomes. 5/12/2023
  • 180. Properties of probability 1. A probability ranges from 0 to 1 (0% to 100%). 2. In general, if two events are not mutually exclusive (not disjoint), the probability that one or both occur is P(A ∪ B) = P(A) + P(B) − P(A ∩ B). 3. If two events are mutually exclusive (disjoint), then P(A ∪ B) = P(A) + P(B) and P(A and B) = P(A ∩ B) = 0. 4. If two events are independent, then P(A ∩ B) = P(A) · P(B), P(A|B) = P(A) and P(B|A) = P(B).
  • 181. Unions and Intersections • Unions of Two Events: "If A and B are events, then the union of A and B, denoted by A∪B, represents the event composed of all basic outcomes in A or B." • Intersections of Two Events: "If A and B are events, then the intersection of A and B, denoted by A∩B, represents the event composed of all basic outcomes in A and B." [Figure: Venn diagram of events A and B]
  • 182. Addition rules • Rule 1: If 2 events, B & C, are mutually exclusive (i.e., no overlap), then the probability that one or the other occurs is P(B or C) = P(B ∪ C) = P(B) + P(C) • Rule 2: If two events are mutually exclusive and together cover the whole sample space (complementary events), the sum of their probabilities is equal to one. The converse does not hold: two events whose probabilities sum to one need not be mutually exclusive. • Rule 3: For any 2 events, A & B, not mutually exclusive, the probability that one or both occur is P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
  • 183. • Example 1: One die is rolled. Sample space S = {1, 2, 3, 4, 5, 6}. Let A = the event an odd number turns up, A = {1, 3, 5}. Let B = the event a 1, 2 or 3 turns up, B = {1, 2, 3}. Let C = the event a 2 turns up, C = {2}. I) Find Pr(A), Pr(B) and Pr(C). • Pr(A) = Pr(1) + Pr(3) + Pr(5) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2 • Pr(B) = Pr(1) + Pr(2) + Pr(3) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2 • Pr(C) = Pr(2) = 1/6 II) Are A and B; A and C; B and C mutually exclusive? • A and B are not mutually exclusive, because they have the elements 1 and 3 in common. • Similarly, B and C are not mutually exclusive: they have the element 2 in common. • A and C are mutually exclusive: they have no element in common.
  • 184. The Addition . . . If two events A and B are not mutually exclusive, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Example 1. There are 80 nurses and 40 physicians in a hospital. Of these, 70 nurses and 15 physicians are females. If a staff person is selected at random, find the probability that the subject is a nurse or male. Note: "or" means union, "and" means intersection.
                   Male   Female   Total
      Nurse         10      70       80
      Physician     25      15       40
      Total         35      85      120
      P(N ∪ M) = P(N) + P(M) − P(N ∩ M) = 80/120 + 35/120 − 10/120 = 105/120 = 0.875
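The addition rule for the hospital example can be checked against a direct count of the staff who are nurses or male. This is a minimal sketch; the dictionary layout and variable names are assumptions made for illustration.

```python
from fractions import Fraction

# Staff counts from the hospital example, keyed by (profession, sex)
counts = {
    ("nurse", "male"): 10,
    ("nurse", "female"): 70,
    ("physician", "male"): 25,
    ("physician", "female"): 15,
}
total = sum(counts.values())  # 120 staff

p_nurse = Fraction(sum(v for (job, _), v in counts.items() if job == "nurse"), total)
p_male = Fraction(sum(v for (_, sex), v in counts.items() if sex == "male"), total)
p_nurse_and_male = Fraction(counts[("nurse", "male")], total)

# Addition rule: P(N or M) = P(N) + P(M) - P(N and M)
p_union = p_nurse + p_male - p_nurse_and_male

# Direct count of everyone who is a nurse or male, for comparison
p_direct = Fraction(
    sum(v for (job, sex), v in counts.items() if job == "nurse" or sex == "male"),
    total,
)

print(p_union, float(p_union))
```

Subtracting P(N and M) removes the 10 male nurses who would otherwise be counted twice.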
  • 185. Summary of the Additive Rule 5/12/2023
  • 186. Conditional probabilities and the multiplicative law • Let's assume two questions on a test: the first is a true/false question and the second is a multiple-choice question with five possible answers (a, b, c, d, e). • True or False: The heart is an organ which pumps blood in our body. • MCQ: Which of the following human organs is used for breathing? a. Brain b. Liver c. Lung d. Kidney e. Heart • If the answers are random guesses, the 2 × 5 = 10 possible outcomes are equally likely, so each outcome has probability 1/10.
  • 187. • A tree diagram is a picture of the possible outcomes of a procedure 5/12/2023
  • 189. Multiplicative Rule • When two events are said to be independent of each other, this means that the occurrence of one event in no way affects the probability of the other event occurring. • For any two events A and B with non-zero probability to be independent, each of the following must be true: • P(A|B) = P(A) and P(B|A) = P(B); and so P(A and B) = P(A) P(B)
  • 190. • Eg. 1) A classic example is n tosses of a coin and the chances that on each toss it lands heads. These are independent events. The chance of heads on any one toss is independent of the number of previous heads. No matter how many heads have already been observed, the chance of heads on the next toss is ½. • Eg 2) a similar situation prevails with the sex of offspring. The chance of a male is approximately ½. Regardless of the sexes of previous offspring, the chance the next child is a male is still ½. 5/12/2023
  • 191. • Sometimes the chance that a particular event happens depends on the outcome of some other event. This applies obviously to many events that are spread out in time. • Eg. The chance that a patient with some disease survives the next year depends on his having survived to the present time. Such probabilities are called conditional. • The notation is Pr(B|A), which is read as "the probability that event B occurs given that event A has already occurred." • Let A and B be two events of a sample space S. The conditional probability of an event A, given B, is Pr(A|B) = P(A ∩ B) / P(B), P(B) ≠ 0.
  • 192. • Similarly, P(B|A) = P(A ∩ B) / P(A), P(A) ≠ 0. This can be taken as an alternative form of the multiplicative law. • For non-independent events A and B: • P(A and B) = P(A|B) P(B) or P(A and B) = P(B|A) P(A) • Eg. Suppose in country X the chance that an infant lives to age 25 is 0.95, whereas the chance that he lives to age 65 is 0.65. For the latter, it is understood that to survive to age 65 means to survive both from birth to age 25 and from age 25 to 65. What is the chance that a person 25 years of age survives to age 65?
  • 193.
      Notation   Event                                            Probability
      A          Survive birth to age 25                          0.95
      A and B    Survive both birth to age 25 and age 25 to 65    0.65
      B|A        Survive age 25 to 65 given survival to age 25    ?
      Then, Pr(B|A) = Pr(A ∩ B) / Pr(A) = 0.65/0.95 = 0.684. That is, a person aged 25 has a 68.4 percent chance of living to age 65.
  • 194. Example 1) Consider selecting a child at random from a kindergarten (KG); let A = the event that the child is infected with ascariasis and G = the event that the child has giardiasis. Suppose P(A) = 0.30, P(G) = 0.25 and P(A ∩ G) = 0.13. a) What is the probability that a child randomly selected from the KG has giardiasis, given that we know s/he has ascariasis? Answer: P(G|A) = P(A ∩ G)/P(A) = 0.13/0.30 = 0.43, so the probability of a child having giardiasis given that s/he already has ascariasis is 43%. b) What is the probability that a child randomly selected from the KG will test negative for both of these intestinal parasites? Answer: P(neither) = 1 − P(A ∪ G) = 1 − [P(A) + P(G) − P(A ∩ G)] = 1 − (0.30 + 0.25 − 0.13) = 1 − 0.42 = 0.58. 2) Of 200 senior students at a certain college, 98 are women, 34 are majoring in Biology, and 20 Biology majors are women. If one student is chosen at random from the senior class, what is the probability that the choice will be either a Biology major or a woman? Given n = 200: male Biology majors = 14 (p = 0.07), female Biology majors = 20 (p = 0.10), other females = 78 (p = 0.39), other males = 88 (p = 0.44). P(B ∪ W) = P(B) + P(W) − P(B ∩ W) = 0.17 + 0.49 − 0.10 = 0.56
  • 195. Exercise: Calculating the probability of an event. Table 1 shows the frequency of cocaine use by gender among adult cocaine users.
      Lifetime frequency of cocaine use    Male   Female   Total
      1-19 times                            32      7       39
      20-99 times                           18     20       38
      more than 100 times                   25      9       34
      Total                                 75     36      111
  • 196. Questions 1. What is the probability that a randomly picked person is male? 2. What is the probability that a randomly picked person has used cocaine more than 100 times? 3. Given that the selected person is male, what is the probability that he has used cocaine more than 100 times? 4. Given that the person has used cocaine less than 100 times, what is the probability of being female? 5. What is the probability that a randomly picked person is male and has used cocaine more than 100 times?
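One way to work through the five questions is to read the marginal, joint and conditional probabilities straight from the table. The sketch below is one possible solution, not an official answer key; the nested dictionary and the names `q1` to `q5` are assumptions made for illustration.

```python
from fractions import Fraction

# Cocaine-use table: lifetime frequency category -> counts by gender
table = {
    "1-19": {"male": 32, "female": 7},
    "20-99": {"male": 18, "female": 20},
    "100+": {"male": 25, "female": 9},
}
total = sum(sum(row.values()) for row in table.values())   # 111
males = sum(row["male"] for row in table.values())         # 75

# 1. P(male): marginal probability
q1 = Fraction(males, total)
# 2. P(more than 100 times): marginal probability
q2 = Fraction(sum(table["100+"].values()), total)
# 3. P(more than 100 times | male): restrict to the 75 males
q3 = Fraction(table["100+"]["male"], males)
# 4. P(female | less than 100 times): restrict to the first two rows
under_100 = [table["1-19"], table["20-99"]]
q4 = Fraction(sum(r["female"] for r in under_100),
              sum(sum(r.values()) for r in under_100))
# 5. P(male and more than 100 times): joint probability
q5 = Fraction(table["100+"]["male"], total)

for name, value in [("Q1", q1), ("Q2", q2), ("Q3", q3), ("Q4", q4), ("Q5", q5)]:
    print(name, value, round(float(value), 3))
```

Questions 3 and 4 are conditional probabilities, so the denominator shrinks to the conditioning group; question 5 is a joint probability and satisfies q5 = q3 × q1, the multiplicative rule.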
  • 197. Summary for the Multiplicative Rule 5/12/2023
  • 198. Probability as a Numerical Measure of the Likelihood of Occurrence [Figure: probability scale from 0 to 1, with likelihood of occurrence increasing from left to right; at probability 0.5 the occurrence of the event is just as likely as it is unlikely.]
  • 199. Permutations: The number of possible permutations is the number of different orders in which particular events occur. The number of possible permutations is $N_p = \frac{n!}{(n-r)!}$, where r is the number of events in the series, n is the number of possible events, and n! denotes the factorial of n = the product of all the positive integers from 1 to n.
  • 200. Combinations: When the order in which the events occurred is of no interest, we are dealing with combinations. The number of possible combinations is $N_c = \binom{n}{r} = \frac{n!}{r!(n-r)!}$, where r is the number of events in the series, n is the number of possible events, and n! denotes the factorial of n = the product of all the positive integers from 1 to n.
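The two counting formulas can be compared on a small case. The values n = 5 and r = 3 below are hypothetical, chosen only for illustration; `math.perm` and `math.comb` are available in Python 3.8 and later.

```python
import math

n, r = 5, 3  # hypothetical: choose 3 events out of 5 possible events

# Permutations: order matters, n! / (n - r)!
n_permutations = math.perm(n, r)
# Combinations: order does not matter, n! / (r! (n - r)!)
n_combinations = math.comb(n, r)

# The same results written out with factorials
by_formula_p = math.factorial(n) // math.factorial(n - r)
by_formula_c = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(n_permutations, n_combinations)
```

Each combination of r events can be ordered in r! ways, which is why the number of permutations equals the number of combinations multiplied by r!.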
  • 201. Bayes' Theorem •Bayes' Theorem shows the relationship between a conditional probability and its inverse. i.e. it allows us to make an inference from the probability of a hypothesis given the evidence to the probability of that evidence given the hypothesis and vice versa
  • 202. Bayes' Theorem • $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$ • P(A), the PRIOR PROBABILITY, represents your knowledge about A before you have gathered data. • e.g. if 0.01 of a population has schizophrenia, then the probability that a person drawn at random would have schizophrenia is 0.01
  • 203. Bayes' Theorem • $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$ • P(B|A), the CONDITIONAL PROBABILITY, is the probability of B, given A. • e.g. you are trying to roll a total of 8 on two dice. What is the probability that you achieve this, given that the first die rolled a 6? (The second die must show a 2, so the probability is 1/6.)
  • 204. Bayes' Theorem • $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$ • So the theorem says: • The probability of A given B is equal to the probability of B given A, times the prior probability of A, divided by the prior probability of B.
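To see the theorem turn one conditional probability into its inverse, the kindergarten figures from the earlier conditional-probability example (P(A) = 0.30 for ascariasis, P(G) = 0.25 for giardiasis, P(A ∩ G) = 0.13) can be reused. This is only a sketch, and the variable names are assumptions.

```python
from fractions import Fraction

# Probabilities from the kindergarten example
p_a = Fraction(30, 100)        # P(A): ascariasis
p_g = Fraction(25, 100)        # P(G): giardiasis
p_a_and_g = Fraction(13, 100)  # P(A and G)

# Conditional probability of giardiasis given ascariasis
p_g_given_a = p_a_and_g / p_a

# Bayes' theorem: P(A|G) = P(G|A) P(A) / P(G)
p_a_given_g = p_g_given_a * p_a / p_g

print(f"P(G|A) = {float(p_g_given_a):.3f}")
print(f"P(A|G) = {float(p_a_given_g):.3f}")
```

The two conditional probabilities differ because the prior probabilities P(A) and P(G) differ.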
  • 205. 5/12/2023 Probability distribution • Every random variable has a corresponding probability distribution. • A probability distribution applies the theory of probability to describe the behavior of the random variable. • The term probability distribution or just distribution refers to the way data are distributed, in order to draw conclusions about a set of data. • A probability distribution of a random variable can be displayed by a table or a graph or a mathematical formula. • With categorical variables, we obtain the frequency distribution of each variable. • With numeric variables, the aim is to determine whether or not normality may be assumed.
  • 206. I. Probability distribution of a categorical variable • The probability distribution of a categorical variable tells us with what probability the variable will take on the different possible values. • That is, it specifies all possible outcomes of the categorical variable along with the probability that each will occur. E.g. Consider the value on the face showing up from tossing a die. The probability distribution of this variable is
      Value on face   1     2     3     4     5     6
      Probability     1/6   1/6   1/6   1/6   1/6   1/6
      • Notice that the total probability is 1.
  • 207. Bernoulli Distribution • A Bernoulli trial is a random experiment with only two possible outcomes, with probabilities p and q, where p + q = 1. • The outcome of the experiment is either success (i.e., 1) or failure (i.e., 0). • Pr(X=1) = p, Pr(X=0) = 1 − p • E[X] = p, Var(X) = p(1 − p)
  • 208. Binomial distribution • In general the binomial distribution involves three assumptions: • There is a fixed number n of trials, each of which results in one of two mutually exclusive outcomes. • The outcomes of the n trials are independent. • The probability of "success" is constant for each trial. • Pr(X = success) = Pr(X=1) = p • Pr(X = failure) = Pr(X=0) = 1 − p • $P(k) = \binom{n}{k} p^k (1-p)^{n-k}$
  • 209. The binomial distribution: A process that has only two possible outcomes is called a binomial process. In statistics, the two outcomes are frequently denoted as success and failure. A binomial variable is a sum of independent and identically distributed Bernoulli trials. The binomial distribution gives the probability of exactly k successes in n trials: $P(k) = \binom{n}{k} p^k (1-p)^{n-k}$
  • 210. Binomial distribution…. • In addition to the probabilities of individual outcomes, we can also compute the numerical summary measures associated with a probability distribution. • The mean of a binomial distribution, i.e. the average number of successes in repeated samples of size n, is $\mu = np$, and the variance is $V = npq$, where q = 1 − p. • Example 1: In a sample of 1000 people from the US population there are 290 smokers (p = 0.29). The mean and standard deviation of the number of smokers are: • Mean = n × p = 1000 × 0.29 = 290 • SD = √(1000 × 0.29 × 0.71) = √205.9 ≈ 14.3
  • 211. Binomial distribution…. Example 2: Suppose that in a certain population 52% of all recorded births are males. If we randomly select 10 birth records, what is the probability that • exactly 5 will be males? Given n = 10, x = 5, p = 0.52: $\Pr(X = x) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}$, so $\Pr(X = 5) = \frac{10!}{5!\,(10-5)!} \times 0.52^5 \times (1-0.52)^{10-5} \approx 0.24$ • 3 or more will be females? Let Y be the number of females, with p = 1 − 0.52 = 0.48: Pr(Y ≥ 3) = 1 − Pr(Y < 3) = 1 − [Pr(Y=0) + Pr(Y=1) + Pr(Y=2)] = 1 − [0.001 + 0.013 + 0.055] ≈ 1 − 0.070 = 0.930
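The binomial probabilities in the birth-records example can be computed directly from the formula. The function name `binomial_pmf` is a hypothetical helper written for this sketch.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 10
p_male = 0.52
p_female = 1 - p_male

# Exactly 5 males among 10 birth records
p_five_males = binomial_pmf(5, n, p_male)

# 3 or more females: complement of 0, 1 or 2 females
p_three_or_more_females = 1 - sum(binomial_pmf(k, n, p_female) for k in range(3))

print(f"P(exactly 5 males)     = {p_five_males:.4f}")
print(f"P(3 or more females)   = {p_three_or_more_females:.4f}")
```

Using the complement for "3 or more" needs only three terms instead of eight.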
  • 212. Random variable and Probability distributions • A random variable is a variable that has a single numerical value, determined by chance, for each outcome of a procedure. • A discrete random variable has either a finite number of values or a countable number of values. Eg. The number of eggs that a hen lays in a day (possible values are 0, 1, or 2). • A continuous random variable has infinitely many values, and those values can be associated with measurements on a continuous scale in such a way that there are no gaps or interruptions. Eg. Voltage of electricity
  • 213. Every probability distribution must satisfy each of the following two requirements •Since the values of a probability distribution are probabilities, they must be numbers in the interval from 0 to 1. •Since a random variable has to take on one of its values, the sum of all the values of a probability distribution must be equal to 1. 5/12/2023
  • 214. Random Variable • A Random Variable is a set of possible values from a random experiment • Example: Tossing a coin: we could get Heads or Tails. • Let's give them the values Heads=0 and Tails=1 and we have a Random Variable "X":
      Random variable   Possible values   Random events
      X                 0                 H
                        1                 T
  • 215. • So: • We have an experiment (like tossing a coin) • We give values to each event • The set of values is a Random Variable 5/12/2023
  • 216. • Eg. Toss a coin 3 times. Let X be the number of heads obtained. Find the probability distribution of X. f(x_i) = Pr(X = x_i), x_i = 0, 1, 2, 3. • Pr(X = 0) = 1/8: TTT • Pr(X = 1) = 3/8: HTT, THT, TTH • Pr(X = 2) = 3/8: HHT, THH, HTH • Pr(X = 3) = 1/8: HHH • Probability distribution of X:
      X = x_i       0     1     2     3
      Pr(X = x_i)   1/8   3/8   3/8   1/8
      • The required conditions are also satisfied: i) f(x_i) ≥ 0 ii) Σ f(x_i) = 1
  • 217. The birth of a son or a daughter are mutually exclusive events because the two events will not happen at the same time. The birth of a daughter and the birth of carrier of the sickle-cell anemia allele are not mutually exclusive because the two events can happen at the same time (they are independent events). 5/12/2023
  • 218. Example: Sex Ratio in a Family of 3 • Assume that the probability of a boy = 1/2 and the probability of a girl = 1/2. i. How many possible sex distributions are there for a family of three children? ii. What is the probability of occurrence of each event? iii. What is the chance of 2 boys AND 1 girl?
      Child #1   Child #2   Child #3
      B          B          B
      B          B          G
      B          G          B
      B          G          G
      G          B          B
      G          B          G
      G          G          B
      G          G          G
  • 219. • Solution: i. 8 possibilities ii. The probability of each event is 1/8 (1/2 × 1/2 × 1/2). iii. There are 3 ways to have 2 boys AND 1 girl: BBG, BGB, and GBB. • Thus, the chance is 1/8 + 1/8 + 1/8 = 3/8.
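The sample space for a family of three can also be enumerated programmatically. This is a minimal sketch that assumes, as the example does, that boys and girls are equally likely and that births are independent.

```python
from fractions import Fraction
from itertools import product

# All ordered sex sequences for three children
families = list(product("BG", repeat=3))

# Each sequence has probability 1/2 * 1/2 * 1/2
p_each = Fraction(1, 2) ** 3

# Sequences with exactly 2 boys (and therefore 1 girl)
two_boys_one_girl = [f for f in families if f.count("B") == 2]
p_two_boys_one_girl = len(two_boys_one_girl) * p_each

print(len(families), p_each)
print(["".join(f) for f in two_boys_one_girl], p_two_boys_one_girl)
```

Because the three favourable sequences are mutually exclusive, their probabilities are simply added.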
  • 220. The expected value of a discrete random variable: The expected value, denoted by E(X) or μ, represents the "average" value of the random variable. It is obtained by multiplying each possible value by its respective probability and summing over all the values that have positive probability. Definition: The expected value of a discrete random variable is defined as $E(X) = \mu = \sum_{i=1}^{n} x_i\, P(X = x_i)$
  • 221. where the x_i's are the values the random variable assumes with positive probability. Example: Consider the random variable representing the number of episodes of diarrhoea in the first 2 years of life. Suppose this random variable has the probability mass function below:
      r          0      1      2      3      4      5      6
      P(X = r)   .129   .264   .271   .185   .095   .039   .017
      What is the expected number of episodes of diarrhoea in the first 2 years of life? E(X) = 0(.129) + 1(.264) + 2(.271) + 3(.185) + 4(.095) + 5(.039) + 6(.017) = 2.038. Thus, on average a child would be expected to have 2 episodes of diarrhoea in the first 2 years of life.
  • 222. The variance of a discrete random variable: The variance represents the spread of all values that have positive probability relative to the expected value. In particular, the variance is obtained by multiplying the squared distance of each possible value from the expected value by its respective probability and summing over all the values that have positive probability. Definition: The variance of a discrete random variable X, denoted by V(X), is defined by $V(X) = \sigma^2 = \sum_{i=1}^{k} (x_i - \mu)^2\, P(X = x_i) = \sum_{i=1}^{k} x_i^2\, P(X = x_i) - \mu^2$, where the x_i's are the values for which the random variable takes on positive probability. The SD of a random variable X, denoted by SD(X) or σ, is defined as the square root of its variance.
  • 223. Example: Compute the variance and SD for the random variable representing the number of episodes of diarrhoea in the first 2 years of life. E(X) = μ = 2.038 ≈ 2.04. $\sum x_i^2\, P(X = x_i)$ = 0²(.129) + 1²(.264) + 2²(.271) + 3²(.185) + 4²(.095) + 5²(.039) + 6²(.017) = 6.12. Thus, V(X) = 6.12 − (2.038)² ≈ 1.967 and the SD of X is $\sigma = \sqrt{1.967} \approx 1.402$
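The expected value, variance and SD of the diarrhoea-episodes variable can be reproduced from its probability mass function. This is a sketch only; the dictionary `pmf` and the variable names are illustrative assumptions.

```python
import math

# Probability mass function: number of episodes -> probability
pmf = {0: 0.129, 1: 0.264, 2: 0.271, 3: 0.185, 4: 0.095, 5: 0.039, 6: 0.017}

# A valid distribution must sum to 1
total_probability = sum(pmf.values())

# E(X): each value weighted by its probability
mean = sum(x * p for x, p in pmf.items())

# E(X^2), then V(X) = E(X^2) - mu^2
mean_of_squares = sum(x ** 2 * p for x, p in pmf.items())
variance = mean_of_squares - mean ** 2
sd = math.sqrt(variance)

print(f"E(X) = {mean:.3f}, E(X^2) = {mean_of_squares:.2f}")
print(f"V(X) = {variance:.3f}, SD = {sd:.3f}")
```

The shortcut form E(X²) − μ² avoids computing each squared deviation separately.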
  • 224. Binomial distribution, generally: Note the general pattern emerging: if you have only two possible outcomes (call them 1/0 or yes/no or success/failure) in n independent trials, then the probability of exactly X "successes" is $\binom{n}{X} p^X (1-p)^{n-X}$, where n = number of trials, X = number of successes out of n trials, p = probability of success, and 1 − p = probability of failure.
  • 225. Exercise 1. Each child born to a particular set of parents has a probability of 0.25 of having blood type O. If these parents have 5 children, what is the probability that a. Exactly two of them have blood type O? = 10 × 0.25² × 0.75³ ≈ 0.2637 b. At most 2 have blood type O? = 0.2373 + 0.3955 + 0.2637 ≈ 0.8965 c. At least 4 have blood type O? = 0.0146 + 0.0010 ≈ 0.0156 d. Exactly 2 do not have blood type O (i.e. exactly 3 have it)? = 10 × 0.25³ × 0.75² ≈ 0.0879
  • 226. Exercise…. 2. Suppose past experience in a certain malarious area indicates that the probability that a person with a high fever will be positive for malaria is 0.7. Consider 3 randomly selected patients (with high fever) in that same area. a) What is the probability that no patient will be positive for malaria? = 0.027 b) What is the probability that exactly one patient will be positive for malaria? = 0.189 c) What is the probability that exactly two of the patients will be positive for malaria? = 0.441 d) What is the probability that all patients will be positive for malaria? = 0.343
  • 227. The Poisson distribution: When the probability of "success" is very small (and the number of trials is large), e.g., the probability of a mutation, then $p^k$ and $(1-p)^{n-k}$ become too small to calculate exactly by the binomial distribution. In such cases, the Poisson distribution becomes useful. Let λ be the expected number of successes in a process consisting of n trials, i.e., λ = np. The probability of observing k successes is $P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$. The mean and variance of a Poisson distributed variable are given by μ = λ and V = λ, respectively.
  • 228. 5/12/2023 Plots of Poisson Distribution
  • 229. The Poisson distribution… • Example 3. Suppose X is a random variable representing the number of individuals involved in a road accident each year (in the US, 2.4 are involved per 10,000 population each year). • I.e. λ = 2.4 per 10,000 • $\Pr(X=0) = \frac{e^{-2.4}\, 2.4^0}{0!} = 0.091$ • $\Pr(X=1) = \frac{e^{-2.4}\, 2.4^1}{1!} = 0.218$ • $\Pr(X=2) = \frac{e^{-2.4}\, 2.4^2}{2!} = 0.261$
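The Poisson probabilities for the road-accident example follow directly from the formula. The function name `poisson_pmf` is a hypothetical helper written for this sketch.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 2.4  # expected number involved per 10,000 population per year

probabilities = {k: poisson_pmf(k, lam) for k in range(3)}
for k, p in probabilities.items():
    print(f"P(X = {k}) = {p:.3f}")

# Probability of 3 or more, via the complement
p_three_or_more = 1 - sum(probabilities.values())
print(f"P(X >= 3) = {p_three_or_more:.3f}")
```

The complement is used for "3 or more" because the Poisson distribution has no upper limit on k.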
  • 230. 5/12/2023 II. Probability distribution of Numeric variables 1. Probability distribution of a discrete variable •Let X be a discrete random variable, such as number of new AIDS cases reported during one year period, number of children in a family •To construct the probability distribution for X we list each of the values x the variable assumes and its associated probability (relative frequency).
  • 231. Characteristics of a distribution • Features commonly used to describe a distribution are location, dispersion, modality and skewness. • Location tells us something about the average value of the variable. • Dispersion tells us something about how spread out the values of the variable are. • Modality refers to the number of peaks in the distribution. • Skewness refers to whether or not the distribution is symmetric. • A distribution is said to be symmetric if its left and right halves are mirror images of each other about its centre.
  • 232. 2. Probability distribution of continuous variables • Under different circumstances, the outcome of a random variable may not be limited to categories or counts. • E.g. Suppose X represents the continuous variable 'Height'; rarely is an individual exactly 170 cm tall. • X can assume an infinite number of intermediate values: 170.1, 170.2, 170.3, etc. • Because a continuous random variable X can take on an uncountably infinite number of values, the probability associated with any one particular value is zero.
  • 233. 5/12/2023 Continuous Random Variables • A smooth curve describes the probability distribution of a continuous random variable. •The depth or density of the probability, which varies with x, may be described by a mathematical formula f (x ), called the probability distribution or probability density function for the random variable x.
  • 234. Properties of Continuous Probability Distributions • The area under the curve is equal to 1. • P(a ≤ x ≤ b) = area under the curve between a and b. • There is no probability attached to any single value of x. That is, P(x = a) = 0.
  • 235. 5/12/2023 Continuous Probability Distributions • There are many different types of continuous random variables • We try to pick a model that • Fits the data well • Allows us to make the best possible inferences using the data. • One important continuous random variable is the normal random variable.
  • 236. The Normal (Gaussian) Distribution • The normal distribution is used extensively in the analysis of continuous variables and has an especially important role in statistics. • It has been found to be a good approximation for many distributions that arise in practice. • The normal distribution is unimodal and symmetric. • The normal distribution is completely described by two parameters, referred to as the mean μ (read as 'mu') and the standard deviation σ (read 'sigma'). • The mean μ can be any number (negative, positive or zero). • The standard deviation σ must be a positive number. • The mean μ defines the location of the distribution and the SD (standard deviation) σ defines the dispersion of the distribution about the mean.
  • 237. The Normal Distribution • The formula that generates the normal probability distribution is: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ for $-\infty < x < \infty$, where e ≈ 2.7183, π ≈ 3.1416, and μ and σ are the population mean and standard deviation. • The shape and location of the normal curve change as the mean and standard deviation change.
  • 238. How the normal curve changes when its parameters change [Figure: normal curves on the X scale, centred at μ, and on the standardized scale (X − μ)/σ, centred at 0]
  • 239. Same location (μ) but different σ (SD): σ = 1, σ = 2, σ = 3 [Figure]. Biostatistics course by Girma Taye (PhD), AAU. Empirical rule: 68% of the x values lie within 1σ of the mean; 95% of the x values lie within 2σ of the mean; 99.7% of the x values lie within 3σ of the mean.
  • 240. Same σ but different location (mean): μ = 0, μ = 1, μ = 2 [Figure]
  • 241. The standard normal distribution • Since a normal distribution could have any of an infinite number of possible values for its mean and SD, it is impossible to tabulate the area associated with each and every normal curve. • Instead, only a single curve, for which μ = 0 and σ = 1, is tabulated. • This curve is called the standard normal distribution (SND).
  • 242. The Standard Normal Distribution • To find P(a < x < b), we need to find the area under the appropriate normal curve. • To simplify the tabulation of these areas, we standardize each value of x by expressing it as a z-score, the number of standard deviations σ it lies from the mean μ: $z = \frac{x - \mu}{\sigma}$
  • 243. The Standard Normal (z) Distribution • Mean = 0; Standard deviation = 1 • When x = μ, z = 0 • Symmetric about z = 0 • Values of z to the left of center are negative • Values of z to the right of center are positive • Total area under the curve is 1.
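  • 244. Using the normal table: In a cumulative normal table, the four-digit probability in a particular row and column gives the area under the z curve to the left of that particular value of z. Table 1 as reproduced at the end of these slides lists the area between 0 and z instead, so for a positive z add 0.5: the area to the left of z = 1.36 is 0.5 + 0.4131 = 0.9131.
  • 245. Example: Use the normal table to calculate these probabilities: P(z ≤ 1.36) = .9131; P(z > 1.36) = 1 − .9131 = .0869; P(−1.20 ≤ z ≤ 1.36) = .9131 − .1151 = .7980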
  • 246. Example: The weights of packages of ground beef are normally distributed with mean 1 pound and standard deviation 0.10 pound. What is the probability that a randomly selected package weighs between 0.80 and 0.85 pounds? $P(0.80 < x < 0.85) = P(-2 < z < -1.5) = 0.0668 - 0.0228 = 0.0440$
  • 247. Example: What is the weight of a package such that only 1% of all packages exceed this weight? $P(x > ?) = 0.01$, so $P\left(z > \frac{? - 1}{0.1}\right) = 0.01$. From the normal table, $\frac{? - 1}{0.1} = 2.33$, so $? = 1 + 2.33(0.1) = 1.233$ pounds.
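The two ground-beef calculations can be checked without a printed table by using the standard library's `statistics.NormalDist` (Python 3.8 and later). This is a sketch for verification only; the variable names are assumptions.

```python
from statistics import NormalDist

# Package weights: mean 1 pound, standard deviation 0.10 pound
weights = NormalDist(mu=1.0, sigma=0.10)

# z-scores of the two limits
z_low = (0.80 - weights.mean) / weights.stdev
z_high = (0.85 - weights.mean) / weights.stdev

# P(0.80 < x < 0.85) as a difference of cumulative areas
p_between = weights.cdf(0.85) - weights.cdf(0.80)

# Weight exceeded by only 1% of packages (the 99th percentile)
cutoff = weights.inv_cdf(0.99)

print(f"z from {z_low:.2f} to {z_high:.2f}, P = {p_between:.4f}")
print(f"1% of packages weigh more than {cutoff:.3f} pounds")
```

The software uses an unrounded z value (about 2.326) for the 99th percentile, so the cut-off differs very slightly from a table-based answer that rounds z to 2.33.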
  • 248. Approximating the Binomial: Make sure to include the entire rectangle for the values of x in the interval of interest, i.e. extend the interval by 0.5 at each end. This is called the continuity correction. Standardize the (corrected) values of x using $z = \frac{x - np}{\sqrt{npq}}$. Make sure that np and nq are both greater than 5 to avoid inaccurate approximations!
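The continuity correction can be compared with the exact binomial probability. The sketch below reuses n = 1000 and p = 0.29 from the earlier smokers example; the cut-off of 300 smokers is a hypothetical value chosen only for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 1000, 0.29   # from the smokers example
q = 1 - p
x = 300             # hypothetical cut-off: P(X <= 300)

# Rule of thumb before using the approximation
rule_ok = n * p > 5 and n * q > 5

# Exact binomial probability P(X <= x)
exact = sum(comb(n, k) * p ** k * q ** (n - k) for k in range(x + 1))

# Normal approximation: include the whole rectangle for x, so use x + 0.5
z = (x + 0.5 - n * p) / sqrt(n * p * q)
approx = NormalDist().cdf(z)

print(f"rule of thumb satisfied: {rule_ok}")
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

For an upper-tail probability such as P(X ≥ x), the correction goes the other way and x − 0.5 is standardized instead.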
  • 249. Exercise: Data collected on systolic blood pressure in normal healthy individuals are normally distributed with μ = 120 and σ = 10 mm Hg. 1) What proportion of normal healthy individuals have a systolic blood pressure above 130 mm Hg? z = 1, so P = 0.5 − 0.3413 = 0.1587 2) What proportion of normal healthy individuals have a systolic blood pressure between 100 and 140 mm Hg? z = ±2, so P = 2 × 0.4772 = 0.9544 3) What level of systolic blood pressure cuts off the lower 95% of normal healthy individuals? z ≈ 1.645, so x = 120 + 1.645 × 10 ≈ 136.5 mm Hg
  • 250. Empirical rule: For any normal distribution, about 68% of the observations are contained within one SD of the mean, about 95% within two SDs, and about 99.7% (almost all) within three SDs of the mean. [Figure 3: Percentage of area under a normal distribution with mean μ and standard deviation σ; horizontal axis marked at μ−3σ, μ−2σ, μ−σ, μ, μ+σ, μ+2σ, μ+3σ]
  • 251. Exercises • Find the probability of the following under the SND: • Above 1.96? P(z > 1.96) = 0.5 − 0.4750 = 0.025 • Below −1.96? P(z < −1.96) = 0.5 − 0.4750 = 0.025 • Between −1.28 and 1.28? P(−1.28 < z < 1.28) = 2 × 0.3997 = 0.7994 • Between −1.65 and 1.08? 0.4505 + 0.3599 = 0.8104 • What level cuts off the upper 25%? The area between 0 and z must be 0.25, so z ≈ 0.67 • What level cuts off the middle 99%? 1 − 0.99 = 0.01 and 0.01/2 = 0.005 in each tail, so z ≈ ±2.58
  • 252. Table 1: Normal distribution (area between 0 and z)
      z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
      0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
      0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
      0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
      0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
      0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
      0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
      0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
      0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
      0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
      0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
      1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
      1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
      1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
      1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
      1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
      1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
      1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
      1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
      1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
      1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
      2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
      2.1   0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
      2.2   0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
      2.3   0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
  • 253. Table 2: Student's t-distribution (t table with right-tail probabilities p)
      df \ p   0.40      0.25      0.10      0.05      0.025     0.01      0.005     0.0005
      1        0.324920  1.000000  3.077684  6.313752  12.70620  31.82052  63.65674  636.6192
      2        0.288675  0.816497  1.885618  2.919986  4.30265   6.96456   9.92484   31.5991
      3        0.276671  0.764892  1.637744  2.353363  3.18245   4.54070   5.84091   12.9240
      4        0.270722  0.740697  1.533206  2.131847  2.77645   3.74695   4.60409   8.6103
      5        0.267181  0.726687  1.475884  2.015048  2.57058   3.36493   4.03214   6.8688
      6        0.264835  0.717558  1.439756  1.943180  2.44691   3.14267   3.70743   5.9588
      7        0.263167  0.711142  1.414924  1.894579  2.36462   2.99795   3.49948   5.4079
      8        0.261921  0.706387  1.396815  1.859548  2.30600   2.89646   3.35539   5.0413
      9        0.260955  0.702722  1.383029  1.833113  2.26216   2.82144   3.24984   4.7809
      10       0.260185  0.699812  1.372184  1.812461  2.22814   2.76377   3.16927   4.5869