COM 201_Inferential Statistics_18032022.pptx

Introduction to Inferential Statistics
Sampling, Probability, Normal distribution, and
Hypothesis Testing
by
Adekunle Fakunle (PhD)

Outline
• Definitions of terms
• Probability
• Random Variables.
• Probability Distributions
• Binomial and Normal distribution
• Inferential statistics
• Types of inferential statistics

Important Definitions
• Probability – the chance that an uncertain
event will occur (always between 0 and 1)
• Impossible Event – an event that has no
chance of occurring (probability = 0)
• Certain Event – an event that is sure to
occur (probability = 1)

Probability
•Outcome: any possible result of a probability event
•Favorable outcome: a successful result in a probability event, e.g. rolling a 1 on a die
•Possible outcomes: all the results that could occur during a probability event
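The three terms above combine into the classical rule: probability = number of favorable outcomes / number of possible outcomes. The sketch below assumes every outcome is equally likely; the function name `probability` is made up for illustration.

```python
from fractions import Fraction

def probability(favorable, possible):
    """Classical probability: favorable outcomes divided by possible outcomes.

    Assumes every possible outcome is equally likely.
    """
    favorable, possible = set(favorable), set(possible)
    if not favorable <= possible:
        raise ValueError("favorable outcomes must be among the possible outcomes")
    return Fraction(len(favorable), len(possible))

die_faces = {1, 2, 3, 4, 5, 6}        # all possible outcomes of one roll
p_one = probability({1}, die_faces)   # favorable outcome: rolling a 1
print(p_one, float(p_one))
```

An empty set of favorable outcomes gives 0 (an impossible event) and the full set gives 1 (a certain event).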

Random Variables
• A random variable is a numerical description of the outcome of a statistical experiment
• A random variable that may assume only a finite number or an infinite sequence of values is said to be discrete; one that may assume any value in some interval on the real number line is said to be continuous

The Sample Space, S
The sample space, S, for a random phenomenon is
the set of all possible outcomes.

Examples
1. Tossing a coin – outcomes S ={Head, Tail}
2. Rolling a die – outcomes S = {1, 2, 3, 4, 5, 6}

An Event, E
The event, E, is any subset of the sample space, S, i.e.
any set of outcomes (not necessarily all outcomes) of the
random phenomenon.
[Figure: Venn diagram showing E as a region inside S]

The event, E, is said to have occurred if after the outcome
has been observed the outcome lies in E.
[Figure: Venn diagram of E inside S]

Examples
1. Rolling a die – outcomes S = {1, 2, 3, 4, 5, 6}
E = the event that an even number is rolled
= {2, 4, 6}
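The die example can be written with sets, which makes "E is a subset of S" and "E has occurred" concrete. This is only a sketch: the helper name `occurred` and the fixed random seed are illustrative choices, and the faces are assumed equally likely.

```python
import random
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}               # sample space for one roll of a die
E = {x for x in S if x % 2 == 0}     # event: an even number is rolled

is_subset = E <= S                   # an event is a subset of the sample space
p_E = Fraction(len(E), len(S))       # equally likely faces assumed

def occurred(event, outcome):
    """An event has occurred if the observed outcome lies in it."""
    return outcome in event

rng = random.Random(0)               # fixed seed so the run is repeatable
outcome = rng.choice(sorted(S))      # one simulated roll
print(f"P(E) = {p_E}; rolled {outcome}; E occurred: {occurred(E, outcome)}")
```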

A sample of heights of 10,000 adult males gave rise to the
following histogram:
[Figure: Histogram showing the heights of 10,000 males. Horizontal axis: Height (cm), from 140 to 188 and "More". Vertical axis: Frequency, from 0 to 1400.]
Notice that this histogram is symmetrical and
bell-shaped. This is the characteristic shape of a
normal distribution.

If we were to draw a smooth curve through the mid-points of
the bars in the histogram of these heights, it would have the
following shape:
[Figure: normal curve]
This is called the normal curve.
The normal distribution is an appropriate model for many common
continuous distributions, for example:
The masses of new-born babies;
The IQs of school students;
The hand span of adult females;
The heights of plants growing in a field;
etc.

Area Under the Curve
•A normal distribution should be
viewed almost like a histogram
•Top Figure: The darker bars of
the histogram correspond to
ages ≤ 9 (~40% of distribution)
•Bottom Figure: shaded area
under the curve (AUC)
corresponds to ages ≤ 9 (~40%
of area)
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} $$

Parameters μ and σ
•Normal distributions have two parameters
μ - expected value (mean “mu”)
σ - standard deviation (sigma)
σ controls spread
μ controls location
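A short sketch of how the two parameters act on the Normal density. The values μ = 172 and σ = 8 are hypothetical (chosen to resemble heights in cm), and `normal_pdf` is an illustrative helper that is checked against `statistics.NormalDist` from the Python standard library.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density f(x) with mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# Hypothetical parameters, chosen only for illustration (heights in cm).
mu, sigma = 172.0, 8.0
heights = NormalDist(mu, sigma)

peak = normal_pdf(mu, mu, sigma)              # the curve is highest at x = mu
narrow_peak = normal_pdf(mu, mu, sigma / 2)   # halving sigma doubles the peak height
auc_below = heights.cdf(mu - sigma)           # area under the curve left of mu - sigma

print(f"f(mu) = {peak:.4f}; with sigma halved = {narrow_peak:.4f}")
print(f"P(X <= mu - sigma) = {auc_below:.4f}")
```

Changing `mu` slides the whole curve left or right without changing its shape; changing `sigma` stretches or squeezes it around the same centre.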

Mean and Standard Deviation of
Normal Density
[Figure: Normal density curve with μ and σ marked]

Standard Deviation σ
•Points of inflection lie one σ below and one σ above μ
•Practice sketching Normal curves
•Feel the inflection points (where the curve changes from bending downward to bending upward)
•Label the horizontal axis with σ landmarks

Symmetry in the Tails
Because the Normal curve is symmetrical and the total AUC is
exactly 1, we can easily determine the AUC in the tails.
[Figure: Normal curve with the central area labelled 95%]
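The tail calculation can be sketched as follows, using the standard Normal curve (μ = 0, σ = 1) and the 95% central area as the worked case. `NormalDist` comes from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()                    # standard Normal: mu = 0, sigma = 1

central = 0.95                      # area in the middle of the curve
tail = (1 - central) / 2            # the remainder is split equally between two tails
upper_cut = z.inv_cdf(1 - tail)     # point with `tail` area to its right
lower_cut = -upper_cut              # symmetry: mirror image about the mean

print(f"each tail = {tail:.3f}; cut points = {lower_cut:.2f} and {upper_cut:.2f}")
```

With 95% in the centre, each tail holds 2.5% of the area, and the cut points land close to 1.96 standard deviations either side of the mean.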

Assessing Departures
from Normality
[Figure: two panels: an approximately Normal histogram, and the same distribution on a Normal “Q-Q” plot]
Normal distributions adhere to the diagonal line on a
Quantile-Quantile plot

Negative Skew
Negative skew shows upward curve on Q-Q plot

Positive Skew
Positive skew shows downward curve on Q-Q plot

What is inferential statistics?
•Inferential statistics is a technique used to draw
conclusions about a population by testing data
taken from a sample of that population
•It is the process by which we generalize from a
sample to the population. It is assumed
that the characteristics of the sample are similar to
the population’s characteristics.
•It includes testing hypotheses and deriving
estimates

Descriptive and Inferential Statistics
Descriptive Statistics
•are used to organize and/or summarize the
parameters associated with data collection
(e.g., mean, median, mode, variance, standard
deviation)
Inferential Statistics
•are used to infer information about the
relationship between multiple samples or
between a sample and a population (e.g.,
t-test, ANOVA, Chi Square).

The process of inferential analysis
Raw Data
• It comprises all the data collected from the sample.
• Depending on the sample size, this data can be a large or small set of
measurements.
Sample Statistics
• They summarize the raw data gathered from the sample of the population
• These are the descriptive statistics (e.g. measures of central tendency)
Inferential Statistics
• These statistics then generate conclusions about the population based
on the sample statistics.

Inferential Statistics
• Inferential statistics are used to draw conclusions about a
population by examining the sample
We want to learn about population parameters…
…but we can only calculate sample statistics

Parameters and Statistics
We are going to illustrate the concept of inference by considering how
well a given sample mean x̄ (“x-bar”) reflects an underlying population
mean µ

•Accuracy of inference depends on the
representativeness of the sample from the population
•Random selection
•an equal chance for anyone to be selected
makes the sample more representative

•Inferential statistics help researchers test
hypotheses and answer research questions,
and derive meaning from the results
•a result found to be statistically significant by
testing the sample is assumed to also hold
for the population from which the sample was
drawn
•the ability to make such an inference is
based on the principle of probability

•Researchers set the significance level for
each statistical test they conduct
•by using probability theory as a basis for
their tests, researchers can assess how
likely a difference as large as the one they find
would be if chance alone were at work

Inferential Statistics Provide Two Environments:
•Test for Difference – To test whether a significant
difference exists between groups
•Tests for relationship – To test whether a
significant relationship exists between a
dependent (Y) variable and one or more independent (X) variables
•Relationship may also be predictive

Hypothesis Testing Using Basic Statistics
•Univariate Statistical Analysis
•Tests of hypotheses involving only one variable
•Bivariate Statistical Analysis
•Tests of hypotheses involving two variables
•Multivariate Statistical Analysis
•Statistical analysis involving three or more
variables or sets of variables.

Hypothesis Testing Procedure
• H0 – Null Hypothesis
• “There is no significant difference/relationship
between groups”
• Ha – Alternative Hypothesis
• “There is a significant difference/relationship
between groups”
• Always state your hypothesis/es in the null form
• The object of the research is to either reject or fail to reject
the null hypothesis/es

Examples
•Example 1: Three unrelated groups of people
choose what they believe to be the best color
scheme for a given website.
• The null hypothesis is: There is no relationship
between color scheme choice and type of group
•Example 2: Males and females rate their level of
satisfaction with a magazine using a 1-5 scale
• The null hypothesis is: There is no difference
in satisfaction level between genders

We can make two types of errors in hypothesis
testing:
In the population, H0 actually is: | Not reject H0 | Reject H0
True | Correct decision made | Type I error: researcher thinks there is an actual relationship between the variables when there is not
False | Type II error: there is an actual relationship between the variables although the researcher has not rejected the null hypothesis | Correct decision made

Concepts related to Sampling Error
• Sampling Error: The degree to which a sample differs
on a key variable from the population.
• Confidence Level:
The percentage of confidence intervals, across repeated samples, that
would contain the true value.
• Confidence Interval:
A calculated range for the true value, based on the sample estimate,
its standard error and the chosen confidence level.
• Why is Confidence Level Important? Confidence
levels, which indicate the level of error we are willing to
accept, are based on the concept of the normal curve
and probabilities. Generally, we set this level of
confidence at either 90%, 95% or 99%. At a 95%
confidence level, about 95 out of every 100 intervals built this way
from repeated samples would contain the true value.
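A minimal sketch of a confidence interval for a mean at the three levels mentioned above. It assumes the Normal critical value is acceptable; for a sample as small as this one a t critical value is customary and would give a slightly wider interval. The data are made up for illustration and `mean_confidence_interval` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(sample)
    centre = mean(sample)
    std_error = stdev(sample) / sqrt(n)          # estimated standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for a 95% level
    return centre - z * std_error, centre + z * std_error

# Hypothetical sample of 12 measurements, made up for illustration.
sample = [168, 172, 175, 169, 181, 177, 170, 174, 166, 179, 173, 171]

for level in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(sample, level)
    print(f"{level:.0%} CI: ({low:.2f}, {high:.2f})")
```

Asking for more confidence widens the interval: the 99% interval contains the 95% interval, which contains the 90% interval.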

We can theoretically draw numerous
samples from a population and
compute the mean of one variable in each.
The more samples we draw, the more clearly
the frequency distribution of those
sample means takes shape, and for
reasonably large samples it will resemble
a normal distribution

Important concepts about sampling
distributions:
•If a sample is representative of the population, the
mean (on a variable of interest) for the sample and the
population should be approximately the same.
•However, there will be some variation in the value of
sample means due to random or sampling error. This
refers to things you can’t necessarily control in a study
or when you collect a sample.
•The amount of variation that exists among sample
means from a population is called the standard error of
the mean.
•Standard error decreases as sample size increases.
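The last two points can be checked by simulation. The sketch below draws many random samples from a made-up population, records each sample mean, and compares the spread of those means with the textbook value σ/√n. The population (uniform on 0 to 100), the seed, and the helper name `simulated_standard_error` are all assumptions for illustration.

```python
import random
from math import sqrt
from statistics import pstdev

rng = random.Random(42)                                     # fixed seed for repeatability
population = [rng.uniform(0, 100) for _ in range(50_000)]   # hypothetical population
sigma = pstdev(population)                                  # population standard deviation

def simulated_standard_error(n, draws=1_000):
    """Spread of the sample mean across many random samples of size n."""
    means = []
    for _ in range(draws):
        sample = rng.sample(population, n)   # simple random sample, without replacement
        means.append(sum(sample) / n)
    return pstdev(means)

results = {}
for n in (25, 100, 400):
    results[n] = simulated_standard_error(n)
    print(f"n={n:>3}: simulated SE = {results[n]:.2f}, sigma/sqrt(n) = {sigma / sqrt(n):.2f}")
```

Quadrupling the sample size roughly halves the standard error, which is why the simulated values shrink as n goes from 25 to 100 to 400.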

Significance Levels and p-values
• Significance Level
• A critical probability associated with a statistical
hypothesis test that indicates how likely an inference
supporting a difference between an observed value and
some statistical expectation is true.
• The acceptable level of Type I error.
• p-value
• Probability value, or the observed or computed
significance level.
• p-values are compared to significance levels to test
hypotheses
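The comparison of a p-value with a significance level can be sketched as below. It assumes a two-sided test whose statistic follows the standard Normal distribution under H0; the observed statistic 2.10 is made up, and `two_sided_p_value` is an illustrative helper.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a statistic at least as extreme as z when H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05         # significance level, chosen before looking at the data
z_observed = 2.10    # hypothetical test statistic, made up for illustration

p_value = two_sided_p_value(z_observed)
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"p = {p_value:.4f}, alpha = {alpha}: {decision}")
```

The rule is the same for any test: H0 is rejected when the p-value is no larger than the chosen significance level.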

Testing for Significant Difference
•Testing for significant difference is a type of
inferential statistic
•One may test difference based on any type of
data
•Determining what type of test to use is based
on what type of data are to be tested.

Example: Types of Relationships
Positive | Negative | No Relationship
Income ($) | Education (yrs) | Income ($) | Education (yrs) | Income ($) | Education (yrs)
20,000 | 10 | 20,000 | 18 | 20,000 | 14
30,000 | 12 | 30,000 | 16 | 30,000 | 18
40,000 | 14 | 40,000 | 14 | 40,000 | 10
50,000 | 16 | 50,000 | 12 | 50,000 | 12
75,000 | 18 | 75,000 | 10 | 75,000 | 16

Income and Education
[Figure: chart of Income and Education for observations 1 to 5. Vertical axis from 0 to 80000. Series: Income, Education.]
In inferential statistics, the
hypothesis that is actually
tested is the null hypothesis.
Therefore, what we must do is
find evidence against the claim that
no relationship between the variables
exists.

Different types of inferential
statistics

Chi Square
A chi square (χ²) statistic is used to investigate
whether distributions of categorical (i.e.
nominal/ordinal) variables differ from one
another.

General Notation for a chi square 2x2
Contingency Table
Columns: Variable 1. Rows: Variable 2.
Variable 2 | Data Type 1 | Data Type 2 | Totals
Category 1 | a | b | a+b
Category 2 | c | d | c+d
Total | a+c | b+d | a+b+c+d
$$ \chi^2 = \frac{(ad - bc)^2 \,(a + b + c + d)}{(a + b)(c + d)(b + d)(a + c)} $$
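The 2x2 shortcut, χ² = (ad − bc)²(a + b + c + d) / [(a + b)(c + d)(b + d)(a + c)], is easy to compute directly. In the sketch below the counts are made up, `chi_square_2x2` is an illustrative name, and no continuity correction is applied.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (b + d) * (a + c)
    if denominator == 0:
        raise ValueError("every row and column total must be positive")
    return (a * d - b * c) ** 2 * n / denominator

# Hypothetical counts, made up for illustration.
statistic = chi_square_2x2(30, 10, 20, 40)
critical_value = 3.841    # chi-square critical value for df = 1 at alpha = 0.05
print(f"chi-square = {statistic:.2f}, significant: {statistic > critical_value}")
```

A 2x2 table has one degree of freedom, so the statistic is compared with the df = 1 critical value.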

T test
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{S_{\bar{x}_1 - \bar{x}_2}} $$
where
$\bar{x}_1$ = Mean for group 1
$\bar{x}_2$ = Mean for group 2
$S_{\bar{x}_1 - \bar{x}_2}$ = Pooled, or combined, standard error of the difference
between means
The pooled estimate of the standard error is a better
estimate of the standard error than one based on
independent samples.
Uses of the t test
•Assesses whether the mean of a group of
scores is statistically different from the
population mean (One sample t test)
•Assesses whether the means of two groups of
scores are statistically different from each other
(Two sample t test)
•Cannot be used with more than two samples
(use ANOVA instead)
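A sketch of the two-sample t statistic with a pooled standard error, assuming both groups share a common variance. The ratings are invented and `pooled_t_statistic` is an illustrative helper, not a library function.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(group1, group2):
    """Two-sample t statistic using the pooled standard error of the difference."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / std_error, n1 + n2 - 2

# Hypothetical satisfaction ratings on a 1-5 scale, made up for illustration.
group1 = [4, 5, 3, 4, 5, 4]
group2 = [3, 2, 4, 3, 3, 2]
t, df = pooled_t_statistic(group1, group2)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The statistic is then compared with a t distribution on n1 + n2 − 2 degrees of freedom to obtain a p-value.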

ANOVA
• In statistics, analysis of variance (ANOVA) is a collection of
statistical models, and their associated procedures, in which
the observed variance in a particular variable is partitioned
into components attributable to different sources of
variation.
• In its simplest form ANOVA provides a statistical test of
whether or not the means of several groups are all equal,
and therefore generalizes t-test to more than two groups.
• Doing multiple two-sample t-tests would result in an
increased chance of committing a type I error. For this
reason, ANOVAs are useful in comparing two, three or more
means.

ANOVA Hypothesis Testing
•Tests hypotheses that involve comparisons of two or
more populations
•The overall ANOVA test will indicate if a difference
exists between any of the groups
•However, the test will not specify which groups are
different
•Therefore, the null hypothesis will state that
there is no significant difference between any of
the groups
$$ H_0: \mu_1 = \mu_2 = \mu_3 $$
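The overall ANOVA test compares variation between the group means with variation inside the groups. Below is a sketch of the one-way F statistic, using three invented groups; `one_way_anova_f` is an illustrative helper.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """F statistic for a one-way ANOVA: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within

# Hypothetical scores for three groups, made up for illustration.
f_stat, df_b, df_w = one_way_anova_f([4, 5, 6], [6, 7, 8], [8, 9, 10])
print(f"F = {f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

A large F says that at least one group mean differs, but not which one; a follow-up (post hoc) comparison is needed for that.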

Regression Analysis
•The description of the nature of the relationship
between two or more variables
•It is concerned with the problem of describing
or estimating the value of the dependent
variable on the basis of one or more
independent variables.

Predictive Versus Explanatory Regression
Analysis
•Prediction – to develop a model to predict
future values of a response variable (Y) based
on its relationships with predictor variables (X’s)
•Explanatory Analysis – to develop an
understanding of the relationships between
response variable and predictor variables

Correlation Analysis
• Spearman's correlation:
It is the nonparametric version of the Pearson product-
moment correlation.
Spearman correlation is often used to evaluate
relationships involving ordinal variables.
For example, you might use a Spearman correlation to
evaluate whether the order in which employees
complete a test exercise is related to the number of
months they have been employed.
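Spearman's correlation can be computed by ranking each variable and then applying the Pearson formula to the ranks. The sketch below follows the employee example with invented numbers; `ranks`, `pearson` and `spearman` are illustrative helpers, and ties are given the average of their ranks.

```python
from math import sqrt

def ranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1       # sorted positions i..j share this rank
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: finishing order in a test exercise and months employed.
finish_order = [1, 2, 3, 4, 5, 6]
months_employed = [30, 24, 26, 12, 9, 3]
rho = spearman(finish_order, months_employed)
print(f"Spearman rho = {rho:.3f}")
```

In this invented data set the longer-serving employees tend to finish earlier, so the coefficient is strongly negative.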

Correlation analysis
• Pearson correlation:
It is the test statistic that measures the statistical
relationship, or association, between two
continuous variables.
It is often regarded as the best method of measuring the
association between variables of interest because it is
based on the method of covariance.
Pearson's correlation is used when you want to see if
there is a linear relationship between two
quantitative variables.
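As a sketch, the Pearson coefficient can be computed for the income and education figures from the earlier "Types of Relationships" example. The helper name `pearson_r` is illustrative; the data are the values listed in that table.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by the two standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

income = [20_000, 30_000, 40_000, 50_000, 75_000]
education = {
    "positive": [10, 12, 14, 16, 18],
    "negative": [18, 16, 14, 12, 10],
    "no relationship": [14, 18, 10, 12, 16],
}

r = {label: pearson_r(years, income) for label, years in education.items()}
for label, value in r.items():
    print(f"{label:>15}: r = {value:+.3f}")
```

The coefficient is close to +1 for the positive pattern, close to −1 for the negative pattern, and close to 0 where the table shows no relationship.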

Simple Regression Model
$$ y = a + bx $$
Slope: $$ b = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2} $$
Intercept: $$ a = \frac{\sum Y - b\sum X}{N} $$
Where:
y = Dependent Variable
x = Independent Variable
b = Slope of Regression Line
a = Intercept point of line
N = Number of values
X = First Score
Y = Second Score
ΣXY = Sum of the product of 1st & 2nd scores
ΣX = Sum of First Scores
ΣY = Sum of Second Scores
ΣX² = Sum of squared First Scores
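The slope and intercept can be computed from the raw sums defined above: b = (NΣXY − ΣXΣY) / (NΣX² − (ΣX)²) and a = (ΣY − bΣX) / N. The scores in this sketch are invented and `simple_regression` is an illustrative helper.

```python
def simple_regression(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x, from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Hypothetical first (X) and second (Y) scores, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = simple_regression(xs, ys)
predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]   # actual minus predicted
print(f"y = {a:.2f} + {b:.2f}x")
```

For a least-squares line with an intercept, the residuals sum to zero apart from rounding error, which is a quick check on the arithmetic.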

Simple regression model
[Figure: plot on x and y axes of the actual values Y_i and the predicted values Ŷ_i on the fitted line, marking the slope (b), the intercept (a) and the residuals r_i = Y_i − Ŷ_i]
