BasicStatistics.pdf

Sweet AI
Variables
Quantitative (Histogram)
  Discrete (number of students in a class)
  Continuous (weight)
    Interval (temperature)
    Ratio (height, age)
Categorical/Qualitative (Bar plot)
  Binary (spam/safe)
  Nominal (non-sortable: colors, genre)
  Ordinal (sortable: grades, product rating)

Probability
Independent event
Dependent event
Conditional probability: $P(A|B) = \frac{P(A \cap B)}{P(B)}$
Multiplication rule / Intersection
  Dependent events: $P(A \cap B) = P(A) \cdot P(B|A) = P(B) \cdot P(A|B)$
  Independent events: $P(A \cap B) = P(A) \cdot P(B)$
Addition rule / Union: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Complement rule: $P(A^c) = 1 - P(A)$
Bayes' theorem: $P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}$
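As a small illustration of the rules above, the sketch below applies the complement rule, the multiplication rule, and Bayes' theorem to a spam filter. The three input probabilities are made-up numbers chosen for illustration; they do not come from the slides.

```python
# Hypothetical inputs (assumed for illustration only).
p_spam = 0.2               # P(A): prior probability that a message is spam
p_word_given_spam = 0.6    # P(B|A): a trigger word appears in a spam message
p_word_given_safe = 0.05   # P(B|not A): the same word appears in a safe message

# Complement rule: P(not A) = 1 - P(A)
p_safe = 1 - p_spam

# P(B) as the sum of two intersections, each from the multiplication rule:
# P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_word = p_word_given_spam * p_spam + p_word_given_safe * p_safe

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(word) = {p_word:.2f}")
print(f"P(spam | word) = {p_spam_given_word:.2f}")
```

With these assumed inputs, a message containing the word is spam with probability 0.75, even though only 20% of all messages are spam.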
Permutation (order matters)
n: number of items in the set, r: number of spots (examples use n = 2 items A and B, r = 2)
  Repetition: $n^r$, ex: AB, BA, AA, BB
  No repetition: $\frac{n!}{(n-r)!}$, ex: AB, BA
Combination (order doesn't matter)
  Repetition: $\frac{(n+r-1)!}{r!\,(n-1)!}$, ex: AA, BA, BB
  No repetition: $\frac{n!}{r!\,(n-r)!}$, ex: AB
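The four counting formulas can be checked by enumerating the outcomes directly. This sketch uses Python's standard `itertools` and `math` modules on the same two-item set {A, B} with r = 2 spots; the variable names are illustrative.

```python
import math
from itertools import (
    combinations,
    combinations_with_replacement,
    permutations,
    product,
)

items = ["A", "B"]      # n = 2 items
n, r = len(items), 2    # r = 2 spots

# Permutation with repetition: n^r
perm_rep = ["".join(p) for p in product(items, repeat=r)]
# Permutation without repetition: n! / (n - r)!
perm_norep = ["".join(p) for p in permutations(items, r)]
# Combination with repetition: (n + r - 1)! / (r! (n - 1)!)
comb_rep = ["".join(c) for c in combinations_with_replacement(items, r)]
# Combination without repetition: n! / (r! (n - r)!)
comb_norep = ["".join(c) for c in combinations(items, r)]

# Each enumeration has exactly as many entries as its formula predicts.
print(perm_rep, n ** r)
print(perm_norep, math.perm(n, r))
print(comb_rep, math.comb(n + r - 1, r))
print(comb_norep, math.comb(n, r))
```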

Basic Concepts
Concept: Description
Population: The entire group that you want to draw conclusions about, e.g., all school students in the USA.
Sample: A smaller set randomly drawn from the population, e.g., 700 students randomly selected from different schools in the USA.
Outlier / Noise / Anomaly: Data points that lie at an abnormal distance from the other observations; they can skew the model.
Variate:
  Univariate → one variable
  Bivariate → two variables
  Multivariate → more than two variables
Sampling Methods
  Probability: Simple random, Systematic, Stratified, Cluster
  Non-probability: Convenience, Snowball, Quota, Purposive
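To make three of the probability sampling methods concrete, here is a minimal sketch using only the standard `random` module. The population of 100 student records, the `grade` field, and the sample size of 10 are all hypothetical choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 students, 20 seniors and 80 juniors.
population = [
    {"id": i, "grade": "junior" if i % 5 else "senior"} for i in range(100)
]
sample_size = 10

# Simple random sampling: every member has the same chance of selection.
simple = random.sample(population, k=sample_size)

# Systematic sampling: random start, then every step-th member.
step = len(population) // sample_size
start = random.randrange(step)
systematic = population[start::step]

# Stratified sampling: sample each subgroup in proportion to its size.
strata = {}
for person in population:
    strata.setdefault(person["grade"], []).append(person)

stratified = []
for members in strata.values():
    k = round(sample_size * len(members) / len(population))
    stratified.extend(random.sample(members, k))

print(len(simple), len(systematic), len(stratified))
```

Stratifying guarantees that the sample keeps the population's 20/80 split between seniors and juniors, which simple random sampling only achieves on average.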

Statistical Measures
  Central Tendency: Mean, Median, Mode
  Central Dispersion: Range, Variance, Standard Deviation, IQR
  Association: Covariance, Correlation

Basic Measurement Concepts
Central Tendency: Description. Example
Mean / Average ($\mu$, $\bar{x}$): The sum of the numbers divided by how many numbers there are. Sensitive to outliers. Example: [4, 3, 7, 2, 3, 6]: (4 + 3 + 7 + 2 + 3 + 6) / 6 = 25 / 6 ≈ 4.17
Median (Med, M): Sort the numbers and take the middle one (the average of the two middle numbers when the count is even). Example: [4, 3, 7, 2, 3, 6] → [2, 3, 3, 4, 6, 7] → (3 + 4) / 2 = 3.5
Mode: The most frequently occurring number. Example: [4, 3, 7, 2, 3, 6] → 3
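The three measures of central tendency can be reproduced with Python's standard `statistics` module, using the same six numbers as the example column. The variable names are illustrative.

```python
from statistics import mean, median, mode

data = [4, 3, 7, 2, 3, 6]

avg = mean(data)     # sum of 25 divided by 6 values, about 4.17
mid = median(data)   # average of the two middle sorted values (3 and 4)
most = mode(data)    # 3 is the only value that appears twice

print(f"mean = {avg:.2f}, median = {mid}, mode = {most}")
```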

Central Dispersion
Dispersion: Description. Example
Range: The difference between the largest and smallest number. Example: [4, 3, 7, 2, 3, 6]: 7 − 2 = 5
Variance ($\sigma^2$): Shows how spread out the data points are; measures the width of the distribution around the mean.
  Population: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
  Sample: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
  Example: [4, 3, 7, 2, 3, 6]: $\mu = 25/6 \approx 4.17$ → the squared deviations sum to ≈ 18.83 → $\sigma^2 = 18.83 / 6 \approx 3.14$
Standard Deviation ($\sigma$): How spread out the data is around the mean; used to identify outliers. Data points more than two or three standard deviations from the mean are commonly treated as unusual.
  $\sigma = \sqrt{\sigma^2}$
  Example: [4, 3, 7, 2, 3, 6]: $\sigma = \sqrt{3.14} \approx 1.77$
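A short check of the population and sample formulas on the same six numbers, using the standard `statistics` module (`pvariance`/`pstdev` divide by N, `variance`/`stdev` divide by n − 1). The variable names are illustrative.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [4, 3, 7, 2, 3, 6]

pop_var = pvariance(data)   # population variance: divide by N
samp_var = variance(data)   # sample variance: divide by n - 1
pop_sd = pstdev(data)       # square root of the population variance
samp_sd = stdev(data)       # square root of the sample variance

print(f"population: variance = {pop_var:.2f}, sd = {pop_sd:.2f}")
print(f"sample:     variance = {samp_var:.2f}, sd = {samp_sd:.2f}")
```

The sample versions are slightly larger because dividing by n − 1 compensates for estimating the mean from the same data.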
Standard Error (SE): The population standard deviation describes how spread out individual values are around the population mean. The standard error estimates how far a sample mean is likely to be from the population mean.
  $SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ ($\sigma$: population standard deviation, n: number of data points in the sample) → report the result as mean ± SE
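A minimal sketch of the standard error calculation follows. The helper name `standard_error` is hypothetical, and because the population σ is rarely known in practice, the example assumes the sample standard deviation is used as its estimate.

```python
import math
from statistics import mean, stdev


def standard_error(sd, n):
    """SE of the mean: standard deviation divided by sqrt(sample size)."""
    return sd / math.sqrt(n)


sample = [4, 3, 7, 2, 3, 6]
n = len(sample)

# Assumption: sigma is unknown, so the sample standard deviation stands in.
se = standard_error(stdev(sample), n)

print(f"mean ± SE = {mean(sample):.2f} ± {se:.2f}")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.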

Concept: Description. Example
Quartiles: Sort all data points in ascending order, find the median, then find the median of each of the two halves:
  • Q1: lower / first
  • Q2: median / middle / second
  • Q3: upper / third
  Example: [2, 3, 3, 4, 6, 7, 8, 12, 19, 19, 24, 26] → [2, 3, 3 | 4, 6, 7 | 8, 12, 19 | 19, 24, 26], so Q1 = 3.5, Q2 = 7.5, Q3 = 19
Interquartile Range (IQR): IQR = Q3 − Q1
  Lower fence = Q1 − 1.5 × IQR; values below it are outliers
  Upper fence = Q3 + 1.5 × IQR; values above it are outliers
Percentiles: 99 values that split the sample into 100 equal-size subsamples
Central Dispersion
[Figure: box plot marking Q1, Q3, the IQR, both whiskers, and the fences at 1.5 × IQR]
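The quartile and fence rules can be sketched in a few lines. The helper `quartiles` is a hypothetical name, and it implements the median-of-halves method described in the table; other conventions (for example, interpolated percentiles in numerical libraries) can give slightly different quartile values.

```python
from statistics import median


def quartiles(values):
    """Median-of-halves quartiles; the middle value is skipped for odd sizes."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


data = [2, 3, 3, 4, 6, 7, 8, 12, 19, 19, 24, 26]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

# Fences: anything outside [lower_fence, upper_fence] is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, lower_fence, upper_fence, outliers)
```

For this data set the fences are −19.75 and 42.25, so no value is flagged; a value such as 50 would be.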

Association
Association: Description
Covariance: Measures how two variables vary together.
  Positive value → the two variables move in the same direction
  Negative value → the two variables move in opposite directions
  Closer to zero indicates a weaker relationship
  Farther from zero indicates a stronger relationship (the magnitude depends on the variables' units)
Pearson Correlation Coefficient / Pearson's r: Measures the strength and direction of a linear relationship between two variables.
  −1 ≤ r ≤ +1; values near −1 indicate a strong negative relationship, values near +1 a strong positive relationship
  r = 0 → no linear correlation
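Population: $\mathrm{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N}$
Sample: $\mathrm{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N - 1}$
Image from U of Wisconsin.
$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$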
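The covariance and Pearson formulas translate almost line for line into code. In this sketch the function names and the `hours`/`scores` data are hypothetical, chosen only to show a strong positive relationship.

```python
import math


def covariance(xs, ys, sample=True):
    """Covariance; divides by N - 1 for a sample, by N for a population."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


def pearson_r(xs, ys):
    """Pearson's r: sum of co-deviations over the root of the squared sums."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return num / den


# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

print(f"sample covariance     = {covariance(hours, scores):.2f}")
print(f"population covariance = {covariance(hours, scores, sample=False):.2f}")
print(f"Pearson's r           = {pearson_r(hours, scores):.3f}")
```

Unlike covariance, r is unitless: rescaling `scores` (say, to a 0 to 1 scale) changes the covariance but leaves r unchanged.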

Distribution
Credit: Harold Toomey, WyzAnt Tutor

Distribution
Discrete/ mass function
Continuous/ density function

Hypothesis Testing
Hypotheses
Alternative (H1/Ha)
e.g., men's salaries are higher than women's salaries for the same job position
Null (H0)
e.g., men's salaries are equal to women's salaries for the same job position
Non-Directional
Directional
Statistical

Hypothesis Testing
Hypothesis Test
  Parametric Test
    Regression: Simple Linear Regression, Multiple Linear Regression, Logistic Regression
    Comparison: t-test, ANOVA, MANOVA
    Correlation: Pearson's r
  Non-Parametric Test
    Spearman's r
    Chi square test
    ANOSIM
    Wilcoxon
    Sign test

Hypothesis Testing
1. State H0 & H1
   H0: Men are, on average, not getting a higher salary than women.
   Ha: Men are, on average, getting a higher salary than women.
2. Collect testing samples
   Equal proportion of men & women in a variety of industries within one country
3. Select & execute statistical test
   One-tailed t-test
4. Infer the results (reject / fail to reject H0)
   Average difference of 20k and p-value of 0.002, which is consistent with H1

Terminology: Description
Significance level ($\alpha$): A threshold for deciding whether a test statistic is statistically significant. Statistical significance means that a relationship between variables is unlikely to be caused by chance alone. $\alpha$ is the area inside the tail(s) of the H0 distribution.
  $\alpha$ = 1 − (confidence level / 100) → common choices of $\alpha$: 0.01, 0.05, 0.1
P-value (probability value): Measures how plausible the observed data are under the null hypothesis, and so whether H0 should be rejected: P(sample statistic at least this extreme | H0 true)
  0 ≤ p-value ≤ 1
  P-value ≥ $\alpha$: results are not statistically significant, fail to reject H0 (the null must fly!)
  P-value < $\alpha$: results are statistically significant, reject H0 (the null must go!)
Basic Concepts
H0 is ...       True                      False
Rejected        Type I error ($\alpha$)   Correct
Not rejected    Correct                   Type II error ($\beta$)
https://www.abtasty.com

Z-test
• Used to test whether two groups of data differ from each other when the population standard deviation is known
• Normal distribution formula: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
• To calculate the percentile of a data point, standardize the normal distribution to the standard normal distribution
• The standard normal distribution has $\mu = 0$ and $\sigma = 1$
• How to determine x's percentile/probability, or how far from typical a result is:
1. Standardize the values of the normal distribution by calculating the z-score from the population $\mu$ and population $\sigma$
   ◦ for a single raw datum x: $z = \frac{x - \mu}{\sigma}$
   ◦ for the mean of n independent and identically distributed samples: $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
   ◦ for a proportion: $Z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$ ($\hat{p}$: observed sample proportion, p: hypothesized population proportion, n: sample size)
2. Look up the z-score in a z-table to map it to the area under the normal distribution curve and obtain the p-value
3. Compare the p-value with $\alpha$: if p-value ≥ $\alpha$, fail to reject H0; otherwise reject H0
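The three z-test steps can be sketched with the standard library, where `statistics.NormalDist` plays the role of the z-table. The function name `z_test_mean`, its `tail` parameter, and the numbers in the example are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_mean(sample_mean, mu0, sigma, n, tail="two"):
    """Return (z, p-value) for a sample mean against a hypothesized mean mu0."""
    # Step 1: standardize using the population sigma.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Step 2: area under the standard normal curve (the z-table lookup).
    area_left = NormalDist().cdf(z)
    if tail == "two":
        p = 2 * min(area_left, 1 - area_left)
    elif tail == "right":
        p = 1 - area_left
    elif tail == "left":
        p = area_left
    else:
        raise ValueError("tail must be 'two', 'right', or 'left'")
    return z, p


# Hypothetical example: 36 observations with mean 103, H0: mu = 100, sigma = 9.
z, p = z_test_mean(sample_mean=103, mu0=100, sigma=9, n=36)

# Step 3: compare the p-value with alpha.
alpha = 0.05
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p:.4f} -> {decision}")
```

Here z = 2.0 and the two-tailed p-value is about 0.0455, so H0 is rejected at α = 0.05 but would not be at α = 0.01.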

Z-test
www.z-table.com

Student t-test
• Used to test whether two groups of data differ from each other when the population standard deviation is unknown
• Assumptions:
  • Normal distribution
  • Similar variance for each group/sample
  • Same number of data points in each group/sample (20-30); for larger samples a z-test can be used instead
• H0: There is no difference between groups
• H1: There is a difference between groups
Types of t-test: Description. Formula. Degrees of freedom
One-sample t-test: Tests whether a population mean is equal to some value $\mu$. ($\bar{x}$: sample mean, $\mu$: population mean, s: sample standard deviation, n: sample size)
  $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, df = n − 1
Dependent / paired-samples t-test: Tests whether two population means are equal by sampling the same population twice. (d: difference between the two paired measurements, n: number of pairs)
  $t = \frac{\sum d}{\sqrt{\frac{n \sum d^2 - (\sum d)^2}{n - 1}}}$, df = n − 1
Independent two-sample t-test / unpaired-samples t-test: Tests whether two population means are equal, using two independent samples that may differ in size. ($s_1^2$, $s_2^2$: sample variances)
  $t = \frac{\text{signal}}{\text{noise}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, df = n1 + n2 − 2
  (df = n1 + n2 − 2 assumes similar variances; with clearly unequal variances, Welch's approximation of df is used instead.)
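The three t-value formulas in the table can be written out directly with the standard library. The function names and the small data sets below are hypothetical; the code only computes the t-value and df, leaving the critical-value lookup to a t-table.

```python
import math
from statistics import mean, stdev, variance


def one_sample_t(sample, mu):
    """t = (x_bar - mu) / (s / sqrt(n)), df = n - 1."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / math.sqrt(n)), n - 1


def paired_t(before, after):
    """t = sum(d) / sqrt((n * sum(d^2) - sum(d)^2) / (n - 1)), df = n - 1."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    sum_d = sum(d)
    sum_d2 = sum(x * x for x in d)
    return sum_d / math.sqrt((n * sum_d2 - sum_d ** 2) / (n - 1)), n - 1


def independent_t(sample1, sample2):
    """t = (x1_bar - x2_bar) / sqrt(s1^2/n1 + s2^2/n2), df = n1 + n2 - 2."""
    n1, n2 = len(sample1), len(sample2)
    noise = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    return (mean(sample1) - mean(sample2)) / noise, n1 + n2 - 2


# Hypothetical data for each test.
t1, df1 = one_sample_t([4, 3, 7, 2, 3, 6], mu=3)
t2, df2 = paired_t(before=[10, 12, 9, 11], after=[12, 14, 10, 13])
t3, df3 = independent_t([5, 7, 9], [1, 3, 5])

print(f"one-sample:  t = {t1:.3f}, df = {df1}")
print(f"paired:      t = {t2:.3f}, df = {df2}")
print(f"independent: t = {t3:.3f}, df = {df3}")
```

The paired formula is algebraically the same as a one-sample t-test of the differences d against zero, which is a quick way to sanity-check an implementation.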

Student t-test
stanford.edu
t-value < critical value → Do not reject H0
t-value > critical value → Reject H0
(for a two-tailed test, compare |t-value| with the critical value)
Degrees of freedom (df)
1. Calculate the t-value and df
2. Decide on a one- or two-tailed test and the level of confidence
3. Look up the critical value in the t-table and decide whether to reject or fail to reject H0

t-test vs. z-test
Start
  Known $\sigma$
    sample size < 30: Is the population highly skewed?
      No → t-test
      Yes → sign test
    sample size >= 30: skewed?
      No → z-test
      Yes → alternative methods
  Not known $\sigma$: skewed?
    No → t-test
    Yes → alternative methods
• The t-test and the z-test are both used to assess whether a difference observed in a set of data (for example, between means) is statistically significant.
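One reading of the flowchart above can be captured as a small decision function. The function name `choose_test` and its parameters are hypothetical, and the branch order assumes that the skewed ("Yes") branches lead to the sign test or alternative methods.

```python
def choose_test(sigma_known, n, skewed):
    """Pick a test following the t-test vs. z-test flowchart (one reading)."""
    if sigma_known:
        if n < 30:
            # Small sample: a highly skewed population rules out the t-test.
            return "sign test" if skewed else "t-test"
        # Large sample with known sigma.
        return "alternative methods" if skewed else "z-test"
    # Sigma unknown: t-test unless the population is skewed.
    return "alternative methods" if skewed else "t-test"


print(choose_test(sigma_known=True, n=50, skewed=False))
print(choose_test(sigma_known=False, n=20, skewed=False))
print(choose_test(sigma_known=True, n=12, skewed=True))
```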

Analysis Of Variance (ANOVA)
• ANOVA determines the effects of one or more categorical independent variables on one numerical dependent variable.
ANOVA
  One-way: 1 independent categorical variable on a single dependent variable
  Two-way: 2 independent categorical variables on a single dependent variable
  N-way: multiple independent categorical variables

Analysis Of Variance (ANOVA)
1. Calculate the variance between groups and within groups
2. Calculate the degrees of freedom
3. Compute the F-value: $F = \frac{\text{variance between groups}}{\text{variance within groups}} = \frac{SSBG / df_1}{SSWG / df_2}$, with $df_1 = m - 1$, $df_2 = m(n - 1)$, n: number of samples in each group, m: number of groups
4. Find the critical value (F-critical) in the F distribution table using df1, df2, and a selected alpha: http://www.socr.ucla.edu/Applets.dir/F_Table.html
5. Compare the F-value with the critical value: if F-value < F-critical, fail to reject H0; otherwise H0 is rejected
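Steps 1 to 3 of a one-way ANOVA can be sketched without any library. The function name `one_way_anova_f` and the three groups are hypothetical; the code uses the standard one-way degrees of freedom, df1 = m − 1 (groups minus one) and df2 = N − m (total observations minus groups), which equals m(n − 1) when every group has n values.

```python
def one_way_anova_f(groups):
    """Return (F, df1, df2) for a one-way ANOVA over a list of groups."""
    m = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Sum of squares between groups: group means vs. the grand mean.
    ssbg = sum(
        len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means)
    )
    # Sum of squares within groups: each value vs. its own group mean.
    sswg = sum(
        (x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g
    )

    df1 = m - 1
    df2 = n_total - m
    f_value = (ssbg / df1) / (sswg / df2)
    return f_value, df1, df2


# Hypothetical data: three groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df1, df2 = one_way_anova_f(groups)
print(f"F = {f_value:.2f}, df1 = {df1}, df2 = {df2}")
```

For step 4, the critical value can be read from an F table, or, if SciPy is available, obtained with `scipy.stats.f.ppf(1 - alpha, df1, df2)`.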

Hypothesis Testing towardsdatascience.com

Python Library
Type of Test: SciPy Code
Determine whether data follow a Gaussian distribution:
  from scipy.stats import shapiro, normaltest
  stat, p = shapiro(data)  # p > 0.05: no evidence against a Gaussian distribution
Determine the linear relationship of two samples:
  from scipy.stats import pearsonr
  stat, p = pearsonr(data1, data2)  # p > 0.05: no significant linear correlation
Determine the monotonic relationship of two samples:
  from scipy.stats import spearmanr, kendalltau
  stat, p = spearmanr(data1, data2)  # p > 0.05: no significant monotonic relationship
Determine the relationship of two categorical variables:
  from scipy.stats import chi2_contingency
  stat, p, dof, expected = chi2_contingency(table)  # p > 0.05: the variables are likely independent
Determine a z-score or percentile:
  from scipy import stats
  stats.norm.cdf(z)  # percentile (area to the left) for a z-score
  stats.norm.ppf(p)  # z-score for a percentile p
Determine whether the means of two independent, normally distributed samples are significantly different (Student t-test):
  from scipy.stats import ttest_ind
  stat, p = ttest_ind(data1, data2)  # p > 0.05: no significant difference in means
Determine whether the means of two paired, normally distributed samples are significantly different (Student t-test):
  from scipy.stats import ttest_rel
  stat, p = ttest_rel(data1, data2)  # p > 0.05: no significant difference in means
Determine whether the means of two or more independent, normally distributed samples are significantly different (ANOVA):
  from scipy.stats import f_oneway
  stat, p = f_oneway(data1, data2, data3)  # p > 0.05: no significant difference in means
Determine whether the distributions of two independent samples are equal (Mann-Whitney U test):
  from scipy.stats import mannwhitneyu
  stat, p = mannwhitneyu(data1, data2)  # p > 0.05: no significant difference between the distributions
