4. Sweet AI Basic Concepts
Concept: Description
Population: The entire dataset that you want to draw conclusions about, e.g., all school students in the USA.
Sample: A smaller set randomly drawn from the population, e.g., 700 volunteer students from different schools in the USA.
Outlier / Noise / Anomaly: Data points that lie at an abnormal distance from the other observations; they can skew the model.
Variate:
• Univariate → one variable
• Bivariate → two variables
• Multivariate → more than two variables
Sampling Methods
Probability: Simple random, Systematic, Stratified, Cluster
Non-probability: Convenience, Snowball, Quota, Purposive
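The four probability sampling methods can be sketched with the standard library alone. This is a minimal illustration, not a survey-grade implementation: the population of 1,000 student records, the `school` field used as the stratum and cluster key, and the sample sizes are all hypothetical.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 student records spread over 4 schools
population = [{"id": i, "school": f"school_{i % 4}"} for i in range(1000)]

# Simple random sampling: every unit has the same chance of being selected
simple = random.sample(population, k=100)

# Systematic sampling: random starting point, then every `step`-th unit
step = len(population) // 100
start = random.randrange(step)
systematic = population[start::step]

# Stratified sampling: draw the same fraction (10%) from every stratum
strata = {}
for unit in population:
    strata.setdefault(unit["school"], []).append(unit)
stratified = [
    unit
    for group in strata.values()
    for unit in random.sample(group, k=len(group) // 10)
]

# Cluster sampling: pick whole clusters at random and keep all of their units
chosen_clusters = random.sample(sorted(strata), k=2)
cluster = [unit for name in chosen_clusters for unit in strata[name]]

print(len(simple), len(systematic), len(stratified), len(cluster))
```

Non-probability methods (convenience, snowball, quota, purposive) depend on how participants are recruited, so they are not meaningfully reproduced by a random-number sketch like this one.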
5. Sweet AI
Statistical Measures
Central Tendency: Mean, Median, Mode
Central Dispersion: Range, Variance, Standard Deviation, IQR
Association: Covariance, Correlation
6. Sweet AI
Basic Measurement Concepts
Central Tendency: Description; Example
Mean / Average ($\mu$, $\bar{x}$): The sum of the values divided by the number of values. Sensitive to outliers. Example: [4, 3, 7, 2, 3, 6]: (4 + 3 + 7 + 2 + 3 + 6) / 6 = 25 / 6 ≈ 4.17
Median (Med, M): Sort the values and take the middle one (the average of the two middle values when the count is even). Example: [4, 3, 7, 2, 3, 6] → [2, 3, 3, 4, 6, 7] → (3 + 4) / 2 = 3.5
Mode: The most frequently occurring value. Example: [4, 3, 7, 2, 3, 6] → 3
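The three measures can be checked on the slide's example list with Python's built-in `statistics` module. This is only a small sketch; the variable names are illustrative.

```python
import statistics

data = [4, 3, 7, 2, 3, 6]

mean = statistics.mean(data)      # sum of the values divided by their count (25 / 6)
median = statistics.median(data)  # average of the two middle values of the sorted list
mode = statistics.mode(data)      # the value that occurs most often

print(round(mean, 2), median, mode)
```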
7. Sweet AI
Central Dispersion
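Dispersion: Description; Example
Range: The difference between the smallest and largest values. Example: [4, 3, 7, 2, 3, 6]: 7 - 2 = 5
Variance ($\sigma^2$): Shows how spread out the data points are; it measures the width of the distribution around the mean.
Population: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Example: [4, 3, 7, 2, 3, 6]: $\mu = 25/6 \approx 4.17$ → deviations ≈ (-0.17, -1.17, 2.83, -2.17, -1.17, 1.83) → sum of squared deviations ≈ 18.83 → $\sigma^2 \approx 18.83 / 6 \approx 3.14$
Standard Deviation ($\sigma$): Describes how spread out the data is around the mean and is used to identify outliers. Data points that lie far from the mean in units of standard deviation (a common rule of thumb is more than 2 or 3) may be considered unusual. $\sigma = \sqrt{\sigma^2}$
Example: [4, 3, 7, 2, 3, 6]: $\sigma = \sqrt{3.14} \approx 1.77$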
Standard Error (SE): The population standard deviation describes how spread out individual values are around the population mean. The standard error estimates the accuracy of a sample, i.e., how far a sample mean is likely to be from the population mean.
$SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ ($\sigma$: population standard deviation, n: number of data points in the sample) → report the result as mean ± SE
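The dispersion measures can be computed on the same example list with the `statistics` module, which separates the population formulas (divide by N) from the sample formulas (divide by n - 1). One assumption in this sketch: the population standard deviation is usually unknown in practice, so the sample standard deviation stands in for it in the standard error.

```python
import math
import statistics

data = [4, 3, 7, 2, 3, 6]

value_range = max(data) - min(data)

pop_var = statistics.pvariance(data)    # population variance: divides by N
sample_var = statistics.variance(data)  # sample variance: divides by n - 1

pop_sd = statistics.pstdev(data)        # square root of the population variance
sample_sd = statistics.stdev(data)      # square root of the sample variance

# Standard error of the mean. Assumption: the sample SD is used as an
# estimate of the (unknown) population SD.
se = sample_sd / math.sqrt(len(data))

print(f"range={value_range}, pop_var={pop_var:.2f}, sample_var={sample_var:.2f}")
print(f"pop_sd={pop_sd:.2f}, sample_sd={sample_sd:.2f}")
print(f"mean ± SE = {statistics.mean(data):.2f} ± {se:.2f}")
```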
8. Sweet AI
Concept: Description; Example
Quartiles: Sort all data points in ascending order, find the median, then find the median of each of the two halves:
• Q1: lower / first quartile
• Q2: median / middle / second quartile
• Q3: upper / third quartile
Example:
• [2, 3, 3, 4, 6, 7, 8, 12, 19, 19, 24, 26]
• [2, 3, 3 | 4, 6, 7 | 8, 12, 19 | 19, 24, 26] → Q1 = 3.5, Q2 = 7.5, Q3 = 19
Interquartile Range (IQR): IQR = Q3 - Q1
Outlier: any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR
Percentiles: 99 values that split the sample into 100 equal-size subsamples
Central Dispersion
[Figure: box plot marking Q1, Q3, the IQR, the two whiskers, and the fences at 1.5 x IQR]
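The quartile and outlier-fence rules can be sketched as follows, using the "median of each half" method described on this slide. Library routines such as `statistics.quantiles` or `numpy.percentile` interpolate differently and may return slightly different quartiles, so the helper logic below is written out by hand as an illustration.

```python
import statistics

data = sorted([2, 3, 3, 4, 6, 7, 8, 12, 19, 19, 24, 26])

# Split into lower and upper halves; for an odd count the middle value
# is left out of both halves.
half = len(data) // 2
lower, upper = data[:half], data[-half:]

q1 = statistics.median(lower)
q2 = statistics.median(data)
q3 = statistics.median(upper)
iqr = q3 - q1

# Fences at 1.5 x IQR beyond the quartiles; values outside them are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, lower_fence, upper_fence, outliers)
```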
9. Sweet AI
Association
Association: Description
Covariance: Measures the joint variability of two variables.
• Positive value → the two variables move in the same direction
• Negative value → the two variables move in opposite directions
• Closer to zero indicates a weaker relationship
• Farther from zero indicates a stronger relationship (the magnitude depends on the units of the variables)
Pearson Correlation Coefficient / Pearson's r: Measures the strength and direction of the linear relationship between two variables.
• -1 (strong negative relationship) ≤ r ≤ +1 (strong positive relationship)
• r = 0 → no linear correlation
Population: $\mathrm{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N}$
Sample: $\mathrm{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
Image from U of Wisconsin.
$r_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
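The covariance and Pearson formulas translate almost term by term into plain Python. The paired observations below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

# Hypothetical paired observations, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sum of cross-products and sums of squared deviations from the means
cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)

cov_population = cross / n    # divide by N
cov_sample = cross / (n - 1)  # divide by n - 1
r = cross / math.sqrt(ss_x * ss_y)

print(cov_population, cov_sample, round(r, 4))
```

Because the same divisor appears in the numerator and the denominator of r, the population and sample versions give the same correlation coefficient.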
12. Sweet AI
Hypothesis Testing
Hypotheses
Alternative (H1/Ha)
e.g., men's salary is higher than women's salary for the same job position
Null (H0)
e.g., men's salary is equal to women's salary for the same job position
Non-Directional
Directional
Statistical
13. Sweet AI
Hypothesis Testing
Hypothesis Test
• Parametric Test
  • Regression: Simple Linear Regression, Multiple Linear Regression, Logistic Regression
  • Comparison: t-test, ANOVA, MANOVA
  • Correlation: Pearson's r
• Non-Parametric Test: Spearman's r, Chi-square test, ANOSIM, Wilcoxon, Sign test
14. Sweet AI
Hypothesis Testing
1. State H0 & H1
   H0: Men are, on average, not getting a higher salary than women.
   Ha: Men are, on average, getting a higher salary than women.
2. Collect testing samples
   An equal proportion of men and women across a variety of industries within one country.
3. Select & execute a statistical test
   One-tailed t-test.
4. Infer the results (reject / fail to reject H0)
   An average difference of 20k with a p-value of 0.002, which is consistent with H1.
15. Terminology Description
Significance level ($\alpha$): A threshold used to decide whether a test statistic is statistically significant. Statistical significance means that a relationship between variables is unlikely to be caused by chance alone. $\alpha$ is the area in the tail(s) of the H0 distribution, and it is the complement of the confidence level:
$\alpha = 1 - (\text{confidence level} / 100)$ → common choices for $\alpha$: 0.01, 0.05, 0.1
P-value (probability value): Measures how plausible the observed result is under the null hypothesis, and so whether H0 should be rejected: P(sample statistic at least as extreme as the one observed | H0 true)
0 ≤ p-value ≤ 1
P-value ≥ $\alpha$: the results are not statistically significant; fail to reject H0 (the null will fly).
P-value < $\alpha$: the results are statistically significant; reject H0 (the null must go).
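The relationship between the confidence level, the significance level and the reject / fail-to-reject decision can be written as two small helpers. The function names are hypothetical; the p-value of 0.002 is the one quoted in the salary example earlier in the deck.

```python
def alpha_from_confidence(confidence_percent: float) -> float:
    """Convert a confidence level in percent (e.g. 95) to a significance level."""
    return 1 - confidence_percent / 100


def decide(p_value: float, alpha: float) -> str:
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


alpha = alpha_from_confidence(95)
print(round(alpha, 2), decide(0.002, alpha), decide(0.2, alpha))
```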
Sweet AI Basic Concepts
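H0 is true and rejected: Type I error (probability $\alpha$)
H0 is false and rejected: Correct
H0 is true and not rejected: Correct
H0 is false and not rejected: Type II error (probability $\beta$)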
https://www.abtasty.com
16. Sweet AI
Z-test
• Used to test if two groups of data are different from each other when the standard deviation of the population is known
• Normal Distribution Formula: $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
• To calculate the percentile of a data point, we standardize the Normal Distribution to a Standard Normal Distribution
• The Standard Normal Distribution has $\mu = 0$ and $\sigma = 1$
• How to determine x's percentile/probability, or how far from typical a result is:
1. Standardize the values of the normal distribution by calculating the z-score from the population $\mu$ and the population $\sigma$
  • for a single raw datum x: $Z = \frac{x - \mu}{\sigma}$
  • for n independent and identically distributed samples (X): $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
  • for a proportion: $Z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$ ($\hat{p}$: observed sample proportion, p: hypothesized population proportion, n: sample size)
2. Look up the z-score in a z-table to map it to the area under the normal distribution curve and obtain the p-value
3. Compare the p-value with $\alpha$: if p-value ≥ $\alpha$, fail to reject H0; otherwise reject H0
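The three z-score formulas and the table lookup can be sketched with the standard library, where `statistics.NormalDist` plays the role of the z-table. The helper names and the numbers in the example (population mean 100, population standard deviation 15, a sample of 36 with mean 106) are hypothetical, and a two-tailed test is assumed.

```python
import math
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # the Standard Normal Distribution


def z_single(x, mu, sigma):
    """z-score of a single raw datum."""
    return (x - mu) / sigma


def z_mean(sample_mean, mu, sigma, n):
    """z-score of the mean of n independent, identically distributed samples."""
    return (sample_mean - mu) / (sigma / math.sqrt(n))


def z_proportion(p_hat, p0, n):
    """z-score of an observed proportion against a hypothesized proportion p0."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


# Hypothetical example: does a sample mean of 106 differ from mu = 100?
z = z_mean(sample_mean=106, mu=100, sigma=15, n=36)

percentile = std_normal.cdf(z)                    # area to the left of z
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))   # area in both tails

alpha = 0.05
decision = "reject H0" if p_two_tailed < alpha else "fail to reject H0"
print(round(z, 2), round(percentile, 4), round(p_two_tailed, 4), decision)
```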
18. Sweet AI
Student t-test
• Used to test whether two groups of data differ from each other when the standard deviation of the population is unknown
• Assumptions:
  • Normal distribution
  • Similar variance in each group/sample
  • The same, small number of data points in each group/sample (about 20-30); for larger samples a z-test can be used
• H0: There is no difference between the groups
• H1: There is a difference between the groups
Types of t-test: Description; Formula; Degrees of freedom
One-sample t-test: Tests whether a population mean is equal to some value $\mu$. $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ ($\bar{x}$: sample mean, $\mu$: population mean, s: sample standard deviation, n: sample size); df = n - 1
Dependent / Paired-samples t-test: Tests whether two population means are equal by sampling the same population twice. $t = \frac{\sum d}{\sqrt{\frac{n \sum d^2 - (\sum d)^2}{n - 1}}}$ (d: difference between the two paired measurements, n: number of pairs); df = n - 1
Independent two-sample t-test / unpaired-samples t-test: Tests whether two population means are equal, using two independent samples that may differ in size and variance. $t = \frac{\text{signal}}{\text{noise}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ ($s_1^2$, $s_2^2$: sample variances); df = n1 + n2 - 2 (with clearly unequal variances, Welch's approximation of df is used instead)
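The three t-value formulas can be checked against SciPy's implementations. This is a sketch on hypothetical before/after measurements; the hypothesized mean `mu0` is an assumed value, and the independent test is run with `equal_var=False` because the formula with separate variances corresponds to Welch's version of the test.

```python
import math
import statistics
from scipy import stats

# Hypothetical measurements, for illustration only
before = [72.0, 75.0, 68.0, 80.0, 77.0, 74.0]
after = [70.0, 74.0, 65.0, 79.0, 73.0, 72.0]
n = len(before)

# One-sample t-test: is the mean of `before` equal to mu0 (assumed value)?
mu0 = 70.0
t_one = (statistics.mean(before) - mu0) / (statistics.stdev(before) / math.sqrt(n))
t_one_scipy, p_one = stats.ttest_1samp(before, popmean=mu0)

# Paired t-test: based on the differences d between paired measurements
d = [b - a for b, a in zip(before, after)]
sum_d = sum(d)
sum_d2 = sum(di ** 2 for di in d)
t_paired = sum_d / math.sqrt((n * sum_d2 - sum_d ** 2) / (n - 1))
t_paired_scipy, p_paired = stats.ttest_rel(before, after)

# Independent two-sample t-test with separate sample variances (Welch form)
noise = math.sqrt(
    statistics.variance(before) / len(before)
    + statistics.variance(after) / len(after)
)
t_ind = (statistics.mean(before) - statistics.mean(after)) / noise
t_ind_scipy, p_ind = stats.ttest_ind(before, after, equal_var=False)

print(round(t_one, 3), round(t_paired, 3), round(t_ind, 3))
```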
19. Sweet AI
Student t-test
stanford.edu
|t-value| < critical value → Do not reject H0
|t-value| > critical value → Reject H0
Degrees of freedom (df)
1. Calculate the t-value and df
2. Decide between a one-tailed and a two-tailed test, and choose the level of confidence
3. Look up the critical value in a t-table and decide whether to reject or fail to reject H0
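The table lookup in step 3 can be replaced by the inverse CDF of the t-distribution, `scipy.stats.t.ppf`. The t-value, the degrees of freedom and alpha below are hypothetical inputs chosen for illustration.

```python
from scipy import stats

alpha = 0.05   # significance level (assumed)
df = 10        # degrees of freedom (hypothetical)
t_value = 2.5  # t-value computed in step 1 (hypothetical)

# One-tailed test: all of alpha sits in one tail
crit_one_tailed = stats.t.ppf(1 - alpha, df)

# Two-tailed test: alpha is split between the two tails
crit_two_tailed = stats.t.ppf(1 - alpha / 2, df)

reject_one_tailed = t_value > crit_one_tailed
reject_two_tailed = abs(t_value) > crit_two_tailed

print(round(crit_one_tailed, 3), round(crit_two_tailed, 3))
print(reject_one_tailed, reject_two_tailed)
```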
20. Sweet AI
t-test vs. z-test
Start: is the population $\sigma$ known?
• $\sigma$ known, sample size < 30: is the population highly skewed? No → t-test; Yes → sign test
• $\sigma$ known, sample size >= 30: is the population highly skewed? No → z-test; Yes → alternative methods
• $\sigma$ not known: is the population highly skewed? No → t-test; Yes → alternative methods
• t-tests and z-tests are used to determine whether a difference observed in sample data is statistically significant.
21. Sweet AI
Analysis Of Variance (ANOVA)
• ANOVA determines the effect of one or more categorical independent variables on one numerical dependent variable.
ANOVA types:
One-way: 1 categorical independent variable and a single dependent variable
Two-way: 2 categorical independent variables and a single dependent variable
N-way: Multiple categorical independent variables
22. Sweet AI
Analysis Of Variance (ANOVA)
1. Calculate the variance between groups and within groups
2. Calculate the degrees of freedom
3. Compute the F-value: $F = \frac{\text{variance between groups}}{\text{variance within groups}} = \frac{SS_{BG} / df_1}{SS_{WG} / df_2}$, with $df_1 = m - 1$ and $df_2 = m(n - 1)$ (n: number of samples in each group, m: number of groups)
4. Find the critical value (F-score) from an F-distribution table using df1, df2 and a selected alpha: http://www.socr.ucla.edu/Applets.dir/F_Table.html
5. Compare the F-value with the critical value: if F-value < F-critical, fail to reject H0; otherwise H0 is rejected
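The five steps can be followed by hand for a one-way ANOVA and compared with `scipy.stats.f_oneway`. The three groups of scores are hypothetical, all groups are assumed to have the same size n, and `scipy.stats.f.ppf` stands in for the F-distribution table.

```python
from scipy import stats

# Hypothetical scores for m = 3 groups with n = 5 observations each
groups = [
    [82.0, 85.0, 88.0, 75.0, 90.0],
    [70.0, 72.0, 68.0, 75.0, 71.0],
    [91.0, 89.0, 94.0, 87.0, 92.0],
]
m = len(groups)
n = len(groups[0])

# Step 1: sums of squares between groups and within groups
grand_mean = sum(sum(g) for g in groups) / (m * n)
group_means = [sum(g) / n for g in groups]
ss_between = sum(n * (gm - grand_mean) ** 2 for gm in group_means)
ss_within = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)

# Step 2: degrees of freedom
df1 = m - 1
df2 = m * (n - 1)

# Step 3: F-value as the ratio of the two mean squares
f_value = (ss_between / df1) / (ss_within / df2)

# Step 4: critical value for the chosen alpha (replaces the table lookup)
alpha = 0.05
f_critical = stats.f.ppf(1 - alpha, df1, df2)

# Step 5: decision, cross-checked against SciPy's one-way ANOVA
reject_h0 = f_value > f_critical
f_scipy, p_value = stats.f_oneway(*groups)

print(round(f_value, 2), round(f_critical, 2), reject_h0)
```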
26. Sweet AI
Python Library
Type of Test: Scipy Code
Determine whether data follow a Gaussian distribution:
from scipy.stats import shapiro  # normaltest is an alternative
stat, p = shapiro(data)  # p > 0.05: no evidence against a Gaussian distribution
Determine the linear relationship of two samples:
from scipy.stats import pearsonr
stat, p = pearsonr(data1, data2)  # p > 0.05: no significant linear correlation detected
Determine the monotonic relationship of two samples:
from scipy.stats import spearmanr  # kendalltau is an alternative
stat, p = spearmanr(data1, data2)  # p > 0.05: no significant monotonic relationship detected
Determine the relationship of two categorical variables:
from scipy.stats import chi2_contingency
stat, p, dof, expected = chi2_contingency(table)  # p > 0.05: no significant association detected
Determine a z-score or percentile:
from scipy import stats
stats.norm.cdf(z)  # z-score → cumulative probability (percentile)
stats.norm.ppf(p)  # cumulative probability → z-score
Determine if the means of two independent normally distributed samples are significantly different (Student t-test):
from scipy.stats import ttest_ind
stat, p = ttest_ind(data1, data2)  # p > 0.05: no significant difference between the means
Determine if the means of two paired normally distributed samples are significantly different (Student t-test):
from scipy.stats import ttest_rel
stat, p = ttest_rel(data1, data2)  # p > 0.05: no significant difference between the means
Determine if the means of two or more independent normally distributed samples are significantly different (ANOVA):
from scipy.stats import f_oneway
stat, p = f_oneway(data1, data2, data3)  # p > 0.05: no significant difference between the means
Determine if the distributions of two independent samples are equal (Mann-Whitney U test):
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu(data1, data2)  # p > 0.05: no significant difference between the distributions
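A few of the calls above can be run end to end on synthetic data. The arrays are generated with a seeded NumPy random generator purely for illustration, and `data2` is deliberately built to correlate with `data1`; the names and the 0.05 threshold follow the table.

```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr, shapiro, ttest_ind

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Synthetic data: data2 is constructed to be linearly related to data1
data1 = rng.normal(loc=50, scale=5, size=40)
data2 = 0.8 * data1 + rng.normal(loc=0, scale=2, size=40)

tests = {
    "shapiro (data1)": lambda: shapiro(data1),
    "pearsonr": lambda: pearsonr(data1, data2),
    "ttest_ind": lambda: ttest_ind(data1, data2),
    "mannwhitneyu": lambda: mannwhitneyu(data1, data2),
}

alpha = 0.05
p_values = {}
for name, run in tests.items():
    stat, p = run()  # each result unpacks into (statistic, p-value)
    p_values[name] = float(p)
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: statistic={float(stat):.3f}, p={float(p):.4f} → {verdict}")
```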