This document provides an overview of core statistical concepts: variables, probability, distributions, hypothesis testing, and Python libraries for statistical analysis. It defines the types of variables (continuous, discrete, categorical) with examples, and explains population and sample, central tendency, dispersion, probability, distributions, and hypothesis tests such as the t-test, z-test, and ANOVA. Finally, it covers commonly used Python libraries, notably SciPy, for conducting statistical tests and analysis.
3. Sweet AI
Probability
Events may be independent or dependent.
• Conditional probability: P(A|B) = P(A ∩ B) / P(B)
• Multiplication rule (intersection):
  – Dependent events: P(A ∩ B) = P(A) · P(B|A) or P(B) · P(A|B)
  – Independent events: P(A ∩ B) = P(A) · P(B)
• Addition rule (union): P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• Complement rule: P(A′) = 1 − P(A)
• Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B)
Permutation (order matters); n: size of the set, r: number of spots
• With repetition: n^r, e.g., AB, BA, AA, BB
• Without repetition: n! / (n − r)!, e.g., AB, BA
Combination (order doesn't matter)
• With repetition: (n + r − 1)! / (r! (n − 1)!), e.g., AA, BA, BB
• Without repetition: n! / (r! (n − r)!), e.g., AB
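The probability rules and counting formulas above can be verified with a short Python sketch; the event probabilities below are hypothetical, chosen only for illustration:

```python
import math

# Multiplication rule for independent events: P(A and B) = P(A) * P(B)
p_a, p_b = 0.5, 0.4              # hypothetical event probabilities
p_a_and_b = p_a * p_b            # independence assumed

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = p_a + p_b - p_a_and_b

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_b_given_a = p_a_and_b / p_a
p_a_given_b = p_b_given_a * p_a / p_b

# Counting: n = 3 items taken r = 2 at a time
n, r = 3, 2
print(n ** r)                    # permutations with repetition: 9
print(math.perm(n, r))           # permutations without repetition: n!/(n-r)! = 6
print(math.comb(n + r - 1, r))   # combinations with repetition: (n+r-1)!/(r!(n-1)!) = 6
print(math.comb(n, r))           # combinations without repetition: n!/(r!(n-r)!) = 3
```

`math.perm` and `math.comb` (Python 3.8+) compute the falling factorial and binomial coefficient directly, so the factorial formulas never need to be expanded by hand.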
4. Sweet AI
Basic Concepts
• Population: the entire set that you want to draw conclusions about, e.g., all school students in the USA.
• Sample: a smaller set randomly drawn from the population, e.g., 700 volunteer students from different schools in the USA.
• Outliers / Noise / Anomalies: data points at an abnormal distance from the other observations; they can skew a model.
• Variate: univariate → one variable; bivariate → two variables; multivariate → more than two variables.
Sampling methods:
• Probability sampling: simple random, systematic, stratified, cluster.
• Non-probability sampling: convenience, snowball, quota, purposive.
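Three of the probability-sampling methods above can be sketched in a few lines of Python; the population of 1000 student IDs and its three school-level strata are hypothetical, invented only for illustration:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

population = list(range(1000))  # hypothetical population of 1000 student IDs

# Simple random sampling: every member has an equal chance of selection
simple = random.sample(population, 50)

# Systematic sampling: pick every k-th member after a random start
k = len(population) // 50
start = random.randrange(k)
systematic = population[start::k][:50]

# Stratified sampling: sample proportionally from each (hypothetical) stratum
strata = {"elementary": population[:500], "middle": population[500:800], "high": population[800:]}
stratified = []
for members in strata.values():
    size = round(50 * len(members) / len(population))  # proportional allocation
    stratified.extend(random.sample(members, size))

print(len(simple), len(systematic), len(stratified))  # 50 50 50
```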
5. Sweet AI
Statistical Measures
• Central tendency: mean, median, mode
• Central dispersion: range, variance, standard deviation, IQR
• Association: covariance, correlation
6. Sweet AI
Basic Measurement Concepts
• Mean / Average (μ, x̄): the total of the numbers divided by how many numbers there are; sensitive to outliers. [4, 3, 7, 2, 3, 6]: (4 + 3 + 7 + 2 + 3 + 6) / 6 = 25 / 6 ≈ 4.17
• Median (Med, M): sort the numbers and take the middle one (or the average of the two middle ones). [4, 3, 7, 2, 3, 6] → [2, 3, 3, 4, 6, 7] → (3 + 4) / 2 = 3.5
• Mode: the most frequently occurring number. [4, 3, 7, 2, 3, 6] → 3
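Using the slide's example data, the three measures of central tendency can be computed with Python's standard statistics module:

```python
import statistics

data = [4, 3, 7, 2, 3, 6]  # example data from the slide

mean = statistics.mean(data)      # 25 / 6 ≈ 4.17
median = statistics.median(data)  # sorted: [2, 3, 3, 4, 6, 7] -> (3 + 4) / 2 = 3.5
mode = statistics.mode(data)      # most frequent value: 3

print(mean, median, mode)
```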
7. Sweet AI
Central Dispersion
• Range: the difference between the largest and smallest number. [4, 3, 7, 2, 3, 6]: 7 − 2 = 5
• Variance (σ²): shows how spread out the data points are; measures the width of the distribution around the mean.
  Population: σ² = Σ(xᵢ − μ)² / N
  Sample: s² = Σ(xᵢ − x̄)² / (n − 1)
  [4, 3, 7, 2, 3, 6]: μ = 25/6 ≈ 4.17 → σ² = Σ(xᵢ − μ)² / 6 ≈ 18.83 / 6 ≈ 3.14
• Standard deviation (σ): how spread out the data is around the mean; used to identify outliers — data points more than two or three standard deviations from the mean are often considered unusual. σ = √σ². Example: σ = √3.14 ≈ 1.77
• Standard error (SE): the population standard deviation estimates how spread out individual values are around the population mean; the standard error instead estimates the accuracy of a sample, i.e., how far a sample mean is likely to be from the population mean. SE(x̄) = σ / √n (σ: population standard deviation, n: number of data points in the sample) → report the result as mean ± SE.
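The dispersion formulas can be checked on the same example data, computing population variance, standard deviation, and standard error by hand and cross-checking against the statistics module:

```python
import math
import statistics

data = [4, 3, 7, 2, 3, 6]
n = len(data)
mu = statistics.mean(data)  # 25 / 6 ≈ 4.17

# Population variance: sigma^2 = sum((x - mu)^2) / N
pop_var = sum((x - mu) ** 2 for x in data) / n
# Sample variance divides by n - 1 instead (Bessel's correction)
sample_var = sum((x - mu) ** 2 for x in data) / (n - 1)

pop_sd = math.sqrt(pop_var)   # sigma = sqrt(sigma^2)
se = pop_sd / math.sqrt(n)    # standard error of the mean: sigma / sqrt(n)

# The statistics module agrees with the hand computation:
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(sample_var, statistics.variance(data))

print(round(pop_var, 2), round(pop_sd, 2), round(se, 2))  # 3.14 1.77 0.72
```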
8. Sweet AI
Central Dispersion
• Quartiles: sort all data points ascending, find the median, then find the medians of the lower and upper halves:
  – Q1: lower / first quartile
  – Q2: median / middle / second quartile
  – Q3: upper / third quartile
  [2, 3, 3, 4, 6, 7, 8, 12, 19, 19, 24, 26] → [2, 3, 3 | 4, 6, 7 | 8, 12, 19 | 19, 24, 26] → Q1 = 3.5, Q2 = 7.5, Q3 = 19
• Interquartile range (IQR): IQR = Q3 − Q1
  Lower outlier fence = Q1 − 1.5 × IQR
  Upper outlier fence = Q3 + 1.5 × IQR
• Percentiles: 99 values that split the sorted sample into 100 equal-size subsamples.
(Boxplot figure: box spanning the IQR from Q1 to Q3, with whiskers extending to the fences at 1.5 × IQR.)
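The quartile example can be reproduced with the median-of-halves method the slide uses; note that other quartile conventions (such as those in `statistics.quantiles`) can give slightly different values:

```python
from statistics import median

data = sorted([2, 3, 3, 4, 6, 7, 8, 12, 19, 19, 24, 26])
n = len(data)

# Median-of-halves method, matching the slide's split
q2 = median(data)
lower, upper = data[:n // 2], data[(n + 1) // 2:]
q1, q3 = median(lower), median(upper)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)   # 3.5 7.5 19.0 15.5
print(outliers)          # [] -> no outliers in this data
```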
9. Sweet AI
Association
• Covariance: measures the relationship between two variables.
  Positive value → the two variables move in the same direction.
  Negative value → the two variables move in opposite directions.
  Values closer to zero indicate a weaker relationship; values farther from zero indicate a stronger one.
  Population: Cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / N
  Sample: Cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (N − 1)
• Pearson correlation coefficient / Pearson's r: measures the strength and direction of a linear relationship between two variables.
  −1 (strong negative relationship) ≤ r ≤ +1 (strong positive relationship); r = 0 → no linear correlation.
  ρ(X, Y) = cov(X, Y) / (σX σY) = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²)
Image from U of Wisconsin.
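The covariance and Pearson's r formulas can be computed by hand on a small set of hypothetical paired observations:

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical paired observations
y = [2, 4, 5, 4, 6]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Sample covariance: sum((xi - mean_x)(yi - mean_y)) / (n - 1)
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Pearson's r: cov(X, Y) / (sd_X * sd_Y), using sample standard deviations
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
r = cov / (sx * sy)

print(round(cov, 2), round(r, 2))  # 2.0 0.85 -> positive, fairly strong linear relationship
```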
12. Sweet AI
Hypothesis Testing
Hypotheses:
• Null (H0), e.g., a male salary is equal to a female salary for the same job position.
• Alternative (H1/Ha), e.g., a male salary is higher than a female salary for the same job position.
A statistical hypothesis may be non-directional (a difference exists) or directional (the difference has a specific sign).
13. Sweet AI
Hypothesis Testing
Hypothesis tests:
• Parametric tests
  – Regression: simple linear regression, multiple linear regression, logistic regression
  – Comparison: t-test, ANOVA, MANOVA
  – Correlation: Pearson's r
• Non-parametric tests: Spearman's r, chi-square test, ANOSIM, Wilcoxon, sign test
14. Sweet AI
Hypothesis Testing
1. State H0 and H1. Example — H0: men are, on average, not getting a higher salary than women; Ha: men are, on average, getting a higher salary than women.
2. Collect test samples. Example: equal proportions of men and women across a variety of industries within the scope of a country.
3. Select and execute a statistical test. Example: one-tailed t-test.
4. Infer the results (reject / fail to reject H0). Example: an average difference of 20k with a p-value of 0.002, which is consistent with H1.
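The four steps above can be sketched with SciPy's independent t-test; the salary samples here are hypothetical, made up purely to illustrate the workflow, and the `alternative="greater"` argument (available in recent SciPy versions) selects the one-tailed test:

```python
from scipy import stats

# Hypothetical salary samples (in $1000s); real data would come from a survey
men = [72, 69, 75, 80, 66, 74, 71, 78, 70, 77]
women = [60, 64, 58, 62, 65, 59, 61, 63, 57, 66]

# One-tailed independent t-test: H1 says the first group's mean is greater
stat, p = stats.ttest_ind(men, women, alternative="greater")

alpha = 0.05
if p < alpha:
    print(f"p = {p:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p:.4f} >= {alpha}: fail to reject H0")
```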
15. Sweet AI
Basic Concepts
• Significance level / confidence level (α): a threshold for deciding whether a test statistic is statistically significant. Statistical significance means it is highly likely that a relationship between variables is not caused by chance. α corresponds to the area inside the tail(s) of the H0 distribution.
  α = 1 − (confidence level / 100) → common choices for α: 0.01, 0.05, 0.1
• P-value (probability value): determines the plausibility of the null hypothesis, i.e., whether H0 should be rejected: P(sample statistic | H0 true), with 0 < p-value < 1.
  – p-value ≥ α: results are not statistically significant; H0 is not rejected — the null must fly!
  – p-value < α: results are statistically significant; H0 is rejected — the null must go!
Decision errors:
H0 is ...      | True              | False
Rejected       | Type I error (α)  | Correct
Not rejected   | Correct           | Type II error (β)
Source: https://www.abtasty.com
16. Sweet AI
Z-test
• Used to test whether a sample differs from the population when the population standard deviation σ is known.
• Normal distribution: to calculate the percentile of a data point we standardize a normal distribution to the standard normal distribution, which has μ = 0 and σ = 1.
• How to determine x's percentile/probability, i.e., how far from typical a result is:
1. Standardize the values by computing a z-score from the population μ and population σ:
   – for a single raw datum x: z = (x − μ) / σ
   – for the mean of n independent, identically distributed samples: Z = (x̄ − μ) / (σ / √n)
   – for a proportion: Z = (p̂ − p) / √(p(1 − p) / n), where p̂ is the observed sample proportion, p the hypothesized population proportion, and n the sample size
2. Look up the z-score in a z-table to map it to the area under the normal curve and obtain the p-value.
3. Compare the p-value with α: if p-value ≥ α, fail to reject H0; otherwise reject H0.
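A one-sample z-test following the three steps above, with hypothetical population parameters and SciPy's `norm.cdf` standing in for the z-table:

```python
import math
from scipy.stats import norm

# One-sample z-test: population mu and sigma are assumed known (hypothetical values)
mu, sigma = 100, 15       # e.g., a known test-score distribution
sample_mean, n = 106, 36  # hypothetical sample

# Step 1: z-score for a sample mean: (x̄ - μ) / (σ / √n)
z = (sample_mean - mu) / (sigma / math.sqrt(n))
# Step 2: right-tail p-value, i.e., area under the curve beyond z
p = 1 - norm.cdf(z)
# Step 3: compare with alpha
print(round(z, 2), round(p, 4))  # 2.4 0.0082 -> p < 0.05, reject H0
```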
18. Sweet AI
Student t-test
• Used to test whether two groups of data are different from each other when the population standard deviation is not known.
• Assumptions:
  – Normal distribution
  – Similar variance for each group/sample
  – Small samples of about the same size (roughly 20–30 data points per group); for larger samples the z-test is preferred
• H0: there is no difference between the groups.
• H1: there is a difference between the groups.
Types of t-test:
• One-sample t-test: tests whether a population mean is equal to some value μ.
  t = (x̄ − μ) / (s / √n), where x̄: sample mean, μ: population mean, s: sample standard deviation, n: sample size. df = n − 1
• Dependent / paired-samples t-test: tests whether two population means are equal by sampling the same population twice (d: per-pair differences).
  t = Σd / √((n Σd² − (Σd)²) / (n − 1)). df = n − 1
• Independent two-sample t-test / unpaired-samples t-test: tests whether two population means are equal, using two independent samples of possibly different sizes with unequal variances.
  t = signal / noise = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂). df = n₁ + n₂ − 2
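The one-sample formula can be computed by hand and cross-checked against SciPy; the hypothesized mean μ = 3 is arbitrary, chosen only for illustration:

```python
import math
import statistics
from scipy import stats

sample = [4, 3, 7, 2, 3, 6]  # example data from earlier slides
mu0 = 3                      # hypothesized population mean (arbitrary)

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)

# One-sample t: t = (x̄ - μ) / (s / √n), df = n - 1
t = (xbar - mu0) / (s / math.sqrt(n))
stat, p = stats.ttest_1samp(sample, mu0)

assert math.isclose(t, stat)  # the hand computation matches SciPy
print(round(t, 2), round(p, 3))
```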
19. Sweet AI
Student t-test
1. Calculate the t-value and degrees of freedom (df).
2. Decide on a one- or two-tailed test and the confidence level.
3. Look up the critical value in a t-table and determine whether to reject or fail to reject H0:
   – |t-value| < critical value → do not reject H0
   – |t-value| > critical value → reject H0
(Source: stanford.edu)
20. Sweet AI
t-test vs. z-test
• The t-test and z-test are used to determine and compare the significance of a set of data.
Decision flow:
• σ known, sample size < 30: if the population is highly skewed → sign test; otherwise → t-test
• σ known, sample size ≥ 30: if the population is highly skewed → alternative methods; otherwise → z-test
• σ not known: if the population is highly skewed → alternative methods; otherwise → t-test
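The decision flow above can be written as a small helper function — a simplified heuristic for illustration, not a substitute for judgment about the data:

```python
def choose_test(sigma_known: bool, n: int, highly_skewed: bool) -> str:
    """Pick a test following the t-test vs. z-test decision flow above."""
    if not sigma_known:
        return "alternative methods" if highly_skewed else "t-test"
    if n < 30:
        return "sign test" if highly_skewed else "t-test"
    return "alternative methods" if highly_skewed else "z-test"

print(choose_test(sigma_known=True, n=50, highly_skewed=False))   # z-test
print(choose_test(sigma_known=False, n=20, highly_skewed=False))  # t-test
```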
21. Sweet AI
Analysis of Variance (ANOVA)
• ANOVA determines the effects of one or more categorical independent variables on one numerical dependent variable.
  – One-way: 1 independent categorical variable on a single dependent variable
  – Two-way: 2 independent categorical variables on a single dependent variable
  – N-way: multiple independent categorical variables on a single dependent variable
22. Sweet AI
Analysis of Variance (ANOVA)
1. Calculate the variance between groups and within groups.
2. Calculate the degrees of freedom: df1 = m − 1, df2 = m(n − 1), where n is the number of samples in each group and m the number of groups.
3. Compute the F-value: F = (variance between groups) / (variance within groups) = (SSBG / df1) / (SSWG / df2).
4. Find the critical value (F-score) in an F-distribution table using df1, df2, and the selected alpha: http://www.socr.ucla.edu/Applets.dir/F_Table.html
5. Compare the F-value with the critical value: if F-value < F-critical, fail to reject H0; otherwise H0 is rejected.
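The ANOVA steps can be checked by computing F by hand on three small hypothetical groups and comparing against `scipy.stats.f_oneway`:

```python
from scipy import stats

# Three hypothetical groups with equal size n
g1 = [5, 6, 7, 6, 5]
g2 = [8, 9, 7, 8, 9]
g3 = [4, 3, 5, 4, 4]
groups = [g1, g2, g3]

m, n = len(groups), len(g1)                  # m groups, n samples per group
grand = sum(sum(g) for g in groups) / (m * n)

# Step 1: between-group and within-group sums of squares
ssbg = sum(n * (sum(g) / n - grand) ** 2 for g in groups)
sswg = sum((x - sum(g) / n) ** 2 for g in groups for x in g)
# Step 2: degrees of freedom
df1, df2 = m - 1, m * (n - 1)
# Step 3: F-value
f = (ssbg / df1) / (sswg / df2)

stat, p = stats.f_oneway(g1, g2, g3)         # SciPy agrees with the hand computation
print(round(f, 2), round(p, 4))
```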
26. Sweet AI
Python Library
Type of test → SciPy code
• Determine whether data follow a Gaussian distribution:
  from scipy.stats import shapiro  # or normaltest
  stat, p = shapiro(data)  # p > 0.05 → data likely have a Gaussian distribution
• Determine the linear relationship of two samples:
  from scipy.stats import pearsonr
  stat, p = pearsonr(data1, data2)  # p > 0.05 → more likely they are independent
• Determine the monotonic relationship of two samples:
  from scipy.stats import spearmanr  # or kendalltau
  stat, p = spearmanr(data1, data2)  # p > 0.05 → more likely they are independent
• Determine the relationship of two categorical variables:
  from scipy.stats import chi2_contingency
  stat, p, dof, expected = chi2_contingency(table)  # p > 0.05 → more likely they are independent
• Determine a z-score or percentile:
  from scipy import stats
  stats.norm.cdf(z)  # z-score → percentile
  stats.norm.ppf(p)  # percentile → z-score
• Determine whether the means of two independent, normally distributed samples differ significantly (Student's t-test):
  from scipy.stats import ttest_ind
  stat, p = ttest_ind(data1, data2)  # p > 0.05 → more likely the same distribution
• Determine whether the means of two paired, normally distributed samples differ significantly (paired t-test):
  from scipy.stats import ttest_rel
  stat, p = ttest_rel(data1, data2)  # p > 0.05 → more likely the same distribution
• Determine whether the means of two or more independent, normally distributed samples differ significantly (ANOVA):
  from scipy.stats import f_oneway
  stat, p = f_oneway(data1, data2, data3)  # p > 0.05 → more likely the same distribution
• Determine whether the distributions of two independent samples are equal (Mann–Whitney U test):
  from scipy.stats import mannwhitneyu
  stat, p = mannwhitneyu(data1, data2)  # p > 0.05 → more likely the same distribution