The Chi Square Test of Association is used to determine whether there is a statistically significant association between two categorical variables. In a business setting, it shows whether a relationship exists between any two business parameters that are of categorical data type.
What is the Chi Square Test of Association and How Can it be Used for Analysis?
1. Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics
3. Basic Terminologies
• P-value: In the Chi square test of independence, the p-value is the probability of seeing an association at least as strong as the one in the data if the two dimensions (i.e. categorical variables) were in fact unrelated. A small p-value therefore indicates a statistically significant association
• The p-value can be checked against different thresholds, depending on the confidence level desired, and the inference is made accordingly
• For instance, for confidence level = 95% (error = 5%), the p-value is checked against the threshold of 0.05
• If p-value < 0.05, the association is significant; otherwise it is not significant
• Similarly, for confidence level = 98% (error = 2%), the p-value is checked against the threshold of 0.02
• If p-value < 0.02, the association is significant; otherwise it is not significant, and so on
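The threshold rule above can be written as a small helper. This is only a sketch: the function name `is_significant` and its parameters are illustrative and do not come from any particular tool.

```python
def is_significant(p_value: float, confidence_level: float = 0.95) -> bool:
    """Return True when p_value is below the threshold implied by confidence_level."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be strictly between 0 and 1")
    # Threshold = 1 - confidence level, e.g. 0.95 -> 0.05.
    # Rounding removes floating-point noise such as 0.050000000000000044.
    threshold = round(1.0 - confidence_level, 12)
    return p_value < threshold


p = 0.041
for level in (0.95, 0.98):
    verdict = "significant" if is_significant(p, level) else "not significant"
    print(f"confidence level {level:.0%}: association is {verdict}")
```

The same p-value can lead to different conclusions at different confidence levels, which is why the threshold should be chosen before looking at the result.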
4. Introduction
• It is used to determine whether there is a statistically significant association between two categorical variables
• Thus it finds out whether a relationship exists between any two business parameters that are of categorical data type
• Examples:
• We could use a chi-square test for independence to determine whether gender is related to voting preference
• We could determine whether region has any influence on the product category purchased
5. Example : Input
Let's conduct the Chi square test of independence on the following two variables: Gender and Product Category.

Gender    Product Category
M         Footwear
F         Clothing
F         Clothing
F         Cosmetics
M         Accessories
M         Footwear

Gender: dimension containing the gender of a purchaser
Product Category: dimension containing the product category purchased
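Before the test can be run, the raw rows are counted into a contingency table. A minimal sketch using only the six sample rows shown above; the helper name `contingency_table` is an assumption made for illustration.

```python
from collections import Counter

# The six sample rows from the input slide: (gender, product category).
rows = [
    ("M", "Footwear"),
    ("F", "Clothing"),
    ("F", "Clothing"),
    ("F", "Cosmetics"),
    ("M", "Accessories"),
    ("M", "Footwear"),
]


def contingency_table(pairs):
    """Count every (row value, column value) combination; absent combinations count as 0."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


genders, categories, table = contingency_table(rows)
print(categories)
for gender, counts in zip(genders, table):
    print(gender, counts)
```

Six rows are far too few for a meaningful test; the sample only illustrates the shape of the input.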
6. Example : Output
Chi square statistic: 19.3
P-value: 0.041
At 95% confidence level (5% chance of error):
As the p-value of 0.041 is less than 0.05, there is a statistically significant association between gender and product category purchased, i.e. gender has an influence on the type of product being purchased
At 98% confidence level (2% chance of error):
As the p-value of 0.041 is greater than 0.02, the association between gender and product category purchased is not statistically significant, i.e. at this stricter threshold the data do not show that gender influences the type of product being purchased
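To show where a statistic and p-value of this kind come from, here is a sketch of the standard Pearson chi-square calculation in plain Python. The counts are hypothetical and are not the data behind the 19.3 and 0.041 reported above; the function names are illustrative, and no continuity correction is applied.

```python
import math


def chi_square_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{k=1}^{(df-1)/2} (x/2)^(k - 1/2) / Gamma(k + 1/2)
    total = sum(
        half ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def chi_square_test(observed):
    """Return (statistic, p_value, df) for a table of observed counts (list of rows).

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, chi_square_sf(statistic, df), df


# Hypothetical counts: rows = gender (F, M), columns = three product categories.
observed = [
    [30, 20, 10],
    [15, 25, 20],
]
statistic, p_value, df = chi_square_test(observed)
print(f"chi-square = {statistic:.2f}, df = {df}, p-value = {p_value:.4f}")
```

In practice a statistics library would normally be used instead of a hand-written tail probability; the closed forms here simply keep the sketch dependency-free.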
10. Sample output 3 : Contingency table
• The contingency table simply represents the count of each combination of dimension values, which is also represented by circle size in the plot
11. Sample output 4 : Outliers
Outliers are data values that differ greatly from the majority of a set of data
12. Limitations
• Can be applied to only two categorical variables (two dimensions) at a time
• Number of data points should be at least 50
• Frequency count for each dimension value combination, i.e. each cell value of the contingency table, should be at least 5 (the usual rule of thumb applies this minimum to the expected count of each cell)
• It tells the presence or absence of an association between two parameters but doesn't measure the strength of the association like correlation does
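The sample-size limitations above can be checked before running the test. A rough sketch, assuming the contingency table is a list of rows of counts; the function name and default thresholds mirror the limits listed above, and the per-cell minimum is applied to expected counts, which is the usual form of the rule.

```python
def check_assumptions(observed, min_total=50, min_expected=5.0):
    """Return a list of warnings for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    if total == 0:
        return ["the table contains no data points"]
    warnings = []
    if total < min_total:
        warnings.append(f"only {total} data points; at least {min_total} are recommended")
    # Expected count of a cell under independence = row total * column total / grand total.
    low_cells = [
        (i, j)
        for i, r in enumerate(row_totals)
        for j, c in enumerate(col_totals)
        if r * c / total < min_expected
    ]
    if low_cells:
        warnings.append(f"{len(low_cells)} cell(s) have an expected count below {min_expected}")
    return warnings


print(check_assumptions([[0, 2, 1, 0], [1, 0, 0, 2]]))   # tiny table: both checks warn
print(check_assumptions([[30, 20, 10], [15, 25, 20]]))   # larger hypothetical table: no warnings
```

When a check fails, common remedies are to collect more data or to merge sparse categories before testing.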
13. General applications
MARKETING / MARKET RESEARCH
• Helps determine if certain types of products sell better in certain geographic locations than others
• Verify if gender has an influence on purchase decisions
• Identify if there is an association between income level of consumers and their choice of brand
• Determine if customer age has an influence on product/service subscription (assuming age is converted into age buckets such as 18 to 25, 26 to 35 etc.)
• Thus, it helps in finding a relation between any demographic characteristics of consumers/research responders such ...
FINANCE
• Identify if demographic factors impact banking channel/product/service preference, or the selection of a type of insurance term plan, etc.
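The age example above relies on turning a numeric age into a categorical bucket before the test is applied. A minimal sketch of that conversion; only the "18 to 25" and "26 to 35" buckets come from the text, and the remaining edges and labels are assumptions chosen for illustration.

```python
import bisect

# Inclusive upper bound of each bucket. Only 25 and 35 are taken from the text;
# the other edges are hypothetical.
UPPER_BOUNDS = [17, 25, 35, 45, 60]
LABELS = ["under 18", "18-25", "26-35", "36-45", "46-60", "over 60"]


def age_bucket(age: int) -> str:
    """Map a numeric age to a categorical bucket label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_left returns the index of the first upper bound that is >= age.
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, age)]


for age in (17, 18, 25, 26, 61):
    print(age, "->", age_bucket(age))
```

Once every customer has a bucket label, the bucket column can be treated like any other categorical dimension.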
14. Use case 1
Business problem:
• A retail store marketing manager wants to know if there is a significant association between the geography of a customer and his/her brand preference
Business benefit:
• Once the test is completed, a p-value is generated which indicates whether there is a significant association between geography and brand preference
• Based on this value, the retail store marketing manager can design the ongoing marketing campaigns of different brands for customers/prospects in different geographies
15. Use case 1 : Input Dataset
State             Most frequently bought brand (for product category = clothing)
Punjab            Vero moda
Gujarat           BIBA
Maharashtra       Shoppers Stop
Andhra Pradesh    Arrow
Tamil Nadu        Peter England
Bihar             Mafatlal
Madhya Pradesh    Tommy Hilfiger

Similarly, for any category of product and any demographic variable, such an input dataset can be generated and the Chi square test of independence can be applied to find out the presence/absence of an association
16. Use case 1 : Output
Chi square statistic: 25
P-value: 0.03
The p-value of 0.03 (< 0.05) indicates that there is a significant association between geography and brand preference
Hence the upcoming marketing can be planned taking into consideration the geography of consumers, i.e. different brands can be targeted at different geographies to improve the ROI and in turn revenue
17. Use case 2
Business problem:
• A marketing researcher wants to know if gender has an influence on political party preference
Business benefit:
• Once the test is completed, a p-value is generated which indicates whether there is a significant association between gender and political party preference
• If a significant association is found, then males and females can be targeted with different political campaigns to win their votes for a political client
18. Use case 3
Business problem:
• Suppose a bank's direct marketing campaign manager wants to know if the occupation of consumers/prospects has an influence on the type of banking product/service subscribed
Business benefit:
• Once the test is completed, a p-value is generated which indicates whether there is a significant association between occupation and product/service preference
• If an association is found, then different direct marketing campaigns can be designed for different occupations to increase their response/acquisition rate and the campaign ROI
19. Want to Learn More?
Get in touch with us at support@Smarten.com, and do check out the Learning section on Smarten.com
June 2018