By Huan Yang
Final
CONTENTS
01 Self-Introduction
02 Learning outcomes
03 Research design and The data
04 Regression
05 Appendix
Huan (Joyce) Yang comes from Shenzhen. She graduated from Shenzhen
University, where she majored in Business Administration. The syllabus of her
classes was developed from teaching materials from the Wharton School at the
University of Pennsylvania, which gave her access to world-class business
education resources. She then earned a master's degree in Professional
Accounting and Corporate Governance at City University of Hong Kong. During
her studies, she built a solid foundation in several key areas of marketing,
including Marketing, Mobile Application UX Design, and Marketing Strategies
for Mobile Internet Products. She found that she was very interested in
marketing and recognized that it is one of the most important parts of a
company's daily operation, so she chose the Integrated Marketing program at
NYU to learn more about it.
• GitHub repo: https://github.com/HuanJoyce
• Kaggle notebooks: https://www.kaggle.com/huanjoyceyang
• LinkedIn: https://www.linkedin.com/in/欢-杨-a2b8141b7/
Self-Introduction
Learning outcomes
1. Understand and master currently popular data analysis tools, such as
GitHub repositories.
2. Acquire quantitative and qualitative analysis techniques.
3. Learn how to use statistical analysis in marketing.
Capstone Project Milestone 2: Research Design and The Data
Abstract: Because price is often the primary criterion for customers deciding to buy a house, an appropriate
pricing strategy is critical to real estate developers' sales. To help real estate developers understand the
market and predict house prices, I studied a dataset of house sale prices for King County, covering homes
sold between May 2014 and May 2015. This study uses linear regression to examine the relationship between
price and sqft_living and the relationship between price and sqft_above.
Executive Summary
• Summary of the data sources, the regression model, and the result
This dataset contains house sale prices for King County, which includes Seattle. It covers homes sold
between May 2014 and May 2015.
It is a good dataset for evaluating simple regression models. I chose a linear regression model. The regression
equation is Yi = β0 + β1Xi + β2Xii + ei. The result is that sqft_living and price are correlated, and sqft_above
and price are correlated.
• The URL of my GitHub repo, which stores the Jupyter Notebook associated with this project:
https://github.com/HuanJoyce/NYU_Integrated_Marketing/blob/main/House_Sales_in_King_County%2C_USA_Huan_Yang.ipynb
• The URL of the data source:
https://www.kaggle.com/harlfoxem/housesalesprediction
Scatter plots
Summary of the plots: The plots show a clear linear relationship between sqft_living and price,
and between sqft_above and price.
Regression Results
Yi = β0 + β1Xi + β2Xii + ei
H0: β1 = 0, β2 = 0
Xi = sqft_living
Xii = sqft_above
Yi = price
Summary of the results:
The p-values for both Xi and Xii are below 0.05, so
we can reject the hypothesis that
sqft_living and price are not correlated.
We can also reject the
hypothesis that sqft_above and price
are not correlated.
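As a minimal sketch of how a regression of this form could be fitted, the code below uses NumPy least squares on synthetic data. The generated values, the true coefficients, and the noise level are assumptions for illustration only; they are not taken from the King County dataset or from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real columns (assumed values, not the Kaggle data).
n = 500
sqft_above = rng.uniform(500, 3000, n)
sqft_living = sqft_above + rng.uniform(0, 1000, n)  # living area includes the above-ground area
price = 50_000 + 200 * sqft_living + 50 * sqft_above + rng.normal(0, 20_000, n)

# Design matrix with an intercept column: Yi = b0 + b1*Xi + b2*Xii + ei
X = np.column_stack([np.ones(n), sqft_living, sqft_above])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

# Standard errors and t statistics for each coefficient.
residuals = price - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
cov = sigma2 * np.linalg.inv(X.T @ X)
t_stats = beta / np.sqrt(np.diag(cov))

print("coefficients:", beta)
print("t statistics:", t_stats)
```

A large absolute t statistic for a slope corresponds to a small p-value, which is the basis for rejecting H0 for that coefficient.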
Insights gained from the regression
From the regression, we can conclude that sqft_living and price are
correlated, and that sqft_above and price are correlated. So when making
marketing decisions, we can predict future house prices from sqft_living
and sqft_above. With a better-informed pricing strategy, real estate
developers can attract more consumers and earn a higher income.
Assumptions Check
Assumptions
1. The error term is normally distributed. For each fixed value of X, the
distribution of Y is normal. (Not satisfied)
2. The means of all these normal distributions of Y, given X, lie on a straight
line with slope b. (Satisfied)
3. The mean of the error term is 0. (Satisfied)
4. The variance of the error term is constant. This variance does not depend
on the values assumed by X. (Not satisfied)
5. The error terms are uncorrelated. In other words, the observations have
been drawn independently. (Omitted)
6. The independent variables in X are not correlated. There are no issues of
multicollinearity. (Not satisfied)
Assumptions Check
Summary of the results:
From the plots, we can conclude that
Assumption 4 is not satisfied: the
variance of the error term is not constant
and depends on the values of X.
From the previous scatter plots, we can
also conclude that Assumption 2 is
satisfied: the means of the distributions
of Y, given X, lie on a straight line with
slope b. Because this is not time-series
data, we can omit Assumption 5 in this case.
Assumptions Check
Summary of the results:
P-value = 0.0 < 0.05
So we can reject the hypothesis that the
residuals are normally distributed with
zero mean, and Assumption 1 is not
satisfied.
Assumption 3 is satisfied: the mean of
the error term is 0.
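The slides do not name the normality test that was used. As one possible sketch, the code below checks the residual mean and runs a Kolmogorov-Smirnov test against a normal distribution with zero mean. The residuals here are synthetic and deliberately skewed (an assumption for illustration), so the test should reject normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical residuals: right-skewed and centred at zero (not the notebook's residuals).
residuals = rng.exponential(scale=1.0, size=1000)
residuals -= residuals.mean()

# Assumption 3: the mean of the error term is 0.
print("mean of residuals:", residuals.mean())

# Assumption 1: compare the residuals with a normal distribution N(0, s).
# Estimating s from the same sample makes the p-value approximate.
s = residuals.std(ddof=1)
result = stats.kstest(residuals, "norm", args=(0.0, s))
print("KS statistic:", result.statistic, "p-value:", result.pvalue)
```

A p-value below 0.05 leads to rejecting the hypothesis that the residuals are normal with zero mean, which is the situation reported for Assumption 1.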
Assumptions Check
Summary of the results:
P-value = 0.0 < 0.05
So we can reject the hypothesis that sqft_living and sqft_above are not
correlated, and Assumption 6 is not satisfied.
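A quick way to quantify the multicollinearity is the correlation between the two predictors and the resulting variance inflation factor (VIF). This is a sketch on synthetic data; the values are assumptions and the VIF is an addition for illustration, not something reported in the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic predictors (assumed values): living area contains the above-ground area.
n = 500
sqft_above = rng.uniform(500, 3000, n)
sqft_living = sqft_above + rng.uniform(0, 1000, n)

# Pearson correlation between the two predictors.
r = np.corrcoef(sqft_living, sqft_above)[0, 1]

# With exactly two predictors, the VIF of each one is 1 / (1 - r^2).
vif = 1.0 / (1.0 - r**2)
print(f"r = {r:.3f}, VIF = {vif:.2f}")
```

The closer r is to 1, the larger the VIF and the less stable the individual coefficient estimates become.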
Further Research
Because some assumptions are not satisfied, future research should
consider other data analysis tools, such as generalized regression.
More comprehensive research can provide a better
reference for marketing decisions and make future marketing strategies
more targeted and effective. In addition, with enough budget, we can
test more factors that may influence price to improve the accuracy of
our research.
Appendix
Capstone Project Milestone 3: Hypothesis Testing
Summary of the data sources:
The first dataset relates to the direct marketing campaigns of a Portuguese banking institution. The
second dataset relates to the economy of many countries before and after COVID-19.
The hypotheses and the results:
For the first dataset, the null hypothesis is that the mean balance is equal between those who have a loan
and those who do not. The result is that the mean balance differs between those who have a
loan and those who do not.
For the second dataset, I test whether the same country's GDP in 2018Q2 and 2020Q2 are correlated.
The null hypothesis of no correlation is rejected, so the same country's GDP in 2018Q2 and 2020Q2 are
correlated. I also test whether the mean GDP in 2018Q2 equals the mean GDP in 2020Q2. The result is that
the mean difference between 2018Q2 and 2020Q2 is significantly different from 0.
The URL of my GitHub repo:
https://github.com/HuanJoyce/NYU_Integrated_Marketing
The URLs of the data sources:
https://data.world/data-society/bank-marketing-data
https://stats.oecd.org/index.aspx?queryid=33940#
Two-Sample T-test
Why this test:
Because the data are metric and unpaired, with
two groups, we use a two-sample t-test.
Hypothesis:
The mean balance is equal between those
who have a loan and those who do not.
Conclusion:
p-value = 2.764056e-137 < 0.05
We can reject the null hypothesis that the
mean balance is equal between those who
have a house loan and those who do not
at the 0.05 (or even 0.001) significance
level.
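A two-sample t-test of this kind can be sketched with `scipy.stats.ttest_ind`. The balances below are synthetic, and the group means, spreads, and sizes are assumptions for illustration. The slides do not say whether equal variances were assumed, so Welch's version (`equal_var=False`) is used here as a cautious default.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical account balances for the two groups (not the bank marketing data).
balance_loan = rng.normal(loc=800, scale=500, size=300)
balance_no_loan = rng.normal(loc=1400, scale=500, size=300)

# H0: the two groups have the same mean balance.
t_stat, p_value = stats.ttest_ind(balance_loan, balance_no_loan, equal_var=False)
print("t statistic:", t_stat, "p-value:", p_value)

if p_value < 0.05:
    print("Reject H0: the mean balances differ.")
else:
    print("Cannot reject H0.")
```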
Test for multivariate normality
Why this test:
The result of this test determines whether we
use Spearman correlation or Pearson
correlation.
Hypothesis:
The data follow a multivariate normal distribution.
Conclusion:
p-value = 0.1093 > 0.05
We cannot reject the null hypothesis that the
data follow a multivariate normal distribution
at the 0.05 (or even 0.001) significance level.
Pearson correlation
Why this test:
Because the data have no outliers and are normally
distributed, and we want to test the correlation
between GDP in 2018Q2 and 2020Q2, we
use Pearson correlation.
Hypothesis:
The same country's GDP in 2018Q2 and 2020Q2
are not correlated (null hypothesis).
Conclusion:
p-value = 0.0 < 0.05
We can reject the null hypothesis that the same
country's GDP in 2018Q2 and 2020Q2 are not
correlated at the 0.05 (or even 0.001) significance
level. It means GDP in 2018Q2 and GDP in
2020Q2 are correlated.
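As a sketch of the Pearson test, `scipy.stats.pearsonr` returns the correlation coefficient and a p-value for the null hypothesis of zero correlation, so a small p-value is evidence that the two series are correlated. The GDP figures below are synthetic assumptions, not the OECD data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical GDP values for 40 countries (assumed, not the OECD data).
gdp_2018q2 = rng.normal(loc=1000, scale=200, size=40)
# Each country's 2020Q2 value is roughly 95% of its 2018Q2 value.
gdp_2020q2 = gdp_2018q2 * rng.normal(loc=0.95, scale=0.03, size=40)

# H0: the correlation between the two quarters is zero.
r, p_value = stats.pearsonr(gdp_2018q2, gdp_2020q2)
print("r:", r, "p-value:", p_value)
```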
Paired T-test
Why this test:
Because the data are metric, and we want
to compare the mean of the same group
in two different periods.
Hypothesis:
The mean GDP in 2018Q2 is equal to the mean GDP
in 2020Q2.
Conclusion:
p-value = 0.0 < 0.05
We can reject the null hypothesis that the
mean GDP in 2018Q2 is equal to the mean GDP in
2020Q2 at the 0.05 (or even 0.001) significance
level. The mean difference between 2018Q2 and
2020Q2 is significantly different from 0.
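The paired comparison can be sketched with `scipy.stats.ttest_rel`, which tests whether the mean of the per-country differences is zero. The values are synthetic assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical GDP for the same 40 countries in two periods (assumed values).
gdp_2018q2 = rng.normal(loc=1000, scale=200, size=40)
gdp_2020q2 = gdp_2018q2 * rng.normal(loc=0.95, scale=0.03, size=40)

# H0: the mean of the paired differences is 0.
t_stat, p_value = stats.ttest_rel(gdp_2018q2, gdp_2020q2)
mean_diff = np.mean(gdp_2018q2 - gdp_2020q2)
print("mean difference:", mean_diff, "t:", t_stat, "p-value:", p_value)
```

Because each country is compared with itself, the test uses the spread of the differences rather than the spread across countries, which is why it suits before/after data.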
Power Analysis
Conclusion: For a Cohen's d effect size of 0.2, a power of 0.80, and a Type I error rate of 0.05, we need a
sample size of 393 for each group.
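The sample size can be approximated by hand with the normal-approximation formula n = 2 * ((z_(1-alpha/2) + z_power) / d)^2 per group. The sketch below uses only the Python standard library; the function name is hypothetical, and the slides do not state which tool produced the original figure, so an exact t-distribution calculation may differ by about one observation.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the Type I error rate
    z_power = z.inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


n = sample_size_per_group(0.2)
print(n)  # 393
```

Halving the effect size roughly quadruples the required sample, which is why small effects such as d = 0.2 need several hundred observations per group.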
The limitations of the current result
There are some limitations to the current result. The data on the direct marketing campaigns of a
Portuguese banking institution were collected only once: only the data after the campaign were
collected, not the data before it. So it is hard for us to use the paired t-test to
analyze the campaign's effectiveness. This limits our ability to make sound business
decisions, because we cannot know whether the campaign was really effective.
Plan of future research
To address these limitations, I need to do further research and collect new data for a more comprehensive
analysis. If we want to assess the campaign's effectiveness more clearly by comparison, while keeping
the Type II error small and Cohen's d = 0.2, the new sample size will need to be 393, and the data will
need to be collected twice, before and after the campaign.
Capstone Project Milestone 4: Regression
Executive Summary
• Summary of the data sources, the regression model, and the result
This dataset is about predicting whether a customer will change telecommunications provider, something known
as "churning". The training dataset contains 4250 samples. Each sample contains 19 features and 1 boolean
variable "churn", which indicates the class of the sample.
I chose a linear regression model. The regression equation is Yi = β0 + β1Xi + β2Xii + ei. The result is that
total night minutes and total day minutes are not correlated, and total night minutes and total evening
minutes are not correlated.
• The URL of my GitHub repo, which stores the Jupyter Notebook associated with this project:
https://github.com/HuanJoyce/NYU_Integrated_Marketing/blob/main/“Huan_Yang_Predicting_Customer_Churn_ipynb”.ipynb
• The URL of the data source:
https://www.kaggle.com/c/customer-churn-prediction-2020
Scatter plots
Summary of the plots: From the plots, we can conclude that total day minutes and total
night minutes have a clear linear relationship, as do total evening minutes and total night
minutes.
Regression Results
Yi = β0 + β1Xi + β2Xii + ei
H0: β1 = 0, β2 = 0
Xi = total day minutes
Xii = total eve minutes
Yi = total night minutes
Summary of the results:
The p-values for both Xi and Xii are above 0.05,
so we cannot reject the
hypothesis that total night
minutes and total day minutes
are not correlated. We
also cannot reject the
hypothesis that total night
minutes and total evening
minutes are not correlated.
Insights gained from the regression
From the regression, we find no evidence that total night minutes are
correlated with total day minutes or with total evening minutes. So when
making marketing decisions, we do not need to worry that changes in total
day minutes or total evening minutes will influence total night minutes.
We can therefore run promotions, such as packages of day and evening
minutes, to encourage customers to make longer calls and to keep them
from changing providers.
Assumptions Check
Assumptions
1. The error term is normally distributed. For each fixed value of X, the
distribution of Y is normal.
2. The means of all these normal distributions of Y, given X, lie on a straight
line with slope b.
3. The mean of the error term is 0.
4. The variance of the error term is constant. This variance does not depend
on the values assumed by X.
5. The error terms are uncorrelated. In other words, the observations have
been drawn independently.
6. The independent variables in X are not correlated. There are no issues of
multicollinearity.
Assumptions Check
Summary of the results:
From the plot of residuals against
predictions, the spread of the residuals
does not appear to depend on the
predicted values, so Assumption 4 is
satisfied: the variance of the error term
is constant and does not depend on the
values of X. From the previous scatter
plots, we can also conclude that
Assumption 2 is satisfied: the means of
the distributions of Y, given X, lie on a
straight line with slope b. Because this is
not time-series data, we can omit
Assumption 5 in this case.
Assumptions Check
Summary of the results:
P-value = 0.77318 > 0.05
So we cannot reject the hypothesis that
the residuals are normally distributed
with zero mean, and Assumption 1 and
Assumption 3 are satisfied: the error
term is normally distributed (for each
fixed value of X, the distribution of Y is
normal), and the mean of the error term is 0.
Assumptions Check
Summary of the results:
P-value = 0.388 > 0.05
So we cannot reject the hypothesis that total day minutes and total eve minutes
are not correlated, and Assumption 6 is satisfied: the independent variables in X
are not correlated, and there are no issues of multicollinearity.
Further Research
Although our existing data includes 19 features, future research should
consider more influencing factors, such as customer
age and income. More comprehensive research can provide a
better reference for marketing decisions and make future marketing
strategies more targeted and effective. In addition, with enough
budget, we can test a larger sample to improve the accuracy of
our research.
Capstone Project Milestone 5: Customer Segmentation
Executive Summary
• Summary of the data sources, the statistical method, and the result
This is a transnational dataset that contains all the transactions occurring between 01/12/2010 and
09/12/2011 for a UK-based, registered non-store online retailer. The company mainly sells unique
all-occasion gifts, and many of its customers are wholesalers. We use this online retail transaction
dataset to build an RFM clustering and choose the best set of customers for the company to target.
I chose K-Means Clustering and Hierarchical Clustering. The result is that the best K is 3. K-Means
Clustering returns 57 target customers, while Hierarchical Clustering returns 2 target customers, a
much smaller group than the one K-Means Clustering returns.
• The URL of my Kaggle Notebook:
https://www.kaggle.com/huanjoyceyang/customer-segementation-hy2208
• The URL of the data source:
https://www.kaggle.com/hellbuoy/online-retail-customer-clustering
Background
This analysis uses only the data for France.
K-Means Clustering
When metric = “distortion”: computes the sum
of squared distances from each point to its
assigned center
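The distortion score described above can be computed directly. This is a minimal NumPy sketch on a hand-made toy dataset (the points and labels are assumptions for illustration); it is not the plotting library's own implementation.

```python
import numpy as np


def distortion(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))


# Toy data: two tight pairs of points, far apart from each other.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# Each center is the mean of the points assigned to it.
centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(distortion(points, centers, labels))  # 4.0
```

Distortion always decreases as K grows, so the usual practice is to look for the "elbow" where the decrease levels off.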
K-Means Clustering
When metric = “silhouette”: calculates the mean
Silhouette Coefficient of all samples.
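For a sample i, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes the mean over all samples with NumPy on a toy dataset; the data and the helper name are assumptions for illustration.

```python
import numpy as np


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = dist[i, same].mean()             # mean distance within own cluster
        b = min(dist[i, labels == k].mean()  # mean distance to nearest other cluster
                for k in np.unique(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = mean_silhouette(points, np.array([0, 0, 1, 1]))  # natural grouping
bad = mean_silhouette(points, np.array([0, 1, 0, 1]))   # mixes the two groups
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.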
K-Means Clustering
When metric = “calinski_harabasz”:
computes the ratio of dispersion between and
within clusters
K-Means Clustering
Because we cannot determine the best K when we set metric = “calinski_harabasz”,
we find the best K based on the second metric, metric = “silhouette”.
So the best K is 3.
K-Means Clustering
By the RFM criteria, we should choose the customer clusters with lower
recency and higher frequency and amount. From the K-means clustering results,
we can see that customers with Cluster_Id=2 best fit the criteria.
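The three RFM values used in these criteria can be derived from raw transactions. The sketch below uses only the standard library; the transactions, customer IDs, and snapshot date are hypothetical, and frequency is counted per transaction row, whereas the original notebook may count invoices instead.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, invoice_date, amount).
transactions = [
    ("C1", date(2011, 12, 1), 120.0),
    ("C1", date(2011, 11, 15), 80.0),
    ("C2", date(2011, 6, 1), 40.0),
    ("C3", date(2011, 12, 5), 300.0),
    ("C3", date(2011, 10, 1), 150.0),
    ("C3", date(2011, 9, 1), 50.0),
]
snapshot = date(2011, 12, 9)  # assumed reference date for recency

rfm = defaultdict(lambda: {"recency": None, "frequency": 0, "amount": 0.0})
for customer, day, amount in transactions:
    row = rfm[customer]
    days_ago = (snapshot - day).days
    # Recency: days since the customer's most recent purchase.
    row["recency"] = days_ago if row["recency"] is None else min(row["recency"], days_ago)
    row["frequency"] += 1     # Frequency: number of purchases
    row["amount"] += amount   # Monetary: total amount spent

for customer, row in sorted(rfm.items()):
    print(customer, row)
```

In this toy data, C3 has the lowest recency and the highest frequency and amount, so it is the kind of customer the criteria would select.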
K-Means Clustering
We can see that K-Means Clustering
returns 57 target customers.
Hierarchical Clustering
Visualize Tree by Three Linkage Methods
Hierarchical Clustering
Perform Clustering by choosing best k=3
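Cutting a hierarchical tree into k=3 clusters can be sketched with SciPy. The slides do not say which three linkage methods were compared, so single, complete, and average linkage are assumed here, and the data are synthetic blobs rather than the scaled RFM table.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(6)

# Three well-separated synthetic groups in a 3-feature space (assumed data).
centers = np.array([[0, 0, 0], [5, 5, 5], [10, 0, 5]], dtype=float)
X = np.vstack([c + rng.normal(0, 0.3, size=(20, 3)) for c in centers])

sizes = {}
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # build the tree
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut it into at most 3 clusters
    sizes[method] = sorted(np.bincount(labels)[1:].tolist())  # labels start at 1
    print(method, sizes[method])
```

On clean, separated data all three methods agree. On real RFM data, single linkage tends to chain points together and can produce very small clusters, which is one reason cluster sizes can differ from K-Means.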
Hierarchical Clustering
By the RFM criteria, we should choose the customer clusters with lower recency and higher
frequency and amount. From the hierarchical clustering results, we can see that customers with
Cluster_Labels=2 have a higher frequency and customers with Cluster_Labels=1 have a higher
amount. So when we pick target customers, we need to base the choice on our company's needs.
Hierarchical Clustering
When our company pays more attention to
frequency, we choose Cluster_Id=2 to fit the
criteria.
We can see that Hierarchical Clustering
returns 2 target customers, a
much smaller group than the one
K-Means Clustering returns.
Hierarchical Clustering
When our company pays more attention to
amount, we choose Cluster_Id=1 to fit the
criteria.
We can see that Hierarchical Clustering still
returns 2 target customers, a
much smaller group than the one
K-Means Clustering returns.
That’s all. Thank you!

➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 

Hy2208 final

  • 2. CONTENTS: 01 Self-Introduction; 02 Learning outcomes; 03 Research design and the data; 04 Regression; 05 Appendix
  • 3. Self-Introduction Huan (Joyce) Yang comes from Shenzhen. She graduated from Shenzhen University with a major in Business Administration. Her class syllabi were developed from teaching materials from the Wharton School at the University of Pennsylvania, which gave her access to world-class business education resources. She then earned a master's degree in Professional Accounting and Corporate Governance at City University of Hong Kong. During her studies, she built a solid foundation in several key areas of marketing, including Marketing, Mobile Application UX Design, and Marketing Strategies for Mobile Internet Products. She found that she was very interested in marketing and recognized that it is one of the most important parts of a company's daily operations, so she chose the Integrated Marketing program at NYU to learn more about it. • GitHub repo: https://github.com/HuanJoyce • Kaggle notebooks: https://www.kaggle.com/huanjoyceyang • LinkedIn: https://www.linkedin.com/in/欢-杨-a2b8141b7/
  • 4. Learning outcomes: understand and master currently popular data analysis tools, such as GitHub repositories; acquire quantitative and qualitative techniques; learn how to use statistical analysis in marketing.
  • 5. Capstone Project Milestone 2: Research Design and the Data Abstract: Because price is often the primary criterion for buyers, an appropriate pricing strategy is critical to real estate developers' sales. To help real estate developers understand the market and predict house prices, I studied a dataset of house sale prices in King County for homes sold between May 2014 and May 2015. This study uses linear regression to examine the relationship between price and sqft_living and the relationship between price and sqft_above.
  • 6. Executive Summary • Summary of the data sources, the regression model, and the result: This dataset contains house sale prices for King County, which includes Seattle, for homes sold between May 2014 and May 2015. It is a great dataset for evaluating simple regression models. I chose a linear regression model. The regression equation is Yi = β0 + β1Xi + β2Xii + ei. The result is that sqft_living and price are correlated, and sqft_above and price are correlated. • The URL to my GitHub repo that stores the Jupyter Notebook associated with this project: https://github.com/HuanJoyce/NYU_Integrated_Marketing/blob/main/House_Sales_in_King_County%2C_USA_Huan_Yang.ipynb • The URL to the data source: https://www.kaggle.com/harlfoxem/housesalesprediction
  • 7. Scatter plots Summary of the plots: From the plots, we can conclude that sqft_living and price have an obvious linear relationship, and that sqft_above and price have an obvious linear relationship.
  • 8. Regression Results Yi = β0 + β1Xi + β2Xii + ei, where Yi = price, Xi = sqft_living, and Xii = sqft_above. H0: β1 = 0, β2 = 0. Summary of the results: The p-values for both Xi and Xii are below 0.05, so we can reject the hypothesis that sqft_living and price are not correlated. We can also reject the hypothesis that sqft_above and price are not correlated.
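The two-predictor model on this slide can be illustrated with a small least-squares fit. This is a minimal sketch in plain Python, not the notebook's code: the helper names `solve_linear_system` and `fit_ols` are hypothetical, and the toy data are made up so that the true coefficients are known in advance.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(x1, x2, y):
    """Return (b0, b1, b2) minimising squared error of y = b0 + b1*x1 + b2*x2."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Toy data (made up), generated exactly from y = 10 + 2*x1 + 3*x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [10 + 2 * u + 3 * v for u, v in zip(x1, x2)]
b0, b1, b2 = fit_ols(x1, x2, y)
print(b0, b1, b2)
```

On real data a statistics library would normally be used instead, since it also reports the standard errors and p-values that the slide relies on.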
  • 9. Insights gained from the regression From the regression, we can conclude that sqft_living and price are correlated, and that sqft_above and price are correlated. So when making marketing decisions, we can predict future house prices from sqft_living and sqft_above. With a better-informed pricing strategy, real estate developers can attract more consumers and earn a higher income.
  • 10. Assumptions Check Assumptions: 1. The error term is normally distributed. For each fixed value of X, the distribution of Y is normal. (Not satisfied) 2. The means of all these normal distributions of Y, given X, lie on a straight line with slope b. (Satisfied) 3. The mean of the error term is 0. (Satisfied) 4. The variance of the error term is constant. This variance does not depend on the values assumed by X. (Not satisfied) 5. The error terms are uncorrelated. In other words, the observations have been drawn independently. (Omitted) 6. The independent variables in X are not correlated. There are no multicollinearity issues. (Not satisfied)
  • 11. Assumptions Check Summary of the results: From the plots, we can conclude that Assumption 4 is not satisfied. The variance of the error term is not constant; it depends on the values assumed by X. From the previous scatter plots, we can also conclude that Assumption 2 is satisfied. The means of all these normal distributions of Y, given X, lie on a straight line with slope b. Because this is not time-series data, we can omit Assumption 5 in this case.
  • 12. Assumptions Check Summary of the results: P-value = 0.0 < 0.05, so we can reject the hypothesis that the residuals are normally distributed with zero mean. So Assumption 1 is not satisfied. Assumption 3 is satisfied: the mean of the error term is 0.
  • 13. Assumptions Check Summary of the results: P-value = 0.0 < 0.05, so we can reject the hypothesis that sqft_living and sqft_above are not correlated. So Assumption 6 is not satisfied.
  • 14. Further Research Because some assumptions are not satisfied, future research should consider other data analysis tools, such as generalized regression. More comprehensive research can provide a better reference for marketing decisions and make future marketing strategies more targeted and effective. In addition, with enough budget, we can test more factors that may influence the price to improve the accuracy of our research.
  • 16. Capstone Project Milestone 2: Research Design and the Data
  • 17. By Huan Yang Capstone Project Milestone 3: Hypothesis Testing
  • 18. Summary of the data sources: The first dataset relates to the direct marketing campaigns of a Portuguese banking institution. The second dataset relates to the economy before and after COVID-19 in many countries. The hypotheses and the results: For the first dataset, I hypothesize that the mean balance is equal between those who have a loan and those who do not. The result is that the mean balance is not equal between the two groups. For the second dataset, I hypothesize that the same country's GDP is correlated between 2018Q2 and 2020Q2. The result is that the same country's GDP is not correlated between 2018Q2 and 2020Q2. In addition, I hypothesize that the mean GDP in 2018Q2 equals the mean GDP in 2020Q2. The result is that the mean difference between 2018Q2 and 2020Q2 is significantly different from 0. The URL to my GitHub repo: https://github.com/HuanJoyce/NYU_Integrated_Marketing The URLs to the data sources: https://data.world/data-society/bank-marketing-data https://stats.oecd.org/index.aspx?queryid=33940#
  • 19. Two-Sample T-test Why this test: The data are metric and unpaired, and there are two groups, so a two-sample t-test is used. Hypothesis: The mean balance is equal between those who have a loan and those who do not. Conclusion: p-value = 2.764056e-137 < 0.05. We can reject the null hypothesis that the mean balance is equal between those who have a loan and those who do not at the 0.05 (or even 0.001) significance level.
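The statistic behind a two-sample comparison of this kind can be sketched as follows. This uses Welch's variant, which does not assume equal variances in the two groups; the notebook may have used the pooled-variance version instead. The function name `welch_t` and the balance figures are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error contributed by each group.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical account balances for customers with and without a loan.
with_loan = [120.0, 300.0, 80.0, 150.0, 210.0]
without_loan = [900.0, 1500.0, 700.0, 1100.0, 1300.0]
t_stat, dof = welch_t(with_loan, without_loan)
print(t_stat, dof)
```

The p-value is then read from a t distribution with `dof` degrees of freedom, which a statistics library does in one call.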
  • 20. Test for multivariate normality Why this test: The choice between Spearman correlation and Pearson correlation depends on this test. Hypothesis: The data are multivariate normal. Conclusion: p-value = 0.1093 > 0.05. We cannot reject the null hypothesis that the data are multivariate normal at the 0.05 significance level.
  • 21. Pearson correlation Why this test: Because the data have no outliers and are normally distributed, and we want to test the correlation between GDP in 2018Q2 and 2020Q2, we use the Pearson correlation. Hypothesis: The same country's GDP is correlated between 2018Q2 and 2020Q2. Conclusion: p-value = 0.0 < 0.05. We can reject the null hypothesis that the same country's GDP is correlated between 2018Q2 and 2020Q2 at the 0.05 (or even 0.001) significance level. This means there is no correlation between GDP in 2018Q2 and 2020Q2.
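A Pearson correlation and its test statistic could be computed roughly as follows. This is a minimal sketch in plain Python rather than the notebook's code; the GDP figures are made-up placeholders and `pearson_r` and `correlation_t_stat` are hypothetical helper names. In the usual formulation of this test the null hypothesis is that the correlation is zero, so a small p-value is read as evidence of a correlation; the direction of the hypothesis is worth checking against the notebook.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_t_stat(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# Hypothetical GDP values for five countries in two quarters.
gdp_2018q2 = [100.0, 250.0, 400.0, 820.0, 1500.0]
gdp_2020q2 = [95.0, 240.0, 370.0, 800.0, 1390.0]
r = pearson_r(gdp_2018q2, gdp_2020q2)
t = correlation_t_stat(r, len(gdp_2018q2))
print(r, t)
```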
  • 22. Paired T-test Why this test: The data are metric, and we want to compare the mean of the same group across two different periods. Hypothesis: The mean GDP in 2018Q2 is equal to the mean GDP in 2020Q2. Conclusion: p-value = 0.0 < 0.05. We can reject the null hypothesis that the mean GDP in 2018Q2 equals the mean GDP in 2020Q2 at the 0.05 (or even 0.001) significance level. The mean difference between 2018Q2 and 2020Q2 is significantly different from 0.
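The paired comparison works on the per-unit differences between the two periods. A minimal sketch, with made-up numbers and a hypothetical helper name `paired_t`:

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for H0: mean difference = 0."""
    # Each unit is compared with itself, so only the differences matter.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


# Hypothetical values for the same five units in two periods.
period_1 = [100.0, 250.0, 400.0, 820.0, 1500.0]
period_2 = [95.0, 240.0, 370.0, 800.0, 1390.0]
t_paired, df_paired = paired_t(period_1, period_2)
print(t_paired, df_paired)
```

Because each unit serves as its own control, this design needs both measurements for every unit.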
  • 23. Power Analysis Conclusion: For a Cohen's d effect size of 0.2, a power of 0.80, and a Type I error rate of 0.05, we need a sample size of 393 for each group.
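The sample size on this slide can be approximated with the standard normal-approximation formula for a two-sided, two-group comparison, n = 2((z_{1-α/2} + z_{power}) / d)^2 per group. This is a sketch under that approximation, not necessarily the method used in the notebook; an exact calculation based on the t distribution gives a marginally larger value. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the Type I error
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n = n_per_group(0.2)
print(n)  # 393
```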
  • 24. The limitations of the current result There are some limitations to the current result. The data on the direct marketing campaigns of a Portuguese banking institution were collected only once: data were collected after the campaign but not before it. This makes it hard to use a paired t-test to analyze the campaign's effectiveness, and it limits our ability to make sound business decisions because we cannot know whether the campaign was really effective. Plan of future research To address these limitations, I need to do further research and collect new data for a more comprehensive analysis. To analyze the campaign's effectiveness by comparison while keeping the Type II error small and Cohen's d = 0.2, the new sample size needs to be 393, and the data need to be collected twice, before and after the campaign.
  • 25. By Huan Yang Capstone Project Milestone 4: Regression
  • 26. Executive Summary • Summary of the data sources, the regression model, and the result: This dataset is about predicting whether a customer will change telecommunications provider, known as "churning". The training dataset contains 4250 samples. Each sample contains 19 features and 1 boolean variable, "churn", which indicates the class of the sample. I chose a linear regression model. The regression equation is Yi = β0 + β1Xi + β2Xii + ei. The result is that total night minutes and total day minutes are not correlated, and total night minutes and total evening minutes are not correlated. • The URL to my GitHub repo that stores the Jupyter Notebook associated with this project: https://github.com/HuanJoyce/NYU_Integrated_Marketing/blob/main/“Huan_Yang_Predicting_Customer_Churn_ipynb”.ipynb • The URL to the data source: https://www.kaggle.com/c/customer-churn-prediction-2020
  • 27. Scatter plots Summary of the plots: From the plots, we can conclude that total day minutes and total night minutes have an obvious linear relationship, and that total evening minutes and total night minutes have an obvious linear relationship.
  • 28. Regression Results Yi = β0 + β1Xi + β2Xii + ei, where Yi = total night minutes, Xi = total day minutes, and Xii = total eve minutes. H0: β1 = 0, β2 = 0. Summary of the results: The p-values for both Xi and Xii are above 0.05, so we cannot reject the hypothesis that total night minutes and total day minutes are not correlated. We also cannot reject the hypothesis that total night minutes and total evening minutes are not correlated.
  • 29. Insights gained from the regression From the regression, we can conclude that total night minutes and total day minutes are not correlated, and that total night minutes and total evening minutes are not correlated. So when making marketing decisions, we do not need to worry that changes in total day minutes or total evening minutes will influence total night minutes. We can therefore run promotions, such as packages of day and evening minutes, to encourage customers to make longer calls and to keep them from changing providers.
  • 30. Assumptions Check Assumptions: 1. The error term is normally distributed. For each fixed value of X, the distribution of Y is normal. 2. The means of all these normal distributions of Y, given X, lie on a straight line with slope b. 3. The mean of the error term is 0. 4. The variance of the error term is constant. This variance does not depend on the values assumed by X. 5. The error terms are uncorrelated. In other words, the observations have been drawn independently. 6. The independent variables in X are not correlated. There are no multicollinearity issues.
  • 31. Assumptions Check Summary of the results: From the plots, we can conclude that residuals and predictions have a linear relationship, so Assumption 4 is satisfied. The variance of the error term is constant and does not depend on the values assumed by X. From the previous scatter plots, we can also conclude that Assumption 2 is satisfied. The means of all these normal distributions of Y, given X, lie on a straight line with slope b. Because this is not time-series data, we can omit Assumption 5 in this case.
  • 32. Assumptions Check Summary of the results: P-value = 0.77318 > 0.05, so we cannot reject the hypothesis that the residuals are normally distributed with zero mean. So Assumption 1 and Assumption 3 are satisfied. The error term is normally distributed. For each fixed value of X, the distribution of Y is normal. The mean of the error term is 0.
  • 33. Assumptions Check Summary of the results: P-value = 0.388 > 0.05, so we cannot reject the hypothesis that total day minutes and total eve minutes are not correlated. So Assumption 6 is satisfied. The independent variables in X are not correlated. There are no multicollinearity issues.
  • 34. Further Research Although our existing data includes 19 features, future research should also consider more influencing factors, such as customer age and income. More comprehensive research can provide a better reference for marketing decisions and make future marketing strategies more targeted and effective. In addition, with enough budget, we can test a larger sample to improve the accuracy of our research.
  • 35. By Huan Yang Capstone Project Milestone 5: Customer Segmentation
  • 36. Executive Summary • Summary of the data sources, the statistical method, and the result: This is a transnational dataset that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. We use this online retail dataset to build an RFM clustering and choose the best set of customers for the company to target. I chose K-Means Clustering and Hierarchical Clustering. The result is that the best K is 3. K-Means Clustering returns 57 target customers, while Hierarchical Clustering returns 2 target customers, a much smaller group than the one K-Means Clustering returns. • The URL to my Kaggle Notebook: https://www.kaggle.com/huanjoyceyang/customer-segementation-hy2208 • The URL to the data source: https://www.kaggle.com/hellbuoy/online-retail-customer-clustering
  • 37. Background This analysis is based on the data for France.
  • 38. K-Means Clustering When metric = "distortion": computes the sum of squared distances from each point to its assigned center.
  • 39. K-Means Clustering When metric = "silhouette": calculates the mean Silhouette Coefficient of all samples.
  • 40. K-Means Clustering When metric = "calinski_harabasz": computes the ratio of dispersion between and within clusters.
  • 41. K-Means Clustering Because we cannot get a best K when we set metric = "calinski_harabasz", we need to find the best K based on the second metric, metric = "silhouette". The result is that the best K is 3.
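The "distortion" metric described above is simple enough to compute by hand. The sketch below uses made-up 2-D points and a hypothetical function name `distortion`; it is not the notebook's code, and the metric names on the slides appear to correspond to the options of an elbow-plot helper rather than to anything defined here.

```python
def distortion(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        # A point is assigned to whichever center is closest to it.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


# Two well-separated pairs of points (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
one_cluster = distortion(points, [(5.0, 5.5)])
two_clusters = distortion(points, [(0.0, 0.5), (10.0, 10.5)])
print(one_cluster, two_clusters)  # 201.0 1.0
```

Distortion always shrinks as K grows, which is why it is read through an elbow in the curve rather than by taking its minimum.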
  • 42. K-Means Clustering By the RFM criteria, we should choose the customer clusters with lower recency and higher frequency and amount. From the K-means clustering results, we can see that customers with Cluster_Id=2 best fit the criteria.
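The three RFM quantities can be derived from raw transactions along these lines. This is a sketch under assumptions: the tuple layout, the customer and invoice identifiers, and the snapshot date are all made up for illustration, and `rfm` is a hypothetical helper rather than the notebook's function.

```python
from datetime import date

# Hypothetical rows: (customer id, invoice id, invoice date, amount).
transactions = [
    ("C1", "INV1", date(2011, 12, 1), 50.0),
    ("C1", "INV2", date(2011, 12, 8), 70.0),
    ("C2", "INV3", date(2011, 6, 1), 20.0),
]


def rfm(rows, snapshot):
    """Map each customer to (recency in days, number of invoices, total amount)."""
    table = {}
    for customer, invoice, day, amount in rows:
        entry = table.setdefault(
            customer, {"last": day, "invoices": set(), "amount": 0.0}
        )
        entry["last"] = max(entry["last"], day)  # most recent purchase date
        entry["invoices"].add(invoice)           # distinct invoices = frequency
        entry["amount"] += amount                # total spend = monetary value
    return {
        c: ((snapshot - e["last"]).days, len(e["invoices"]), e["amount"])
        for c, e in table.items()
    }


scores = rfm(transactions, date(2011, 12, 9))
print(scores["C1"])  # (1, 2, 120.0)
```

Here customer C1 has the lower recency and the higher frequency and amount, which is the profile the criteria favour.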
  • 43. K-Means Clustering We can see that K-Means Clustering returns 57 target customers.
  • 44. Hierarchical Clustering Trees visualized with three linkage methods
  • 46. Hierarchical Clustering By the RFM criteria, we should choose the customer clusters with lower recency and higher frequency and amount. From the hierarchical clustering results, we can see that customers with Cluster_Labels=2 have a higher frequency and customers with Cluster_Labels=1 have a higher amount. So the choice of target customers depends on our company's priorities.
  • 47. Hierarchical Clustering When our company pays more attention to frequency, we choose Cluster_Id=2 to fit the criteria. We can see that Hierarchical Clustering returns 2 target customers, a much smaller group than the one K-Means Clustering returns.
  • 48. Hierarchical Clustering When our company pays more attention to amount, we choose Cluster_Id=1 to fit the criteria. We can see that Hierarchical Clustering still returns 2 target customers, a much smaller group than the one K-Means Clustering returns.