Hy2208 final

CONTENTS
01 Self-Introduction
02 Learning outcomes
03 Research design and the data
04 Regression
05 Appendix

Self-Introduction
Huan (Joyce) Yang comes from Shenzhen. She graduated from Shenzhen
University, where she majored in Business Administration. The syllabus of her
classes was developed from teaching materials of the Wharton School at the
University of Pennsylvania, which gave her access to world-class business
education resources. She then earned a master's degree in Professional Accounting
and Corporate Governance at City University of Hong Kong. During her studies,
she built a solid foundation in several key areas of marketing, including
Marketing, Mobile Application UX Design, and Marketing Strategies for Mobile
Internet Products. She found that she was very interested in
marketing and recognized that it is one of the most important parts of
a company's daily operations, so she chose the Integrated Marketing program
at NYU to learn more about marketing.
• GitHub repo link: https://github.com/HuanJoyce
• Kaggle notebook link(s): https://www.kaggle.com/huanjoyceyang
• LinkedIn URL: https://www.linkedin.com/in/欢-杨-a2b8141b7/

Learning outcomes
• Understand and master currently popular data analysis tools, such as
GitHub repositories.
• Acquire quantitative and qualitative analysis techniques.
• Learn how to use statistical analysis in marketing.

Capstone Project Milestone 2: Research Design and The Data
Abstract: Because price is often the primary criterion for customers deciding whether to buy a house, an
appropriate pricing strategy is critical to a real estate developer's sales. To help real
estate developers understand the market and predict house prices, I studied a dataset of
house sale prices for King County, covering homes sold between May 2014 and May 2015. This
study uses linear regression to examine the relationship between price and sqft_living
and the relationship between price and sqft_above.

Executive Summary
• Summary of the data sources, the regression model chosen, and the result
This dataset contains house sale prices for King County, which includes Seattle. It covers homes sold
between May 2014 and May 2015.
It is a good dataset for evaluating simple regression models. I chose a linear regression model. The regression
equation is Yi = β0 + β1Xi + β2Xii + ei. The result is that sqft_living and price are correlated, and sqft_above and price
are correlated.
• The URL to my GitHub repo that stores the Jupyter Notebook associated with this project:
https://github.com/HuanJoyce/NYU_Integrated_Marketing/blob/main/House_Sales_in_King_County%2C_USA_Huan_Yang.ipynb
• The URL to the data source:
https://www.kaggle.com/harlfoxem/housesalesprediction

Scatter plots
Summary of the plots: The plots show a clear linear relationship between sqft_living and
price, and a clear linear relationship between sqft_above and price.

Regression Results
Yi = β0 + β1Xi + β2Xii + ei
H0: β1 = 0, β2 = 0
Xi = sqft_living
Xii = sqft_above
Yi = price
Summary of the results:
The p-values for both Xi and Xii are below 0.05, so
we can reject the hypothesis that
sqft_living and price are not correlated.
We can also reject the
hypothesis that sqft_above and price
are not correlated.
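The two-predictor model above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are synthetic stand-ins (not the King County dataset), the "true" coefficients are made up, and the notebook behind these slides may well use a different library. The sketch fits Yi = β0 + β1Xi + β2Xii + ei by least squares and computes the t-statistic used to test H0: βj = 0.

```python
import numpy as np

# Synthetic stand-in data (NOT the King County dataset); all numbers are assumptions.
rng = np.random.default_rng(0)
n = 500
sqft_living = rng.uniform(500, 4000, n)
sqft_above = 0.8 * sqft_living + rng.normal(0, 200, n)
price = 50_000 + 200 * sqft_living + 50 * sqft_above + rng.normal(0, 20_000, n)

# Design matrix: intercept column, Xi (sqft_living), Xii (sqft_above).
X = np.column_stack([np.ones(n), sqft_living, sqft_above])

# Ordinary least squares estimate of (beta0, beta1, beta2).
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

# Residuals and the usual standard errors of the coefficients.
residuals = price - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistics for H0: beta_j = 0; a large |t| corresponds to a small p-value.
t_stats = beta / std_err
print("coefficients:", beta)
print("t-statistics:", t_stats)
```

A p-value below 0.05 for a coefficient corresponds to an absolute t-statistic of roughly 2 or more at this sample size.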

Insights gained from the regression
From the regression, we can conclude that both sqft_living and sqft_above are
correlated with price. So for marketing decisions, we can
predict future house prices from sqft_living and sqft_above. If
real estate developers use a better-informed pricing strategy, they will
attract more consumers and earn a higher income.

Assumptions Check
Assumptions
1. The error term is normally distributed. For each fixed value of X, the
distribution of Y is normal. (Not satisfied)
2. The means of all these normal distributions of Y, given X, lie on a straight
line with slope b. (Satisfied)
3. The mean of the error term is 0. (Satisfied)
4. The variance of the error term is constant. This variance does not depend
on the values assumed by X. (Not satisfied)
5. The error terms are uncorrelated. In other words, the observations have
been drawn independently. (Omitted)
6. The independent variables in X are not correlated. There are no issues of
multicollinearity. (Not satisfied)

Assumptions Check
Summary of the results:
From the plots, we can conclude that
Assumption 4 is not satisfied: the
variance of the error term is not constant
and depends on the values
assumed by X. From the previous scatter
plots, we can also conclude that
Assumption 2 is satisfied: the means of
the distributions of Y, given X,
lie on a straight line with slope b. Because
the data are not time-series data, we
can omit Assumption 5 in this case.

Assumptions Check
P-value = 0.0 < 0.05
So we can reject the hypothesis that the
residuals are normally distributed with
zero mean. Assumption 1 is therefore not
satisfied.
Assumption 3 is satisfied: the mean of
the residuals is 0, as it is for any
least-squares fit that includes an intercept.

Assumptions Check
P-value = 0.0 < 0.05
So we can reject the hypothesis that sqft_living and sqft_above are not
correlated. The two independent variables are correlated, so Assumption 6 is not satisfied.

Further Research
Because some assumptions are not satisfied, I think we need to
consider other data analysis tools, such as generalized regression,
in future research. More comprehensive research can provide a better
reference for marketing decisions and make future marketing strategies
more targeted and effective. In addition, if we have enough budget, we can
test more factors that may influence the price to improve the accuracy of
our research.

By Huan Yang
Capstone Project Milestone 3: Hypothesis Testing

Summary of the data sources:
The first dataset relates to the direct marketing campaigns of a Portuguese banking institution. The
second dataset relates to the economy of many countries before and after COVID-19.
The hypotheses chosen and the results:
For the first dataset, the null hypothesis is that the mean balance is equal between those who have a loan
and those who do not. The result is that the mean balance differs between those who have a
loan and those who do not.
For the second dataset, I test whether the same country's GDP is correlated in 2018Q2 and
2020Q2. Because the p-value is below 0.05, the null hypothesis of no correlation is rejected: the same
country's GDP is correlated in 2018Q2 and 2020Q2. I also test the null hypothesis that the mean GDP in
2018Q2 equals the mean GDP in 2020Q2. The result is that the mean difference
between 2018Q2 and 2020Q2 is significantly different from 0.
The URL to my GitHub repo:
https://github.com/HuanJoyce/NYU_Integrated_Marketing
The URLs to the data sources:
https://data.world/data-society/bank-marketing-data
https://stats.oecd.org/index.aspx?queryid=33940#

Two-Sample T-test
Why this test:
Because the data are metric and come from two
unpaired groups, we use a two-sample t-test.
Hypothesis:
The mean balance is equal between those
who have a loan and those who do not.
Conclusion:
p-value = 2.764056e-137 < 0.05
We can reject the null hypothesis that the
mean balance is equal between those who
have a house loan and those who do not
at the 0.05 (or even 0.001) significance
level.

Test for multivariate normality
Why this test:
We need this test to choose between the Spearman
correlation and the Pearson correlation.
Hypothesis:
The data are multivariate normal.
Conclusion:
p-value = 0.1093 > 0.05
We cannot reject the null hypothesis that the
data are multivariate normal
at the 0.05 (or even 0.001) significance level.

Pearson correlation
Why this test:
Because the data have no outliers and are normally
distributed, and we want to test the correlation
of GDP in 2018Q2 and 2020Q2, we use the
Pearson correlation.
Hypothesis:
The null hypothesis is that the same country's GDP
in 2018Q2 and 2020Q2 is not correlated.
Conclusion:
p-value = 0.0 < 0.05
We can reject the null hypothesis that the same
country's GDP in 2018Q2 and 2020Q2 is not
correlated at the 0.05 (or even 0.001) significance
level. It means there is a significant correlation between
GDP in 2018Q2 and 2020Q2.
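As a rough illustration of what the Pearson test computes, the sketch below calculates the correlation coefficient r and the t-statistic used to test the null hypothesis of zero correlation. The numbers are made-up placeholders, not OECD GDP figures, and the function names are hypothetical; the project notebook presumably relies on a statistics library instead.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t_statistic(r, n):
    """t-statistic (n - 2 degrees of freedom) for H0: the true correlation is 0."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Made-up placeholder values, one pair per "country" (NOT real GDP data).
gdp_2018q2 = [1.0, 2.0, 3.0, 4.0, 5.0]
gdp_2020q2 = [1.1, 1.9, 3.2, 3.8, 5.0]

r = pearson_r(gdp_2018q2, gdp_2020q2)
t_stat = correlation_t_statistic(r, len(gdp_2018q2))
print(f"r = {r:.4f}, t = {t_stat:.2f}")
```

A large absolute t-statistic gives a small p-value, and a small p-value is evidence against "no correlation", which is why p < 0.05 leads to the conclusion that the two quarters are correlated.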

Paired T-test
Why this test:
Because the data are metric and we want
to compare the mean of the same group
in two different periods.
Hypothesis:
The mean GDP in 2018Q2 is equal to the mean GDP
in 2020Q2.
Conclusion:
p-value = 0.0 < 0.05
We can reject the null hypothesis that the
mean GDP in 2018Q2 equals the mean GDP in
2020Q2 at the 0.05 (or even 0.001) significance
level. The mean difference between 2018Q2 and
2020Q2 is significantly different from 0.
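The paired t-test works on the per-country differences between the two periods. A minimal sketch, using only the Python standard library and made-up placeholder numbers (not the OECD data), is shown here; `paired_t` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Return the paired t-statistic and its degrees of freedom."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    # Mean difference divided by its standard error (sample std / sqrt(n)).
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Made-up placeholder measurements for the same four units in two periods.
before = [10.0, 12.0, 9.0, 11.0]
after = [8.0, 11.0, 7.0, 10.0]

t_stat, dof = paired_t(before, after)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The p-value then comes from comparing the statistic with a t-distribution with n - 1 degrees of freedom, which a statistics library would normally handle.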

Power Analysis
Conclusion: For a Cohen's d effect size of 0.2, a power of 0.80, and a type I error rate of 0.05, we need a sample size of
393 in each group.
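The sample size can be sanity-checked with the common normal-approximation formula for a two-sided, two-sample comparison, n = 2((z_{1-α/2} + z_{power}) / d)^2 per group. This is a sketch under that approximation, not necessarily the routine used in the notebook; an exact t-distribution calculation can differ by about one observation. `n_per_group` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-sample test (normal approximation)."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n = n_per_group(0.2, alpha=0.05, power=0.80)
print(n)
```

Under these assumptions the formula agrees with the 393 per group reported on the slide.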

The limitations of the current result
There are some limitations to the current result. The data on the direct marketing campaigns of a
Portuguese banking institution were collected only once. Only the data after the campaign were
collected; the data before the campaign were not. So we cannot use a paired t-test to
analyze the campaign's effectiveness. This limits our ability to make sound business
decisions, because we cannot know whether the campaign was really effective.
Plan of future research
Given these limitations, I need to do further research and collect new data for a more comprehensive analysis. If
we want to analyze the effectiveness of the campaign more clearly by comparison, while ensuring
a small Type II error and keeping Cohen's d = 0.2, the new sample size will need to be 393, and the data will need to be
collected twice, before and after the campaign.

By Huan Yang
Capstone Project Milestone 4: Regression

Executive Summary
• Summary of the data sources, the regression model chosen, and the result
This dataset is about predicting whether a customer will change telecommunications provider, known
as "churning". The training dataset contains 4250 samples. Each sample contains 19 features and 1 boolean
variable, "churn", which indicates the class of the sample.
I chose a linear regression model. The regression equation is Yi = β0 + β1Xi + β2Xii + ei. The result is that total night
minutes and total day minutes are not correlated, and total night minutes and total evening minutes are not
correlated.
• The URL to my GitHub repo that stores the Jupyter Notebook associated with this project:
https://github.com/HuanJoyce/NYU_Integrated_Marketing/blob/main/“Huan_Yang_Predicting_Customer_Churn_ipynb”.ipynb
• The URL to the data source:
https://www.kaggle.com/c/customer-churn-prediction-2020

Scatter plots
Summary of the plots: From the plots, we conclude that the relationship between total day minutes and total
night minutes is linear, and that the relationship between total evening minutes and total night
minutes is linear.

Regression Results
Yi = β0 + β1Xi + β2Xii + ei
H0: β1 = 0, β2 = 0
Xi = total day minutes
Xii = total eve minutes
Yi = total night minutes
Summary of the results:
The p-values for both Xi and Xii are above 0.05,
so we cannot reject the
hypothesis that total night
minutes and total day minutes
are not correlated. We also
cannot reject the
hypothesis that total night
minutes and total evening
minutes are not correlated.

Insights gained from the regression
From the regression, we find no evidence that total night minutes and total
day minutes are correlated, or that total night minutes and total evening
minutes are correlated. So for marketing decisions, we do not
need to worry that changes in total day minutes or total evening
minutes will influence total night minutes. We can therefore run promotions,
such as packages of day and evening minutes, to encourage customers
to make longer calls and to keep them from changing
providers.

Assumptions Check
Assumptions
1. The error term is normally distributed. For each fixed value of X, the
distribution of Y is normal.
2. The means of all these normal distributions of Y, given X, lie on a straight
line with slope b.
3. The mean of the error term is 0.
4. The variance of the error term is constant. This variance does not depend
on the values assumed by X.
5. The error terms are uncorrelated. In other words, the observations have
been drawn independently.
6. The independent variables in X are not correlated. There are no issues of
multicollinearity.

Assumptions Check
From the plots of residuals against
predictions, we can conclude that
Assumption 4 is
satisfied: the variance of the error term
is constant and does not
depend on the values assumed by X.
From the previous scatter plots, we can
also conclude that Assumption 2 is
satisfied: the means of the
distributions of Y, given X, lie on a straight
line with slope b. Because the data are not
time-series data, we can omit
Assumption 5 in this case.

Assumptions Check
P-value = 0.77318 > 0.05
So we cannot reject the hypothesis that
the residuals are normally distributed
with zero mean. Assumption 1 and
Assumption 3 are therefore treated as satisfied: the error
term is normally distributed (for each
fixed value of X, the distribution of Y is
normal), and the mean of the error term is 0.

Assumptions Check
P-value = 0.388 > 0.05
So we cannot reject the hypothesis that total day minutes and total eve minutes
are not correlated. Assumption 6 is therefore treated as satisfied: the independent variables in X
are not correlated, and there are no issues of multicollinearity.

Further Research
Although our existing data include 19 features, I think we also need to
consider more influencing factors in future research, such as customer
age and income. More comprehensive research can provide a
better reference for marketing decisions and make future marketing
strategies more targeted and effective. In addition, if we have enough
budget, we can test a larger sample to improve the accuracy of
our research.

By Huan Yang
Capstone Project Milestone 5: Customer Segmentation

Executive Summary
• Summary of the data sources, the statistical method chosen, and the result
This is a transnational dataset that contains all the transactions occurring between 01/12/2010 and
09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion
gifts. Many customers of the company are wholesalers. We use the online retail transactional
dataset to build an RFM clustering and choose the best set of customers for the company to target.
I chose K-Means Clustering and Hierarchical Clustering. The result is that the best K is 3. K-Means
Clustering returns 57 target customers, while Hierarchical Clustering returns 2 target customers, which is a
much smaller group than the one that K-Means Clustering returns.
• The URL to my Kaggle Notebook:
https://www.kaggle.com/huanjoyceyang/customer-segementation-hy2208
• The URL to the data source:
https://www.kaggle.com/hellbuoy/online-retail-customer-clustering

Background
This analysis is based on the data for France.

K-Means Clustering
When metric = “distortion”: computes the sum
of squared distances from each point to its
assigned center

K-Means Clustering
When metric = “silhouette”: calculates the mean
Silhouette Coefficient of all samples.

K-Means Clustering
When metric = “calinski_harabasz”:
computes the ratio of dispersion between and
within clusters

K-Means Clustering
Because we cannot get a best K when we set metric = “calinski_harabasz”,
we need to find the best K from the second run, which sets metric = “silhouette”.
So, the result is that the best K is 3.

K-Means Clustering
By the RFM criteria, we should choose the customer clusters with a lower
recency and a higher frequency and amount. From the K-means clustering results,
we can see that customers with Cluster_Id=2 best fit the criteria.
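To make the RFM criteria concrete, the sketch below derives recency, frequency, and amount per customer from a handful of transactions. The transactions, customer IDs, and the `rfm_table` helper are hypothetical illustrations, not rows from the online retail dataset; the snapshot date is taken from the end of the period stated in the summary (09/12/2011, read as 9 December 2011), and frequency here simply counts transaction rows.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, invoice_date, amount).
transactions = [
    ("C1", date(2011, 12, 1), 120.0),
    ("C1", date(2011, 11, 15), 80.0),
    ("C2", date(2011, 6, 3), 40.0),
    ("C3", date(2011, 12, 8), 300.0),
    ("C3", date(2011, 10, 20), 150.0),
    ("C3", date(2011, 9, 5), 50.0),
]

def rfm_table(rows, snapshot_date):
    """Map each customer to (recency in days, frequency, total amount)."""
    last_purchase = {}
    frequency = defaultdict(int)
    amount = defaultdict(float)
    for customer, day, value in rows:
        last_purchase[customer] = max(last_purchase.get(customer, day), day)
        frequency[customer] += 1
        amount[customer] += value
    return {
        customer: ((snapshot_date - last_purchase[customer]).days,
                   frequency[customer],
                   amount[customer])
        for customer in last_purchase
    }

rfm = rfm_table(transactions, snapshot_date=date(2011, 12, 9))
for customer, (recency, freq, total) in sorted(rfm.items()):
    print(customer, recency, freq, total)
```

In this toy example the hypothetical customer C3 has the lowest recency and the highest frequency and amount, which is the profile the RFM criteria favor when picking a target cluster.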

K-Means Clustering
We can see that K-Means Clustering
returns 57 target customers.

Hierarchical Clustering
Visualize Tree by Three Linkage Methods

Perform Clustering by choosing best k=3

By the RFM criteria, we should choose the customer clusters with a lower recency and a higher
frequency and amount. From the Hierarchical clustering results, we can see that customers with
Cluster_Labels=2 have a higher frequency and customers with Cluster_Labels=1 have a higher
amount. So the choice of target customers depends on our company's priorities.

When our company pays more attention to
frequency, we choose Cluster_Id=2 to fit the
criteria.
We can see that Hierarchical Clustering
returns 2 target customers, which is a
much smaller group than the one that
K-Means Clustering returns.

When our company pays more attention to
amount, we choose Cluster_Id=1 to fit the
criteria.
We can see that Hierarchical Clustering still
returns 2 target customers, which is a
much smaller group than the one that
K-Means Clustering returns.
