2. 2
Self-introduction
URL of Kaggle Notebook
https://www.kaggle.com/yanlewang/customer-segementation-yw5178
Github URL:
https://colab.research.google.com/github/Yanle57/NYU_Integrated_Marketing
Linkedin:
www.linkedin.com/in/yw-wang-04a3231b7
Yanle (Mike) Wang received his bachelor’s degree in
management from Shanghai University of International
Business and Economics. He has a great passion for marketing
and enjoys creating value for customers. He has interned at
Hitachi (China) and Nike, learning different market-oriented
strategies to enhance brand value. He is also interested in
volunteer work as a way to meet different kinds of people and help
them. He sees this as a way to learn the needs and pain points of
potential customers. He has won a prize for “Internet Plus”
innovation and entrepreneurship for a project on online tutoring in
impoverished areas.
3. 3
Course Summary
I have learned a great deal of theory and many data modeling methods from the course. In the
theoretical part, the methods and logic of many investigations have given me a deeper
understanding of what market research should do. The three platforms used in the class
also allowed me to experience the entire process from finding data to final analysis and
modeling.
Through this course, I discovered that LinkedIn is also a very good platform for
letting others get to know you. I am becoming more and more interested in research and
investigation, and I want to learn more about programming in the future to make better
use of these platforms and software. I am looking for an internship in data marketing to
prepare for my future career.
4. 4
Market research report-Session 1
Data Set
All annual data for the variables from 1973 to 2016 were collected from the National Bureau of
Statistics.
http://www.stats.gov.cn (Some missing data are estimated by Yanle Wang)
Y: The life expectancy at birth of the population
Life expectancy at birth is the most commonly used life expectancy indicator. It shows the average
number of years a newborn is expected to live and is an important indicator of the
health of the population.
X1: College enrollment rate (% of total population)
The college enrollment rate shows the education level of a country at the macro level. We expect
that an increase in the college enrollment rate will lead to an increase in life expectancy.
However, we do not think this relationship is purely linear.
X2: GDP (taking 1973 as the fixed base index)
GDP refers to the total value of all final products and services produced by a country in a given
period of time, and it is often considered an indicator of a country’s economic state. This paper
assumes that the economic situation reflects people’s consumption level in a period of time, which
affects how much people spend on all life-related behaviour. This paper predicts that
GDP has a significantly positive impact on Y. In this paper, we set 1973 as the base year.
X3: Hospital bed number (per thousand people)
The paper assumes that the number of hospital beds per thousand people is an
important indicator of the medical resources available to each person. In recent years, with the
continuous improvement of medical care, more diseases can be cured. However, from a macro
perspective, it is the medical resources available per person that contribute to life expectancy. We
also expect this relationship to be positive.
5. 5
Market research report-Session 2
Research Abstract
The Chinese population has entered a period of healthy post-transformation: the level of death
has shown a steady decline, and the life expectancy of the population has continued to increase.
Although it is difficult to predict how long a particular person will live, it is feasible to calculate
the life expectancy at birth of a population by a scientific method.
It is of great practical significance to study life expectancy in the country, and this paper
focuses on which factors influence the life expectancy at birth of the population from a
macro perspective. The research uses linear regression to study the relationship between life
expectancy at birth of the population and other variables.
Google Datastudio Link for Public:
https://datastudio.google.com/reporting/9adf893b-7baa-45df-aa28-46662f9dd349
6. 6
Market research report-Session 3
Research Executive Summary
Yw5178
Yanle Wang
All annual data for the variables from 1973 to 2016 were collected from the National
Bureau of Statistics.
http://www.stats.gov.cn (Some missing data are estimated by Yanle Wang)
[The GDP ($) takes 1973 as Fixed Base Index.]
It is of great practical significance to study life expectancy in the country, and
this project focuses on which factors influence life expectancy
at birth.
In this project, we use a linear regression model to check the relationship
between College enrollment rate (% of total population) (X1), GDP (X2),
Hospital bed number (per thousand people) (X3) and Life expectancy at birth of
the population (Y). The assumed model and H0 are listed below.
Y=B1X1+B2X2+B3X3+B0+e ---[H0: B1=0, B2=0, B3=0. (α=0.05)]
The results showed that (X1), (X2), (X3) and (Y) are significantly related. However,
the model needs further revision after the assumption check.
Github URL:
https://colab.research.google.com/github/Yanle57/NYU_Integrated_Marketing
7. 7
Conduct the Analysis
Scatterplot 1
For X1: College enrollment rate (% of total population) and Y: Life
expectancy at birth of the population
The scatterplot shows a positive linear relationship.
College enrollment rate might have a positive effect on life
expectancy at birth of the population.
However, it is also possible that the relationship between them is not
linear but quadratic (Y increasing with the square of X1).
8. 8
Conduct the Analysis
Scatterplot 2
For X2: GDP and Y: Life expectancy at birth of the population
The scatterplot does not show a clear relationship.
It seems that when GDP is above 1.5E+11, it might have a
positive effect on life expectancy at birth of the population.
9. 9
Conduct the Analysis
Scatterplot 3
For X3: Hospital bed number (per thousand people) and Y: Life
expectancy at birth of the population
The scatterplot shows a positive linear relationship.
Providing more hospital beds might have a positive effect on life
expectancy at birth of the population.
10. 10
Conduct the Analysis
Summary
•I chose this multiple regression to see the relationship between X1, X2, X3 and Y.
Y=B1X1+B2X2+B3X3+B0+e ---[H0: B1=0, B2=0, B3=0. (α=0.05)]
X1:
•P-Value = 0.00 < 0.05 (type I error rate), so we reject H0: B1=0. B1=0.3212
For each one-percentage-point increase in College enrollment rate, the life expectancy
at birth of the population increases by 0.3212 years, with the other variables held constant.
X2:
•P-Value = 0.00 < 0.05 (type I error rate), so we reject H0: B2=0. B2= -4.106e-11
For each one-unit increase in total GDP, the life expectancy at birth of the population
decreases by 4.106e-11 years, with the other variables held constant.
X3:
•P-Value = 0.00 < 0.05 (type I error rate), so we reject H0: B3=0. B3=2.2161
For each increase of one hospital bed per thousand people, the life
expectancy at birth of the population increases by 2.2161 years, with the other variables held constant.
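The multiple regression summarized above can be sketched in outline with ordinary least squares. The block below is a minimal pure-Python version that solves the normal equations; the function name `ols_fit` and the toy numbers are illustrative assumptions, not the notebook's code or the 1973 to 2016 data.

```python
def ols_fit(rows, y):
    """Return [B0, B1, ..., Bk] minimising squared error for y ~ B0 + sum(Bi * Xi)."""
    # Design matrix with a leading column of ones for the intercept B0.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    n, p = len(X), len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * c for a, c in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Toy data generated exactly from Y = 60 + 0.3*X1 + 2.0*X3 (hypothetical values).
rows = [(1, 1.0), (2, 1.5), (4, 1.2), (7, 2.0), (9, 2.6), (12, 2.4)]
y = [60 + 0.3 * x1 + 2.0 * x3 for x1, x3 in rows]
b0, b1, b3 = ols_fit(rows, y)
print(round(b0, 4), round(b1, 4), round(b3, 4))
```

Each fitted coefficient is read the same way as on the slide: the change in Y for a one-unit change in that X, with the other predictors held constant. P-values are not computed here; a statistics library would be needed for those.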
12. 12
Assumptions Check and Further Research
After assumption check, 2 assumptions are satisfied and 4 are not satisfied.
1. Because p=0.00, we reject the H0 that each pair of predictors (X1 and X2, X2 and X3,
X1 and X3) is not correlated, so the independent variables are correlated. (Not Satisfied)
2. As the predictions increase, the spread of the residuals in the residual plot does not stay
parallel, which shows the variance of the error term is not constant. (Not Satisfied)
3. The mean of the error term is close to 0. (Satisfied)
4. The observations were not drawn independently, which indicates that the error
term is not normally distributed. (Not Satisfied)
5. Error terms are uncorrelated. (Satisfied)
6. Scatter plot (Not Satisfied)
The means of Y, given X1, do not lie on a straight line with slope B1.
The means of Y, given X2, do not lie on a straight line with slope B2.
The means of Y, given X3, lie on a straight line with slope B3.
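Two of the checks listed above (mean of the error term, and whether error terms are uncorrelated) can be computed directly from the residuals. The sketch below uses the Durbin-Watson statistic as one common way to look for first-order autocorrelation; this choice of statistic, the function name `residual_checks`, and the numbers are assumptions for illustration, not the method used in the notebook.

```python
def residual_checks(y, y_hat):
    """Return (mean residual, Durbin-Watson statistic) for fitted values y_hat."""
    e = [a - b for a, b in zip(y, y_hat)]
    mean_e = sum(e) / len(e)
    # Durbin-Watson: values near 2 suggest no first-order autocorrelation,
    # values near 0 positive autocorrelation, values near 4 negative autocorrelation.
    dw = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e))) / sum(v * v for v in e)
    return mean_e, dw

# Hypothetical observed and fitted values, for illustration only.
y = [66.0, 66.8, 67.1, 68.0, 68.4, 69.5]
y_hat = [66.2, 66.6, 67.3, 67.8, 68.6, 69.3]
mean_e, dw = residual_checks(y, y_hat)
print(round(mean_e, 6), round(dw, 3))
```

In this toy case the residuals alternate in sign, so the mean is essentially zero while the Durbin-Watson value sits well above 2.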
13. 13
Conclusion and Suggestion:
Our model needs further improvement. A purely linear regression might not be a
good fit for this data, especially for X1.
From a marketing manager’s point of view, the model result does not seem reasonable.
1. GDP growth should have a positive effect on Y, but the estimated coefficient is negative.
This might be caused by the correlation among the independent variables. Future research
should avoid using such correlated variable combinations.
2. It is also possible that the relationship between College enrollment rate and Life
expectancy at birth of the population is not linear but quadratic. A linear
regression of Y on X1^2 can be run to test this.
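The second suggestion above can be tried by regressing Y on X1 and on X1^2 separately and comparing the fit. A minimal sketch, assuming made-up data in which Y really does grow with the square of X1; `r_squared` is a hypothetical helper name.

```python
from math import sqrt

def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy / sqrt(sxx * syy)) ** 2

# Hypothetical data where y grows with the square of x1.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
y = [65 + 0.1 * v ** 2 for v in x1]
r2_linear = r_squared(x1, y)                      # fit of Y on X1
r2_square = r_squared([v ** 2 for v in x1], y)    # fit of Y on X1^2
print(r2_linear < r2_square)
```

If the squared term gives a clearly higher R^2 on the real data, that would support replacing X1 with X1^2 in the model.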
15. 15
Capstone Project Milestone 3: Hypothesis Testing
Executive Summary
Yw5178
Yanle Wang
Data Sources:
Direct Marketing campaigns data (phone calls) of a Portuguese banking institution:
https://data.world/data-society/bank-marketing-data
Economy data before and after Covid 19:
https://stats.oecd.org/index.aspx?queryid=33940#
Hypothesis Test-Parametric Test: T-Test, Paired T-Test and Group T-Test
Assuming the null hypothesis is true in the population, the type I error rate is set to 0.05.
Github URL:
https://colab.research.google.com/github/Yanle57/NYU_Integrated_Marketing
16. 16
Hypothesis Test 1
•I chose this test to see whether the data population is normally distributed or
not.
•H0: The data population is normally distributed.
•Since the result is Normal=False, I chose the Spearman correlation.
•For alpha=0.05, P-Value=0.005417 < 0.05 (type I error rate), so we reject H0.
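When normality is rejected, as above, the Spearman correlation replaces the raw values with their ranks and then applies the usual Pearson formula. A minimal pure-Python sketch with made-up numbers; the helper names `ranks` and `spearman` are assumptions, and no p-value is computed here.

```python
from math import sqrt

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))

# A monotonic but non-linear relationship (hypothetical values).
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
rho = spearman(x, y)
print(rho)
```

Because only the ordering matters, any strictly increasing relationship gives a Spearman correlation of 1 even when it is far from linear.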
17. 17
Hypothesis Test 2
•I chose this Paired T-Test to see whether the data population differs significantly
between time periods.
•H0: The data population in different time periods has no significant difference.
•H1: The data population in different time periods has a significant difference.
•For power=0.09, Cohen's d=0.1, P-Value=0.0 < 0.05 (type I error rate), so we reject H0.
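The paired test above works on the per-observation differences between the two time periods. The sketch below computes the t statistic and one common paired-data version of Cohen's d (mean difference divided by the standard deviation of the differences); that definition of d, the function name `paired_t`, and the sample values are assumptions, not the notebook's output.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Return (t statistic, Cohen's d) for paired samples."""
    diffs = [a - b for a, b in zip(after, before)]
    d = mean(diffs) / stdev(diffs)   # effect size on the differences
    t = d * sqrt(len(diffs))         # same as mean / (sd / sqrt(n))
    return t, d

# Hypothetical measurements of the same units in two periods.
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [11.0, 14.0, 9.0, 13.0, 15.0]
t_stat, d = paired_t(before, after)
print(round(t_stat, 3), round(d, 3))
```

The t statistic is then compared against a t distribution with n - 1 degrees of freedom to obtain the P-value.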
18. 18
Hypothesis Test 3
•I chose this Group T-Test to see whether the two data populations differ
significantly.
•H0: The two data populations do not have a significant difference.
•H1: The two data populations have a significant difference.
•For power=1, Cohen's d=0.23077, P-Value=2.76e-137 < 0.05 (type I error rate), so we reject H0.
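For two independent groups, as in the test above, the t statistic and Cohen's d can both be built from the pooled standard deviation. A minimal sketch assuming equal variances in the two groups; `group_t` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def group_t(a, b):
    """Return (t statistic, Cohen's d) using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    sp = sqrt(pooled_var)
    d = (mean(a) - mean(b)) / sp          # standardized mean difference
    t = d / sqrt(1 / na + 1 / nb)         # t with na + nb - 2 degrees of freedom
    return t, d

# Hypothetical samples from two groups.
group_a = [5.0, 6.0, 7.0, 8.0, 9.0]
group_b = [3.0, 4.0, 5.0, 6.0, 7.0]
t_stat, d = group_t(group_a, group_b)
print(round(t_stat, 3), round(d, 3))
```

With a very large sample, even a small d such as the 0.23 reported on the slide can produce an extremely small P-value, which is why the effect size is worth reporting alongside it.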
19. 19
Power Analysis and Final Remarks
Conclusion and limitation: For a Cohen's d effect size of 0.15, a power of 0.80, and a
type I error of 0.05, we need a sample size of 699 (for each group), which is
useful information for business decisions.
If my boss told me that the revenue of the product would increase by 10%
rather than 15% of the standard deviation, the required sample size would increase to
1571.
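The sample sizes quoted above can be approximated with the standard normal-approximation formula n = 2 * ((z_(1-α/2) + z_power) / d)^2 per group. The sketch below is an approximation under that assumption; the t-based calculation behind the reported figures gives slightly larger values, so the results here land within a unit or two of 699 and 1571. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample test."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

n_15 = n_per_group(0.15)   # effect size of 15% of a standard deviation
n_10 = n_per_group(0.10)   # effect size of 10% of a standard deviation
print(n_15, n_10)
```

Because the effect size enters the formula squared, shrinking d from 0.15 to 0.10 multiplies the required sample by about 2.25.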
20. 20
Capstone Project Milestone 4: Regression
Executive Summary
Yw5178
Yanle Wang
The data source is the Customer Churn Prediction 2020 data set. It is used to predict
whether a customer will change telco provider.
In this project, we use a linear regression model to check the relationship
between Total_day_minutes (X1), Total_day_calls (X2) and Total_day_charge
(Y). The assumed model and H0 are listed below.
Y=B1X1+B2X2+B0+e
H0: B1=0, B2=0. (α=0.05)
The results showed that Total_day_minutes (X1) and Total_day_charge (Y)
have a strong positive relationship. Total_day_calls (X2) and Total_day_charge
(Y) are not significantly related.
Data Sources: https://www.kaggle.com/c/customer-churn-prediction-2020
Github URL:
https://colab.research.google.com/github/Yanle57/NYU_Integrated_Marketing
21. 21
Conduct the Analysis
Scatterplot 1
For X1: Total_day_minutes and Y: Total_day_charge
The scatterplot shows a positive linear relationship.
Total_day_minutes might have a positive effect on
Total_day_charge.
22. 22
Conduct the Analysis
Scatterplot 2
For X2: Total_day_calls and Y: Total_day_charge
The scatterplot does not show a clear relationship.
It is hard to predict the relation between Total_day_calls and
Total_day_charge at this stage.
23. 23
Conduct the Analysis
Summary
•I chose this multiple regression to see the relationship between X1, X2 and Y.
Y=B1X1+B2X2+B0+e
H0: B1=0, B2=0. (α=0.05)
X1:
•P-Value = 0.00 < 0.05 (type I error rate), so we reject H0: B1=0. B1=0.170
For each additional total daytime minute, the total daytime charge
increases by 0.17. (0.17 dollar/minute)
X2:
•P-Value = 0.395 > 0.05, so we do not reject H0: B2=0.
For each additional total daytime call, there is no significant increase in the total
daytime charge.
25. 25
Assumptions Check and Further Research
After assumption check, 4 assumptions are satisfied and 2 are not satisfied.
1. Because p=0.961, we do not reject the H0: X1 and X2 are not correlated. (Satisfied)
2. As the predictions increase, the residuals in the residual plot keep a parallel spread, which
shows the variance of the error term is constant. (Satisfied)
3. The mean of the error term is very close to 0. (Satisfied)
4. The observations were not drawn independently, which indicates that the error
term is not normally distributed. (Not Satisfied)
5. Error terms are uncorrelated. (No time series data). (Satisfied)
6. The means of Y, given X1, lie on a straight line with slope B1.
The means of Y, given X2, do not lie on a straight line with slope B2. (Not Satisfied)
Suggestion: Our model needs further improvement. Linear regression might
not be a good fit for this data. From a marketing manager’s point of view, the model result seems
reasonable: Total_day_minutes might have a positive effect on
Total_day_charge. However, more tests, such as the relationship between night_minutes
and night_charge or eve_minutes and eve_charge, should be done to judge whether
total minutes and total charge have a positive and valid relationship.
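The follow-up tests suggested above amount to repeating a simple regression of charge on minutes for each time of day. A minimal sketch, assuming column names of the form `total_<period>_minutes` and `total_<period>_charge` and using made-up rows with made-up per-minute rates; none of these values come from the data set.

```python
def slope_intercept(x, y):
    """Closed-form simple regression of y on x: returns (B1, B0)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return b1, my - b1 * mx

# Hypothetical rows; column names and rates are assumptions for illustration.
data = {
    "total_eve_minutes": [100.0, 150.0, 200.0, 250.0],
    "total_eve_charge": [8.5, 12.75, 17.0, 21.25],
    "total_night_minutes": [120.0, 180.0, 240.0, 300.0],
    "total_night_charge": [5.4, 8.1, 10.8, 13.5],
}
rates = {}
for period in ("eve", "night"):
    b1, b0 = slope_intercept(data[f"total_{period}_minutes"],
                             data[f"total_{period}_charge"])
    rates[period] = b1   # estimated charge per minute for this period
print({k: round(v, 4) for k, v in rates.items()})
```

If each period shows a stable positive slope, that supports the view that total minutes and total charge are positively related overall.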
26. 26
Capstone Project Milestone 5: Clustering
Executive Summary
Yw5178
Yanle Wang
• Online retail customer clustering
https://www.kaggle.com/hellbuoy/online-retail-customer-clustering
Online retail is a transnational data set which contains all the transactions
occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered
non-store online retailer. The company mainly sells unique all-occasion gifts.
Many customers of the company are wholesalers.
The goal of customer segmentation: find the target customers, based on RFM
(Recency, Frequency, and Monetary).
• URL of Kaggle Notebook
https://www.kaggle.com/yanlewang/customer-segementation-yw5178
I chose the German retail customers’ data for the cluster analysis. I used the
K-Means Clustering and Hierarchical Clustering methods to find K and
the target customers based on RFM.
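Before clustering, each customer's transactions are reduced to the three RFM values described above. The sketch below shows one way to build that table; the tuple layout, the snapshot date, the use of distinct invoices for frequency, and the name `rfm_table` are assumptions for illustration rather than the notebook's exact steps.

```python
from datetime import date

def rfm_table(transactions, snapshot):
    """Map customer id -> (recency in days, frequency, monetary)."""
    last, invoices, spend = {}, {}, {}
    for cust, invoice, day, amount in transactions:
        last[cust] = max(last.get(cust, day), day)        # most recent purchase
        invoices.setdefault(cust, set()).add(invoice)     # distinct invoices
        spend[cust] = spend.get(cust, 0.0) + amount       # total amount spent
    return {c: ((snapshot - last[c]).days, len(invoices[c]), round(spend[c], 2))
            for c in last}

# Hypothetical transactions: (customer id, invoice no, date, amount).
tx = [
    ("C1", "A1", date(2011, 11, 1), 50.0),
    ("C1", "A2", date(2011, 12, 5), 30.0),
    ("C2", "B1", date(2011, 6, 10), 200.0),
]
rfm = rfm_table(tx, snapshot=date(2011, 12, 9))
print(rfm)
```

The three columns usually sit on very different scales, so they are typically standardized before being passed to a clustering method.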
27. 27
K-Means Clustering
I used the “distortion”, “silhouette”, and “calinski_harabasz” metrics to choose K.
When metric= “distortion”, K=4.
When metric= “silhouette”, I cannot find K.
When metric= “calinski_harabasz”, K=4.
Finally, I chose K=4 as the best value (calinski_harabasz).
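The "distortion" metric above is commonly defined as the sum of squared distances from each point to its assigned centroid, computed for a range of K values to look for an elbow. The sketch below is a small pure-Python K-Means with a deterministic start, run on made-up points; it is an illustration under those assumptions, not the library call used in the notebook.

```python
def kmeans(points, k, iters=50):
    """Lloyd's algorithm, starting from the first k points as centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    distortion = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
                     for p, lab in zip(points, labels))
    return labels, distortion

# Hypothetical scaled points forming two obvious groups.
pts = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
curve = {k: kmeans(pts, k)[1] for k in range(1, 5)}
print({k: round(v, 3) for k, v in curve.items()})
```

The K where the distortion stops dropping sharply is the elbow; in this toy case that happens at K=2, whereas the slide reports K=4 for the retail data.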
28. 28
By the RFM criteria, we should choose the customer clusters with a lower
recency and a higher frequency and amount.
From the K-means clustering results, we can see that customers with
Cluster_Id=2 best fit the criteria.
K-Means Clustering returns 3 target customers.
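The selection rule above (lower recency, higher frequency and amount) can be applied to the cluster averages. The sketch below ranks clusters on each criterion and picks the one with the best combined rank; the rank-sum rule, the helper names, and the sample rows are assumptions for illustration, since the slide reads the result off the cluster plots.

```python
def cluster_means(rfm_rows, labels):
    """Average (recency, frequency, monetary) per cluster label."""
    groups = {}
    for row, lab in zip(rfm_rows, labels):
        groups.setdefault(lab, []).append(row)
    return {lab: tuple(sum(col) / len(rows) for col in zip(*rows))
            for lab, rows in groups.items()}

def best_cluster(means):
    """Pick the cluster with low recency and high frequency and monetary value."""
    def score(lab):
        # Sum of ranks; rank 0 is best on each criterion.
        r = sorted(means, key=lambda l: means[l][0]).index(lab)    # lower is better
        f = sorted(means, key=lambda l: -means[l][1]).index(lab)   # higher is better
        m = sorted(means, key=lambda l: -means[l][2]).index(lab)   # higher is better
        return r + f + m
    return min(means, key=score)

# Hypothetical (recency, frequency, monetary) rows and their cluster ids.
rows = [(5, 20, 9000.0), (8, 15, 7000.0), (200, 1, 100.0),
        (150, 2, 250.0), (60, 5, 900.0)]
labels = [2, 2, 0, 0, 1]
target = best_cluster(cluster_means(rows, labels))
print(target)
```

The customers carrying the selected label are then the target list for campaigns.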
30. 30
By the RFM criteria, we should choose the customer clusters with a lower
recency and a higher frequency and amount.
From the hierarchical clustering results, we can see that customers with
Cluster_Labels=2 best fit the criteria.
Hierarchical Clustering returns 8 target customers.
31. 31
Future plan
K-Means Clustering and Hierarchical Clustering return 3 and 8 target customers
respectively.
As a marketing manager, I would like to target the 3 customers from K-Means Clustering with
precision marketing plans and campaigns in the short term.
However, since all 8 target customers are good based on RFM (Recency,
Frequency, and Monetary), I will keep paying attention to them in the long term.