Final Presentation
STATISTICAL MEASUREMENTS, ANALYSIS & RESEARCH
Oliver Gong
Net ID: zg2088
Instructor: Dr. Luyao Zhang
Contents
01 Self-Introduction
02 Key Learnings
03 Market Research Report
04 Appendix
01 Self-Introduction
PART ONE
Self-Introduction
I am Ziyu Gong, and I also go by Oliver. I received my Bachelor’s degree in
Accounting from China University of Mining and Technology (Beijing).
In my second year of university, I interned in the Corporate Business
Department of Industrial and Commercial Bank of China (Shangrao),
supporting the account manager in loan marketing by visiting clients
to learn about their funding needs and collecting the documents
required for loan approval. That is where I first experienced and
came to understand marketing practice. In my third year, I interned in
the Investment Banking Department of Everbright Securities, conducting
corporate and financial due diligence for an IPO project; my
responsibilities included collecting and checking confirmation requests
and analyzing financial statements. As an aspiring person willing to
take on challenges, I have set my goal of devoting myself to the
marketing analysis industry.
B.S. in Accounting | China University of Mining and Technology (Beijing)
LinkedIn URL: https://www.linkedin.com/in/%E5%AD%90%E8%88%86-%E9%BE%9A-18b8101b7/
GitHub Repo Link: https://colab.research.google.com/github/OliverGong77/NYU_Integrated_Marketing
Kaggle Notebook Link: https://www.kaggle.com/olivergong77/customer-segementation-zg2088
02 Key Learnings
PART TWO
Key Learnings
The most important thing I learned in this course was how to use tools such as
Google Data Studio, GitHub and Kaggle to analyze data. In terms of
application, I learned to conduct hypothesis tests such as the t-test, ANOVA
and the chi-square test. I also learned to analyze correlations and run
linear regressions. In addition, I learned to apply k-means clustering and
hierarchical clustering to segment customers.
This class has played a large role in my professional growth. The professor
repeatedly stressed the importance of applying what is taught in class
to our future work. The tools taught in the classroom are relatively
advanced. Using them for data analysis in future work should improve
the efficiency and reliability of that work and make us more
competitive.
03 Market Research Report
PART THREE
Session 1: New Dataset
Supermarket XYZ has been operating since 2008, and its business flourished until 2016.
It has a large database but does not use it to reach better business decisions.
Its annual revenue has declined by 10%, and the decline appears to continue
every year.
Through its membership card, Supermarket XYZ collected basic information about
each customer: Customer ID, age, gender, annual income and spending score.
The spending score is a value assigned to each customer based on defined
parameters such as customer behavior and purchasing data.
Supermarket_CustomerMembers.csv: I used this dataset to analyze the consumer
age structure and spending score and to run the regression analysis.
Supermarket XYZ Customer data:
https://www.kaggle.com/sindraanthony9985/marketing-data-for-a-supermarket-in-united-states
XYZ Supermarket Consumer Age Structure and Spending Score Analysis Report
Session 2: Research Design and The Data
XYZ Supermarket, whose annual revenue has been
dropping by 10% a year since 2016, has a large
database and holds basic information about
200 customers: Customer ID, age, gender,
annual income and spending score. By
analyzing the age structure and spending
score of customers, I want to understand the
consumption level of customers of different
ages at XYZ Supermarket, so that marketing
can be targeted accordingly in the future.
According to the study, customers aged 30-35
have relatively high spending scores, while
those aged 19-29 have relatively low ones. In
the future, XYZ Supermarket should collect
more basic customer information to make the
results more reliable.
Google data studio report URL: https://datastudio.google.com/reporting/61b9ad02-8e3f-48c3-84f4-59a61a91ef07
• Supermarket XYZ Customer data:
https://www.kaggle.com/sindraanthony9985/marketing-data-for-a-supermarket-in-united-states
Through its membership card, Supermarket XYZ collected basic information about each
customer: Customer ID, age, gender, annual income and spending score.
• URL of GitHub report:
https://colab.research.google.com/github/OliverGong77/NYU_Integrated_Marketing
I ran a linear regression and found that age is a statistically significant predictor
of the spending score, while annual income is not.
Session 3: Regression
Conduct the Analysis --- Scatter Plot
Because I wanted to know whether there was a linear relationship between age and the
Spending Score, and between annual income and the Spending Score, I made scatter
plots of the data.
I found no visible linear relationship between these variables.
Conduct the Analysis --- Regression Result
Based on the previous step, I set age as the independent variable X1, annual income
as the independent variable X2, and the spending score as the dependent variable Y.
I then stated the null hypothesis and ran a linear regression.
Null hypothesis: β1=0 and β2=0
Result: The X1 P-value rounds to 0 (< 0.05): at the 0.05 significance level, we can reject
the null hypothesis that β1=0.
The X2 P-value = 0.931 > 0.05: at the 0.05 significance level, we cannot reject
the null hypothesis that β2=0.
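The regression step can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the data are synthetic, the variable names (age, income, score) and the helper ols_with_pvalues are hypothetical, and the fit uses ordinary least squares with a two-sided t-test on each coefficient.

```python
import numpy as np
from scipy import stats


def ols_with_pvalues(X, y):
    """Fit y = b0 + b1*x1 + b2*x2 + ... by ordinary least squares.

    Returns the coefficients (intercept first) and the two-sided
    p-value of the t-test for H0: coefficient = 0.
    """
    X = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates
    resid = y - X @ beta
    dof = n - k                                    # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # estimated error variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    t_stats = beta / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
    return beta, p_values


# Synthetic stand-in for the 200 membership records (assumption, not real data).
rng = np.random.default_rng(0)
age = rng.uniform(18, 70, 200)
income = rng.uniform(15, 140, 200)
score = 75 - 0.6 * age + rng.normal(0, 10, 200)    # depends on age only, by construction

beta, pvals = ols_with_pvalues(np.column_stack([age, income]), score)
for name, b, p in zip(["intercept", "age (X1)", "income (X2)"], beta, pvals):
    print(f"{name:12s} coef={b:8.3f}  p={p:.3f}")
```

A coefficient whose p-value falls below 0.05 leads to rejecting the null hypothesis that the coefficient is 0, which is the decision rule used on this slide.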
Conduct the Analysis --- Insights and making decisions
The linear regression analysis leads to two conclusions:
(1) Age is a statistically significant predictor of the spending score.
(2) Annual income is not a statistically significant predictor of the spending score.
Therefore, if we want to raise customers’ spending scores and generate
more revenue, we should find out which age groups have higher
consumption levels and market to each age group accordingly.
Assumptions Check
I then checked whether the six assumptions behind the regression are likely to be satisfied.
I found that five of them are satisfied and one is not.
Assumption 4: The variance of the error term is constant; it does not depend on
the values of X.
The scatter plot below suggests that assumption 4 is satisfied.
Assumptions Check
Assumption 2: The means of the normal distributions of Y, given X, lie on a straight line
with slope b.
The scatter plot on page 11 shows that assumption 2 is not satisfied.
Assumptions 1 & 3: The error term is normally distributed, so for each fixed value of X the
distribution of Y is normal. The mean of the error term is 0.
The diagram below suggests that assumptions 1 and 3 are satisfied.
Assumption 5: The error terms are uncorrelated. In other words, the
observations have been drawn independently.
Because our data is not time-series data, assumption 5 is satisfied.
Assumption 6: The independent variables in X are not correlated, so there is no
multicollinearity issue.
The P-value = 0.781 > 0.05, so at the 0.05 significance level we cannot
reject the null hypothesis that the independent variables in X are not
correlated. Assumption 6 is therefore satisfied.
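The check for assumption 6 can be sketched as a correlation test between the two predictors. The slides report only a P-value and do not name the test, so the use of Pearson's correlation here is an assumption, as are the synthetic data and variable names. For exactly two predictors, the variance inflation factor (VIF) of each one reduces to 1 / (1 - r²), which gives a second view of the same question.

```python
import numpy as np
from scipy import stats

# Synthetic predictors, independent by construction (assumption, not real data).
rng = np.random.default_rng(1)
age = rng.uniform(18, 70, 200)
income = rng.uniform(15, 140, 200)

# H0: the two predictors are uncorrelated (population correlation = 0).
r, p_value = stats.pearsonr(age, income)

# With two predictors, each predictor's VIF equals 1 / (1 - r^2).
vif = 1.0 / (1.0 - r**2)

print(f"r = {r:.3f}, p = {p_value:.3f}, VIF = {vif:.3f}")
if p_value > 0.05:
    print("Cannot reject H0: no evidence of correlation between predictors.")
else:
    print("Reject H0: the predictors appear correlated.")
```

A VIF close to 1 points the same way as a large P-value: the predictors carry little shared information.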
Further Research
As a supermarket, we should expand the collection of customer
data and add sample points to make the results more reliable.
Then we should find out which age groups have higher
consumption levels and segment customers by age group. Finally,
we should develop a separate marketing strategy for each
age group.
04 Appendix
PART FOUR
Capstone Project Milestone 2: Research Design and The Data
Capstone Project Milestone 3: Hypothesis Testing
• Bank Marketing Data:
https://data.world/data-society/bank-marketing-data
The data relates to the direct marketing campaigns of a Portuguese banking institution. The
marketing campaigns were based on phone calls.
• G20 GDP Data:
https://stats.oecd.org/index.aspx?queryid=33940#
The GDP of each country for each quarter.
• URL of GitHub report:
https://colab.research.google.com/github/OliverGong77/NYU_Integrated_Marketing
I used a paired test, a Spearman test and a one-sample t-test to test the null hypotheses;
all three results were significant.
Name: Oliver Gong
ID number: N14152886
NetID: zg2088
Three Hypothesis Tests --- Paired Tests
Because the two groups of data are metric and we want to compare the GDP of the same
countries between the fourth quarter of 2018 and the fourth quarter of 2019, we use a paired test.
Null hypothesis: the means of GDP per
capita for the fourth quarter of 2018 and
2019 for all countries are the same.
Conclusion: The P-value rounds to 0 (< 0.05): at the 0.05 significance level, we can reject the null hypothesis
that the means of GDP per capita for the fourth quarter of 2018 and 2019 for all countries are the same.
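The paired test on this slide can be sketched with scipy.stats.ttest_rel, which compares two measurements taken on the same units (here, the same countries in two quarters). The figures below are synthetic and the variable names are hypothetical; the slides do not state which function the notebook used.

```python
import numpy as np
from scipy import stats

# Synthetic GDP-per-capita figures for 20 countries (assumption, not OECD data).
rng = np.random.default_rng(2)
gdp_q4_2018 = rng.uniform(20_000, 60_000, 20)
gdp_q4_2019 = gdp_q4_2018 * 1.02 + rng.normal(0, 200, 20)  # about 2% growth plus noise

# H0: the mean of the paired differences (2019 minus 2018) is 0.
t_stat, p_value = stats.ttest_rel(gdp_q4_2019, gdp_q4_2018)

print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
print("Reject H0" if p_value < 0.05 else "Cannot reject H0")
```

Pairing matters here: each country serves as its own control, so the test works on the 20 within-country differences rather than on two independent samples.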
Three Hypothesis Tests --- Spearman Tests
Because the normality check returned False, we use the Spearman
test for correlation.
Null hypothesis: the GDP of a given country in the
fourth quarter of 2018 and of 2019 is not correlated.
Conclusion: The P-value < 0.05: at the 0.05
significance level, we can reject the null hypothesis
that the GDP of a given country in the fourth quarter
of 2018 and 2019 is not correlated.
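The choice between a parametric and a rank-based correlation test can be sketched as below. The slides do not say how normality was checked, so the Shapiro-Wilk test is an assumption, as are the synthetic data and variable names.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed GDP figures for 20 countries (assumption, not OECD data).
rng = np.random.default_rng(3)
gdp_2018 = rng.lognormal(mean=10, sigma=1, size=20)
gdp_2019 = gdp_2018 * rng.uniform(0.98, 1.05, size=20)

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution.
_, p_normal_2018 = stats.shapiro(gdp_2018)
_, p_normal_2019 = stats.shapiro(gdp_2019)
is_normal = p_normal_2018 > 0.05 and p_normal_2019 > 0.05

if is_normal:
    method = "pearson"    # parametric, assumes normality
    coef, p_value = stats.pearsonr(gdp_2018, gdp_2019)
else:
    method = "spearman"   # rank-based, no normality assumption
    coef, p_value = stats.spearmanr(gdp_2018, gdp_2019)

print(f"{method}: coefficient = {coef:.3f}, p = {p_value:.4g}")
```

Spearman's test works on ranks, so it stays valid when the data are skewed, which is the situation the slide describes.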
Three Hypothesis Tests --- One-Sample T-test
Because I want to test whether the mean of a single group of data equals a hypothesized
value, I use a one-sample t-test on the mean of balance.
Null hypothesis: the mean of balance equals 300.
Result: The P-value < 0.05: at the 0.05 significance level, we can reject the
null hypothesis that the mean of balance equals 300.
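A one-sample t-test of this kind can be sketched with scipy.stats.ttest_1samp. The balance values below are synthetic (an assumption, not the bank marketing data); only the hypothesized mean of 300 comes from the slide.

```python
import numpy as np
from scipy import stats

# Synthetic account balances (assumption, not the bank marketing data).
rng = np.random.default_rng(4)
balance = rng.exponential(scale=1300, size=500)

# H0: the population mean of balance equals 300.
t_stat, p_value = stats.ttest_1samp(balance, popmean=300)

print(f"sample mean = {balance.mean():.1f}, t = {t_stat:.2f}, p = {p_value:.4g}")
print("Reject H0" if p_value < 0.05 else "Cannot reject H0")
```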
Power Analysis and Final Remarks
We want to know the sample size the research requires, so we set Cohen's d, the power and alpha and run a power
analysis.
Result: For a Cohen's d effect size of 0.77, a power of 0.80, and a type I error rate of 0.05, we need a sample size
of 27 (for each group).
There are 224 countries and regions in the world, but we compare the quarterly GDP of only 20 countries, so our
conclusions are not very strong. In the future, we should increase the sample size and obtain the quarterly GDP data
of all countries and regions in the world to make our conclusions more reliable.
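The sample-size figure from the power analysis can be approximated with the standard normal-approximation formula for a two-sided, two-sample comparison, n per group = 2 * ((z_(1-alpha/2) + z_(power)) / d)^2. This is a sketch under that assumption; the notebook may have used an exact t-based solver, which can return a slightly different figure. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample test.

    Uses the normal approximation:
        n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_power = NormalDist().inv_cdf(power)           # quantile for the target power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)                             # round up to a whole subject


n_required = sample_size_per_group(0.77, alpha=0.05, power=0.80)
print(n_required)  # 27
```

Under this approximation the inputs from the slide (d = 0.77, power = 0.80, alpha = 0.05) give about 26.5, which rounds up to 27 per group.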
• Customer Churn Prediction 2020
https://www.kaggle.com/c/customer-churn-prediction-2020
This competition is about predicting whether a customer will change telecommunications
provider, something known as “churning”. The dataset contains 4250 samples. Each sample
contains 19 features and 1 boolean variable "churn" which indicates the class of the sample.
• URL of GitHub report:
https://colab.research.google.com/github/OliverGong77/NYU_Integrated_Marketing
I ran a linear regression and found that neither total day minutes nor total eve
minutes is a statistically significant predictor of total night minutes.
Capstone Project Milestone 4: Regression
Conduct the Analysis --- Scatter Plot
Because I wanted to know whether there was a linear relationship between total day
minutes and total night minutes, and between total eve minutes and total night
minutes, I made scatter plots of the data.
Conduct the Analysis --- Regression Result
Based on the previous step, I set total day minutes as the independent variable X1,
total eve minutes as the independent variable X2, and total night minutes as the
dependent variable Y. I then stated the null hypothesis and ran a linear regression.
Null hypothesis: β1=0 and β2=0
Result: The X1 P-value > 0.05: at the 0.05 significance level, we cannot reject
the null hypothesis that β1=0.
The X2 P-value > 0.05: at the 0.05 significance level, we cannot reject the null
hypothesis that β2=0.
Conduct the Analysis --- Insights and making decisions
The linear regression analysis leads to two conclusions:
(1) Total day minutes is not a statistically significant predictor of total
night minutes.
(2) Total eve minutes is not a statistically significant predictor of total
night minutes.
Therefore, if we want to retain customers, we should offer discount
packages tailored to the periods in which customers call.
Assumptions Check
I then checked whether the six assumptions behind the regression are likely to be satisfied.
I found that five of them are satisfied and one is not.
Assumption 4: The variance of the error term is constant; it does not depend on
the values of X.
The scatter plot below suggests that assumption 4 is satisfied.
Assumptions Check
Assumption 2: The means of the normal distributions of Y, given X, lie on a straight line
with slope b.
The scatter plot on page 2 shows that assumption 2 is not satisfied.
Assumptions 1 & 3: The error term is normally distributed, so for each fixed value of X the
distribution of Y is normal. The mean of the error term is 0.
The diagram below suggests that assumptions 1 and 3 are satisfied.
Assumption 5: The error terms are uncorrelated. In other words, the
observations have been drawn independently.
Because our data is not time-series data, assumption 5 is satisfied.
Assumption 6: The independent variables in X are not correlated, so there is no
multicollinearity issue.
The P-value = 0.388 > 0.05, so at the 0.05 significance level we cannot
reject the null hypothesis that the independent variables in X are not
correlated. Assumption 6 is therefore satisfied.
Further Research
As a telecommunications company, we should identify the factors
that significantly influence the customer churn rate and give
corresponding recommendations to reduce churn.
• Online retail customer clustering
https://www.kaggle.com/hellbuoy/online-retail-customer-clustering
Online retail is a transnational data set which contains all the transactions occurring between
01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company
mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
• URL of Kaggle Notebook:
https://www.kaggle.com/olivergong77/customer-segementation-zg2088
I chose the French retail customers’ data for the cluster analysis. I used K-Means
Clustering and Hierarchical Clustering to find the best k and to identify the target
customer clusters that deserve attention.
Capstone Project Milestone 5: Clustering
K-Means Clustering --- Finding the best k
I chose the French retail customers’ data for the
cluster analysis.
With metric = “distortion”, I got k = 4;
with metric = “silhouette”, I got k = 3;
with metric = “calinski_harabasz”, no optimal k was identified.
I therefore chose k = 3 as the best k.
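The search for the best k can be sketched as a loop that fits K-Means for several values of k and records the three metrics named on this slide. The slides do not name the library, so scikit-learn is an assumption here, as are the synthetic data and the choice to pick k by the highest silhouette score. Distortion is taken to be the within-cluster sum of squares (inertia_ in scikit-learn).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Three well-separated synthetic blobs (assumption, not the retail data).
rng = np.random.default_rng(5)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(0, 0.5, (30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        "distortion": model.inertia_,                         # lower is tighter
        "silhouette": silhouette_score(X, model.labels_),     # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, model.labels_),
    }

# Pick the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])

for k, s in scores.items():
    print(k, {name: round(value, 2) for name, value in s.items()})
print("best k by silhouette:", best_k)
```

Distortion always falls as k grows, so it is read by looking for an elbow rather than a minimum, which is one reason the metrics can disagree as they do on the slide.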
K-Means Clustering --- Visualize the cluster with the best k and summarize
By the RFM criteria, we should choose the customer clusters
with a lower recency and a higher frequency and amount.
The K-means clustering results show that customers with
Cluster_Id=2 best fit the criteria.
K-Means Clustering returns 18 target customers.
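The RFM selection rule can be sketched as a ranking over per-cluster averages. Everything in this example is hypothetical: the cluster names, the average values and the rule of summing the three ranks are illustrative assumptions, not results or logic taken from the notebook.

```python
# Hypothetical per-cluster averages (illustrative values, not notebook results).
cluster_means = {
    "A": {"recency": 120.0, "frequency": 2.0, "amount": 300.0},
    "B": {"recency": 60.0, "frequency": 5.0, "amount": 900.0},
    "C": {"recency": 15.0, "frequency": 12.0, "amount": 4000.0},
}


def pick_target_cluster(means):
    """Return the cluster with the best combined RFM rank.

    Lower recency is better; higher frequency and amount are better.
    Rank 0 is best on each metric, and the ranks are summed.
    """
    def ranks(metric, higher_is_better):
        ordered = sorted(means, key=lambda c: means[c][metric],
                         reverse=higher_is_better)
        return {cluster: position for position, cluster in enumerate(ordered)}

    recency_rank = ranks("recency", higher_is_better=False)
    frequency_rank = ranks("frequency", higher_is_better=True)
    amount_rank = ranks("amount", higher_is_better=True)
    return min(means, key=lambda c: recency_rank[c] + frequency_rank[c] + amount_rank[c])


target = pick_target_cluster(cluster_means)
print("target cluster:", target)
```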
Hierarchical Clustering --- Linkage methods
Using three linkage methods, I drew the tree diagrams (dendrograms).
Then I ran the hierarchical clustering with k=3.
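Hierarchical clustering with several linkage methods can be sketched with scipy.cluster.hierarchy. The slide does not name the three linkage methods, so single, complete and average are assumptions, as are the synthetic data. The tree is cut into k=3 clusters with fcluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three well-separated synthetic blobs (assumption, not the retail data).
rng = np.random.default_rng(6)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(0, 0.5, (30, 2)) for c in centers])

labels_by_method = {}
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                 # build the merge tree
    # Cut the tree so that at most 3 flat clusters remain.
    labels_by_method[method] = fcluster(Z, t=3, criterion="maxclust")

for method, labels in labels_by_method.items():
    sizes = np.bincount(labels)[1:]               # fcluster labels start at 1
    print(f"{method:8s} cluster sizes: {sizes.tolist()}")
```

The matrix Z returned by linkage is also the input scipy.cluster.hierarchy.dendrogram expects, so the same object can be used to draw the tree diagrams.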
Hierarchical Clustering --- Visualize the cluster with the best k and summarize
By the RFM criteria, we should choose the customer clusters
with a lower recency and a higher frequency and amount.
The hierarchical clustering results show that
customers with Cluster_Labels=2 best fit the criteria.
Hierarchical Clustering returns 2 target
customers.
Further Research
K-Means Clustering returns 18 target customers.
Hierarchical Clustering returns 2 target customers,
a much smaller group than the one K-Means Clustering
returns.
In practice, if the target group contains only 2 customers, the number of
people surveyed will be small and the results will not be reliable enough.
Therefore, I prefer K-Means Clustering.
THANKS
FOR
WATCHING
