2. SELF-INTRODUCTION
Ye Tian comes from Weifang, Shandong. He majored in Human
Resource Management at Indiana University Bloomington and made
the dean's list several times as an undergraduate. His first
internship was at Guotai Junan Securities, where he analyzed and
aggregated factors such as the interest rate levels of government
bonds and policy financial bonds, as well as the valuation of local
bonds in the secondary market. He also evaluated project construction
and bond market demand, determined the terms of special bonds, and
provided consulting services for special bonds, covering issuance
pricing, registration and custody, and listing and trading. His second
internship was at Allianz Group, where he carried out an on-site
investigation and valuation analysis of the furnishing company
TUBAO VENEER by collecting and evaluating macro and industry data
from various company reports.
GitHub Repo Link: https://github.com/YeTian08/NYU-Integrated-Marketing
Kaggle Notebook Link: https://www.kaggle.com/yetianyt2237/customer-segementation-yt2237
LinkedIn URL: https://www.linkedin.com/in/野-田-3788011b7/
3. SUMMARY
• An overview of statistical analysis used in marketing
• Evaluate qualitative and quantitative data to maximize customer relationships
integral to program planning
• Apply statistical techniques, correlating to short- and long-term profitability, to
prospect new customers
• Create statistically sound marketing tests to determine profitability and return on
investment (ROI)
• Design market research surveys to evaluate the potential success of new product
and service offerings
• Analyze syndicated research to help gauge product success by segments and
target audiences
• Determine campaign effectiveness through statistical modeling techniques
4. RESEARCH DESIGN AND THE DATA
Marketing-Customer-Value-Analysis
Link: https://datastudio.google.com/reporting/d8708102-d56d-43bb-928a-
5. EXECUTIVE SUMMARY
Data sources: https://www.kaggle.com/rouseguy/bankbalanced
Summary: The original UCI dataset is imbalanced. This dataset is a balanced
sample drawn from it, with the same features as the original.
The training dataset contains 11162 samples. Each sample contains 17 features
and 1 Boolean variable "churn", which indicates the class of the sample.
Regression model: X = balance; Y = duration
Result: There is no linear correlation between balance and duration.
URL to my Github: https://github.com/YeTian08/NYU-Integrated-Marketing.git
6. SCATTER PLOTS AND SUMMARY FOR THE PLOTS
There is no linear correlation between balance and duration.
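The claim of no linear correlation can be checked numerically with the Pearson correlation coefficient. The following is a minimal sketch, not the analysis behind the slides: the function name `pearson_r` and the sample values are made up for illustration and are not rows from the bank dataset.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values chosen only to illustrate a weak linear association.
balance = [100, 250, 400, 550, 700, 850]
duration = [300, 120, 410, 90, 380, 150]
r = pearson_r(balance, duration)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 0 is consistent with "no linear correlation"; it does not rule out a non-linear relationship.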
7. REGRESSION RESULT
Balance: P = 0.170, which is greater than 0.05. We cannot
reject the null hypothesis.
Duration: P = 0.000, which is lower than 0.05. We can
reject the null hypothesis.
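The comparison of a P-value against 0.05 can be illustrated with a small least-squares sketch. This is an assumption-laden example rather than the model behind the slide: the function name `slope_test` is hypothetical, the data are synthetic, and the P-value uses a large-sample normal approximation (an exact test would use a t distribution with n - 2 degrees of freedom).

```python
import math
from statistics import NormalDist


def slope_test(xs, ys):
    """Fit y = a + b*x by least squares; return (slope, z, two-sided p-value).

    The p-value uses a normal approximation, which is reasonable only
    for large samples.
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and the standard error of the slope.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2) / sxx)
    if se == 0:
        raise ValueError("perfect fit: the test statistic is undefined")
    z = slope / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return slope, z, p_value


# Synthetic data: one series with a clear trend, one with only noise.
xs = list(range(1, 21))
trend = [2 * x + (-1) ** x for x in xs]
noise = [(-1) ** x for x in xs]

_, _, p_trend = slope_test(xs, trend)
_, _, p_noise = slope_test(xs, noise)
print(p_trend < 0.05, p_noise < 0.05)
```

Under the null hypothesis that the slope is zero, a P-value below 0.05 leads to rejection and a P-value above 0.05 does not.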
8. INSIGHT
• There is no linear relationship between balance and duration.
• It is therefore not meaningful to run a linear regression with this X and Y.
The study should either find two other variables that are linearly related or
use new balance and duration data to look for a linear relationship.
9. ASSUMPTION
X and Y do not have a linear relationship, so Y cannot be
expressed as a linear function of X.
The variance of the residuals is constant, so the
equal-variance assumption holds: the variance does
not depend on the value of X.
The residuals are not normally distributed.
The assumption is that, for any given X,
the distribution of Y is normal.
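The equal-variance and normality statements above are normally judged from residual plots. As a rough numerical companion, the sketch below fits a least-squares line and reports two crude diagnostics: the ratio of residual variances in the high-X and low-X halves, and the skewness of the residuals. The function name `residual_checks`, the choice of diagnostics, and the synthetic data are all assumptions made for illustration.

```python
from statistics import mean, variance


def residual_checks(xs, ys):
    """Return (variance_ratio, skewness) for least-squares residuals.

    variance_ratio: residual variance in the high-X half divided by the
    low-X half; values near 1 are consistent with constant variance.
    skewness: values near 0 are consistent with a symmetric distribution.
    """
    n = len(xs)
    if n != len(ys) or n < 4:
        raise ValueError("need at least 4 paired observations")
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    # Residuals ordered by X so the sample can be split into two halves.
    residuals = [y - (intercept + slope * x) for x, y in sorted(zip(xs, ys))]
    half = n // 2
    variance_ratio = variance(residuals[half:]) / variance(residuals[:half])
    # Sample skewness: third central moment over the 1.5 power of the second.
    centre = mean(residuals)
    m2 = sum((e - centre) ** 2 for e in residuals) / n
    m3 = sum((e - centre) ** 3 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    return variance_ratio, skewness


# Synthetic data with a linear trend and symmetric, constant-size noise.
xs = list(range(1, 21))
ys = [2 * x + (-1) ** x for x in xs]
ratio, skew = residual_checks(xs, ys)
print(f"variance ratio = {ratio:.3f}, skewness = {skew:.3f}")
```

These two numbers are only a quick screen; formal tests and the residual plots themselves remain the better basis for a conclusion.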
11. Summary
Data Source 1: https://data.world/data-society/bank-marketing-data
Summary: The data relates to the direct marketing campaigns of a
Portuguese banking institution. The marketing campaigns were based on
phone calls.
Data Source 2: https://data.world/data-society/bank-marketing-data
Summary: COVID-19 containment measures weighed heavily on economic
activity in the second quarter of 2020, with unprecedented falls in real gross
domestic product (GDP) in most G20 countries.
Hypothesis and Result: This report uses a paired t-test, a Pearson correlation
test, and a two-sample t-test to test the hypotheses, and the result is that the
hypotheses are rejected.
My Github URL: https://github.com/YeTian08/NYU-Integrated-Marketing.git
CAPSTONE PROJECT MILESTONE 3: HYPOTHESIS TESTING
12. PAIRED T-TEST
Conclusion: With a P-value of 0.0 and a power of 0.38, we reject the
hypothesis that the GDP of 2019 Q2 is the same as that of 2020 Q2.
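For reference, the paired t-test works on the per-country differences between the two quarters. The sketch below computes only the test statistic and its degrees of freedom; the function name `paired_t_statistic` and the numbers are hypothetical and are not the GDP figures used in the report. A P-value would then come from a t distribution with n - 1 degrees of freedom (for example via `scipy.stats.ttest_rel`, if SciPy is available).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for a paired t-test on before - after."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need at least 2 matched pairs")
    diffs = [b - a for b, a in zip(before, after)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    standard_error = sd / math.sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1


# Hypothetical paired values (same units, same countries in both lists).
q2_2019 = [10, 12, 14, 16]
q2_2020 = [8, 9, 13, 12]
t, df = paired_t_statistic(q2_2019, q2_2020)
print(f"t = {t:.3f} with {df} degrees of freedom")
```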
13. ASSUMPTIONS:
Conclusion: The hypothesis is that the 2019 Q2 and 2020 Q2
values are not correlated. The P-value is 0.0, and the power is
1.0. The hypothesis is rejected, and the Pearson
correlation test is the appropriate method.
14. TWO SAMPLE T-TEST
Conclusion: We can reject the null hypothesis that the mean balance is
equal between those who have a housing loan and those who do not,
at the 0.05 significance level.
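The comparison of mean balance between the two groups can be sketched as a two-sample t statistic. The slide does not say whether equal variances were assumed, so this example uses Welch's version as an assumption; the function name `welch_t_statistic` and the data are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t_statistic(group_a, group_b):
    """Return (t, degrees_of_freedom) for Welch's two-sample t-test."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least 2 observations")
    # Per-group variance of the sample mean.
    term_a = variance(group_a) / n_a
    term_b = variance(group_b) / n_b
    if term_a + term_b == 0:
        raise ValueError("both groups are constant; t is undefined")
    t = (mean(group_a) - mean(group_b)) / math.sqrt(term_a + term_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (term_a + term_b) ** 2 / (
        term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical balances for customers with and without a housing loan.
with_loan = [1, 2, 3, 4, 5]
without_loan = [3, 4, 5, 6, 7]
t_stat, dof = welch_t_statistic(with_loan, without_loan)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

The null hypothesis is rejected at the 0.05 level when the two-sided P-value for this statistic falls below 0.05.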
15. Conclusion: For a Cohen's d effect size of 0.34, a power of 0.8, and a type I error rate of
0.05, we need a sample size of 137.
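The sample size figure can be approximated by hand. The sketch below assumes a two-sided, two-sample t-test with equal group sizes, in which case the result is a per-group size; the slide does not state which test or tool produced 137, so this is one plausible reconstruction. It uses the normal-approximation formula with Guenther's small correction term, and the function name `sample_size_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group n for a two-sided, two-sample t-test.

    Uses n = 2 * ((z_alpha + z_beta) / d)^2 + z_alpha^2 / 4, where the
    last term is Guenther's correction for using t instead of z.
    """
    if effect_size <= 0:
        raise ValueError("effect size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2 + z_alpha ** 2 / 4
    return math.ceil(n)


n_needed = sample_size_per_group(0.34, alpha=0.05, power=0.8)
print(n_needed)
```

With these inputs the approximation gives 137, which agrees with the figure in the conclusion under the stated assumptions.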
Limitation: Any effect that threatens the internal validity of a research study
may bias the results and undermine the statistical conclusions reached.
The analysis deals with groups and aggregates only.
Further Research: We need to collect more data from different countries and
regions. We also need to control for other variables.
17. Executive Summary (Ye Tian yt2237)
Data sources: https://www.kaggle.com/c/customer-churn-prediction-2020
Summary: This competition is about predicting whether a customer will change
telecommunications provider, something known as "churning".
The training dataset contains 4250 samples. Each sample contains 19 features
and 1 boolean variable "churn" which indicates the class of the sample.
Regression model: X = total day/eve minutes; Y = total night minutes.
Result: There is no linear correlation between total day minutes and total night
minutes, nor between total eve minutes and total night minutes.
URL to my Github: https://github.com/YeTian08/NYU-Integrated-Marketing.git
CAPSTONE PROJECT MILESTONE 4: REGRESSION
18. SCATTER PLOTS AND SUMMARY FOR THE PLOTS
There is no linear correlation between total day minutes and total night
minutes.
There is no linear correlation between total eve minutes and total night
minutes.
19. REGRESSION RESULT
Total day charge: P = 0.521, which is greater than 0.05. We cannot
reject the null hypothesis.
Total eve charge: P = 0.365, which is greater than 0.05. We cannot
reject the null hypothesis.
20. INSIGHT
There is no linear relationship between total day minutes and total night minutes.
There is no linear relationship between total eve minutes and total night minutes.
It is therefore not meaningful to run a linear regression with this X and Y. The
study should either find two other variables that are linearly related or use new
day, night, and eve minutes data to look for a linear relationship.
21. Assumption
X and Y do not have a linear relationship, so Y cannot be expressed as a linear function of X.
The variance of the residuals is constant, so the equal-variance assumption holds:
the variance does not depend on the value of X.
The residuals are normally distributed.
For any given X, the distribution
of Y is normal.
23. Executive Summary
Data Source: https://www.kaggle.com/sunshineluyaozhang/customer-segementation-lz2520
Segmentation based on RFM has been in use for over 50 years, especially
by direct marketers, to target a subset of customers, save marketing
costs, and improve profits. RFM is based on three pillars of customer
attributes: Recency of purchase, Frequency of purchase, and Monetary
value of purchase. The objective is to identify target customers who have
purchased recently (low recency), purchase often (high frequency), and
have spent a large amount (high monetary value).
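The three RFM attributes described above can be derived from a transaction log. The sketch below is a minimal illustration under stated assumptions: each transaction is a `(customer_id, purchase_date, amount)` tuple, frequency is counted per transaction, and the function name `rfm_table` and the sample rows are hypothetical rather than taken from the Kaggle notebook.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}.

    recency_days: days between the customer's last purchase and `as_of`.
    frequency: number of transactions. monetary: total amount spent.
    """
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, purchase_date, amount in transactions:
        if customer_id not in last_purchase or purchase_date > last_purchase[customer_id]:
            last_purchase[customer_id] = purchase_date
        frequency[customer_id] += 1
        monetary[customer_id] += amount
    return {
        cid: ((as_of - last_purchase[cid]).days, frequency[cid], monetary[cid])
        for cid in last_purchase
    }


# Hypothetical transactions for two customers.
transactions = [
    ("A", date(2011, 12, 1), 20.0),
    ("A", date(2011, 12, 8), 30.0),
    ("B", date(2011, 11, 1), 5.0),
]
table = rfm_table(transactions, as_of=date(2011, 12, 10))
print(table)
```

Customer A in this toy example has lower recency, higher frequency, and higher monetary value than customer B, which is the profile the objective describes.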
Method: RFM factors with K-Means clustering and hierarchical clustering.
Result: France was chosen as the country for building the RFM clustering and selecting the best set of customers.
K-Means Clustering: with K = 3, cluster ID 1 returns 57 target customers.
Hierarchical Clustering: returns 2 target customers.
The URL to my Kaggle Notebook: https://www.kaggle.com/yetianyt2237/customer-segementation-yt2237
CAPSTONE PROJECT MILESTONE 5: CLUSTERING
24. K-Means Clustering
The best K = 3
Lowest recency: cluster 1
Highest frequency: cluster 1
Highest amount: cluster 1
Based on the RFM criteria, cluster label 1
best fits the criteria.
27. Hierarchical Clustering
Lowest recency: cluster 2
Highest frequency: cluster 2
Highest amount: cluster 1
Based on the RFM criteria, cluster
label 2 best fits the criteria.
Hierarchical clustering returns 2 target
customers.
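Both clustering slides choose the cluster label that wins on the RFM criteria (lowest recency, highest frequency, highest amount). The slides do not say how a split result is resolved, so the sketch below assumes a simple majority vote across the three criteria. The function name `best_rfm_cluster` and the per-cluster averages are hypothetical.

```python
from collections import Counter


def best_rfm_cluster(cluster_means):
    """Pick the cluster that wins the most RFM criteria.

    cluster_means: {label: (mean_recency, mean_frequency, mean_monetary)}.
    Returns (winning_label, votes). Ties go to the criterion checked
    first (recency, then frequency, then monetary).
    """
    if not cluster_means:
        raise ValueError("no clusters given")
    votes = Counter()
    votes[min(cluster_means, key=lambda k: cluster_means[k][0])] += 1  # lowest recency
    votes[max(cluster_means, key=lambda k: cluster_means[k][1])] += 1  # highest frequency
    votes[max(cluster_means, key=lambda k: cluster_means[k][2])] += 1  # highest amount
    winner = max(votes, key=votes.get)
    return winner, dict(votes)


# Hypothetical averages: cluster 1 wins all three criteria.
unanimous = {0: (120.0, 1.5, 300.0), 1: (15.0, 8.0, 4200.0), 2: (60.0, 3.0, 900.0)}
# Hypothetical averages: cluster 2 wins recency and frequency, cluster 1 wins amount.
split = {1: (40.0, 5.0, 9000.0), 2: (10.0, 12.0, 7000.0)}

label_a, votes_a = best_rfm_cluster(unanimous)
label_b, votes_b = best_rfm_cluster(split)
print(label_a, votes_a)
print(label_b, votes_b)
```

The second case mirrors the split pattern reported for hierarchical clustering, where two criteria point to one label and the third points to another.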