2. Yi Xu
Global Economics, International Trade, Securities, Investment
Yi Xu comes from Shangrao, Jiangxi. He graduated from UC Santa Cruz with a major in
Global Economics. During his internship at Yusi Education Technology as a channel
manager assistant, he proposed data-driven online marketing strategies to optimize
business processes and explored how data and analytics were transforming critical
areas of strategy, marketing, and operations. He also realized that data analytics
skills would be needed in future marketing. After that, he started a full-time job
at China Galaxy Securities as an investment consultant, where he was responsible for
maintaining relationships with high-net-worth customers and provided professional
investment and financial advice to help them achieve their asset allocation goals.
During one year and six months at the securities company, he found that he wanted to
learn more about analyzing data to support marketing decisions and to be equipped
with stronger analytical skills to collect, organize, analyze, and disseminate
significant amounts of information. Therefore, he decided to apply for the master's
program in Integrated Marketing.
B.A. in Global Economics | University of California, Santa Cruz
Email: yx2489@nyu.edu
LinkedIn: https://www.linkedin.com/in/yi-xu-65bb92118/
GitHub: https://github.com/yx2489/NYU_Integrated_Marketing
Kaggle Notebook: https://www.kaggle.com/yx2489/notebooks
4. Summary
In this class, I learned how to use different methods to analyze my
data, and I had the chance to get to know the cryptocurrency website
CoinMarketCap.com. As a cryptocurrency trader, I can use this website
to check my cryptocurrency prices in real time. Moreover, the website
provides a sufficient data source for almost every cryptocurrency, and it is
free, so I can follow cryptocurrency trends closely. As for
my professional growth, I learned how to analyze data by using
hypothesis testing, logit regression, and clustering. In my future career, I
plan to combine what I learned in this class with the financial
skills from my previous major to achieve my career goals.
6. The data contains information from the 1990 California census. Although
its prices differ greatly from current housing prices, such as those in the
Zillow Zestimate dataset, it provides an accessible introductory dataset for
getting to know California housing prices in the 1990s.
The data pertains to the houses found in a given California district. Districts
are grouped by ocean proximity: Inland, Near Bay Area, Island, Near Ocean, and
less than 1 hour from the ocean. The columns are as follows, and their names are
fairly self-explanatory: Longitude, Latitude, Housing median age, Total_rooms,
Total_bedrooms, Population, Households, Median_income, Median house value,
ocean_proximity.
New Dataset
https://www.kaggle.com/camnugent/california-housing-prices
8. Capstone 2 California Housing
Abstract: I use the California housing data from the 1990s. I compare house prices in different
areas of California, for example, on an island, near the bay, near the ocean, and inland. This graph
shows how many houses are in these different areas and how their prices relate to location. We can
conclude that most of the housing in this dataset is located less than one hour from the ocean.
Link: https://datastudio.google.com/reporting/a83b2abe-31ec-424c-8ac9-7103df25e64f
9. Part III: Your own market research report
Session 3
Capstone 4: Logit Regression
10. Executive Summary
The data is from Kaggle
https://www.kaggle.com/camnugent/california-housing-prices
In this research, I use a logit model to estimate the influence of x on whether people lived on an
island in California during the 1990s.
X includes housing median age, median income, and population.
P is the probability of living on an island.
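For reference, a logit model of this kind can be written as follows. The coefficient symbols (beta) and the numbering of the x variables are notation introduced here for illustration, not taken from the original slides; x_1 is housing median age, x_2 is median income, and x_3 is population.

```latex
P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}},
\qquad
\ln\frac{P}{1 - P} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
```

The second form shows that each coefficient acts on the log-odds, which is why results are usually reported as odds ratios.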
Result: Housing median age, median
income, and population do not have a
significant influence on whether
people live on an island. For further
research, we need to test other
variables to see if they have an
influence on whether people live on
an island.
Capstone 4
11. Logit Regression Result
Summary: The p-values of the three x variables are all greater than 0.05, so we cannot
reject the null hypothesis that they have no significant influence on whether people
live on an island in California.
12. Evaluate The Result
Summary: The accuracy rates are 0.9995 and 0.9997. The precision is high based on
the test result.
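Accuracy and precision are separate metrics, and they can diverge when the positive class is rare. The sketch below uses made-up labels (not the housing data) to show how a model that never predicts the positive class can still reach very high accuracy; the class counts are assumptions chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive.

    Returns None when the model predicts no positives at all.
    """
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == 1]
    if not predicted_pos:
        return None
    return sum(predicted_pos) / len(predicted_pos)

# Hypothetical labels: 10 positive cases out of 10,000 (illustrative numbers only)
y_true = [1] * 10 + [0] * 9990
# A trivial model that never predicts the positive class
y_pred = [0] * 10000

acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
print(acc, prec)
```

With a rare outcome, it is therefore worth checking precision and recall alongside accuracy before judging the model.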
13. Interpret the Result
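Summary: If housing median age increases by 1, the odds of living on an island
are multiplied by 1.08 (the odds ratio). However, because the p-value is more
than 0.05, we cannot conclude that housing median age has a great influence on
whether people live on an island.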
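To make the odds ratio of 1.08 concrete, the following sketch converts it back to a logit coefficient and applies it to a baseline probability. The baseline probability is a hypothetical value chosen for illustration; only the 1.08 comes from the slide.

```python
import math

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def prob_from_odds(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

odds_ratio = 1.08                 # value reported on the slide
coef = math.log(odds_ratio)       # implied logit coefficient

baseline_p = 0.001                # hypothetical baseline probability (assumption)
new_odds = odds(baseline_p) * odds_ratio   # a one-unit increase multiplies the odds
new_p = prob_from_odds(new_odds)
print(coef, new_p)
```

An odds ratio multiplies the odds rather than adding to the probability, so its practical effect depends on the baseline.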
14. Part VI: Appendix
•Capstone Project Milestone 2: Research Design and The Data
•Capstone Project Milestone 3: Hypothesis Testing
•Capstone Project Milestone 4: Regression
•Capstone Project Milestone 5: Clustering
15. DOGE vs OMG | Yi Xu
Abstract: DOGE and OMG are two popular tokens with relatively small volume and market cap. In this study, we explore the future potential of these
two tokens by comparing their performance. OMG Network (first developed as OmiseGO) is a non-custodial, Layer 2 scaling solution for transferring
value on Ethereum. How the protocol processes transactions is centralized, but its Plasma-based design aims to decentralize network security.
Dogecoin (DOGE), in contrast, is based on the popular "doge" Internet meme and features a Shiba Inu on its logo. The open-source digital currency was
created by Billy Markus from Portland, Oregon and Jackson Palmer from Sydney, Australia, and was forked from Litecoin in December 2013.
Dogecoin's creators envisaged it as a fun, light-hearted cryptocurrency that would have greater appeal beyond the core Bitcoin audience, since it
was based on a dog meme. After visualizing the data from CoinMarketCap, we find that the two tokens' volumes were similar before 2019, but
after 2020 OMG's volume increased dramatically. Therefore, we believe OMG can be a profitable investment option for those who
want to invest in cryptocurrencies with small market cap and volume.
Capstone 2
16. Capstone 3
Summary Name: Yi Xu yx2489
In this research, I conduct a paired t-test and a two-sample t-test. For the assumptions,
because the data follow a normal distribution, we can use Pearson correlation.
https://data.world/data-society/bank-marketing-data
This data is related to the direct marketing campaigns of a Portuguese banking
institution.
https://stats.oecd.org/index.aspx?queryid=33940#
This data covers the economies of OECD countries before and after COVID-19.
Github repo URL: https://github.com/yx2489/NYU_Integrated_Marketing
17. Conclusion: Since the p-value is lower than 0.05, we can reject the null hypothesis. Because we
concluded that the data follow a normal distribution, we use Pearson correlation.
Assumptions
18. Paired T-Test
Conclusion: Since the p-value is lower than 0.05, we can reject the null hypothesis. The countries'
economies in 2020 were weaker than in 2018. This suggests that COVID-19 had a negative impact on
the economies of OECD countries.
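The paired t-test compares the same units at two points in time by testing whether the mean of the within-unit differences is zero. Below is a minimal sketch of the test statistic using only the standard library; the numbers are hypothetical and are not the OECD data.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for a paired t-test on after - before."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)            # sample standard deviation
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1

# Hypothetical indicator values for five countries (assumption, not the OECD data)
values_2018 = [2.9, 1.8, 2.4, 1.1, 3.0]
values_2020 = [-3.4, -4.6, -2.1, -8.0, -0.9]

t_stat, df = paired_t(values_2018, values_2020)
print(t_stat, df)
```

A negative t statistic here means the later values are lower on average. The p-value comes from a t distribution with `df` degrees of freedom; a library routine such as `scipy.stats.ttest_rel` reports both the statistic and the p-value.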
19. Two Sample T-Test
Conclusion: We can reject the null hypothesis that the mean balance is equal between those
who have a loan and those who do not at the 0.05 (or even 0.001) significance level.
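A two-sample t-test compares the means of two independent groups. The slide does not say whether equal variances were assumed, so the sketch below uses Welch's version (unequal variances) as an assumption. The balances are hypothetical, not the bank marketing data.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    se2 = va / na + vb / nb                      # squared standard error of the mean difference
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t_stat, df

# Hypothetical account balances (assumption, not the real data)
balance_no_loan = [1500, 2300, 900, 3100, 1800, 2600]
balance_loan = [400, 900, 700, 1200, 300, 800]

t_stat, df = welch_t(balance_no_loan, balance_loan)
print(t_stat, df)
```

For real data, `scipy.stats.ttest_ind` with `equal_var=False` performs the same test and also returns the p-value.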
20. Conclusion: For a Cohen's d effect size of 0.8, a power of 0.70, and a type I error rate of 0.05,
we need a sample size of 20 (for each group).
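The required sample size can be approximated with the standard normal-approximation formula for a two-sided, two-sample comparison. This is a sketch under that approximation; the original analysis may have used an exact t-based power routine, and the function name is introduced here for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, power, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the type I error rate
    z_power = NormalDist().inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(0.8, 0.70))   # 20 under this approximation
```

Because the effect size sits in the denominator and is squared, smaller effects need much larger samples.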
21. Limitations and future research:
For the assumptions, since the p-value is lower than 0.05, we can reject the null hypothesis; because
we concluded that the data follow a normal distribution, we use Pearson correlation. For the paired
t-test, the p-value is also lower than 0.05, so we can likewise reject the null hypothesis. However,
the sample size might not be sufficient for other, more complicated tests.
If in the future we need to detect a Cohen's d of 0.5, the required sample size will be 50 instead of 20.
We can also expand the sample size if future clients ask for more detailed tests.
22. Executive Summary
The data is from Kaggle
https://www.kaggle.com/c/customer-churn-prediction-2020
Github URL: https://github.com/yx2489/NYU_Integrated_Marketing
In this research, I use a logit model to estimate the probability of success.
X includes total night charge, total night calls, and total night minutes.
P is the probability of success (here, a customer churning).
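As a rough illustration of how a logit model is fitted, the sketch below estimates a one-predictor model by gradient ascent on the log-likelihood. It uses synthetic data and a single standardized feature, both of which are assumptions; the study itself uses three predictors and most likely a library routine (for example `statsmodels` `Logit`, which also reports p-values).

```python
import math
import random

def sigmoid(z):
    """Logistic function mapping the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by batch gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)   # residual drives the gradient
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

random.seed(0)
# Synthetic standardized feature (assumption): higher x makes the event more likely
xs = [random.gauss(0, 1) for _ in range(300)]
ys = [1 if random.random() < sigmoid(-1.0 + 1.5 * x) else 0 for x in xs]

b0, b1 = fit_logit(xs, ys)
print(b0, b1, math.exp(b1))   # exp(b1) is the estimated odds ratio
```

The fitted slope is on the log-odds scale, so exponentiating it gives the odds ratio for a one-unit increase in x.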
Result: Total night charge, total
night calls, and total night minutes
do not have a significant influence
on churn. For further research, we
need to test other variables to see
if they have an influence on churn.
Capstone 4
23. Logit Regression Result
Summary: The p-values of the three x variables are all greater than 0.05, so we cannot
reject the null hypothesis that they have no significant influence on churn.
24. Evaluate the Result
Summary: The accuracy rates are 0.86 and 0.85929. The precision is high based on the test result.
25. Interpret the Result
Summary: If total night minutes increase by 1, the odds of churn are multiplied by 1.261855 (the odds
ratio). However, because the p-value is more than 0.05, we cannot conclude that total night minutes
have a great influence on churn.
26. Data Set: https://www.kaggle.com/hellbuoy/online-retail-customer-clustering
Kaggle Notebook URL: https://www.kaggle.com/yx2489/customer-segementation-yx2489
In this research, I use the online retail transnational dataset from France to build an RFM
clustering and choose the best set of customers for the company to target. I use K-Means
clustering and hierarchical clustering to obtain my results. K-Means clustering returns 18
target customers. Hierarchical clustering returns 2 target customers for customer cluster 2,
which is a much smaller group than the one that K-Means clustering returns. Hierarchical
clustering also returns 2 target customers for customer cluster 1.
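RFM stands for recency, frequency, and monetary amount, computed per customer from the transaction log. The following is a minimal sketch with hypothetical transactions; the customer ids, dates, amounts, and snapshot date are all made up for illustration and are not taken from the dataset.

```python
from datetime import date

# Hypothetical transactions: (customer_id, invoice_date, amount)
transactions = [
    ("C1", date(2011, 11, 1), 50.0),
    ("C1", date(2011, 12, 5), 70.0),
    ("C2", date(2011, 6, 15), 20.0),
    ("C3", date(2011, 12, 8), 200.0),
    ("C3", date(2011, 12, 9), 150.0),
    ("C3", date(2011, 10, 2), 90.0),
]

def rfm_table(rows, snapshot):
    """Return {customer: (recency_days, frequency, monetary)}."""
    table = {}
    for customer, day, amount in rows:
        last, freq, total = table.get(customer, (None, 0, 0.0))
        last = day if last is None or day > last else last   # keep the latest purchase date
        table[customer] = (last, freq + 1, total + amount)
    return {c: ((snapshot - last).days, f, m) for c, (last, f, m) in table.items()}

rfm = rfm_table(transactions, snapshot=date(2011, 12, 10))
for customer, values in sorted(rfm.items()):
    print(customer, values)
```

These three columns are what the clustering methods then group; in practice they are usually rescaled first so that no single column dominates the distance.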
K-Means Clustering: K-means clustering is an effective method of non-hierarchical clustering. In this
method, the partitions are made such that the groups are non-overlapping and have no hierarchical
relationships among themselves.
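The partitioning idea can be sketched with a plain implementation of Lloyd's algorithm, the usual procedure behind K-means. The points below are hypothetical scaled (recency, frequency) pairs, and the function is an illustration rather than the notebook's code, which most likely relies on a library implementation.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical scaled (recency, frequency) pairs forming two obvious groups
points = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85), (0.9, 0.1), (0.8, 0.2), (0.85, 0.15)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Each point receives exactly one label, which is what makes the resulting groups non-overlapping.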
Hierarchical Clustering: Hierarchical clustering is an unsupervised clustering technique
that creates clusters in a predefined order. The clusters are ordered from top to
bottom.
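One common way to build such a hierarchy is agglomerative clustering, which repeatedly merges the two closest clusters. The sketch below assumes that variant and supports single, complete, and average linkage; the points and the function name are hypothetical, and the notebook's own implementation may differ.

```python
def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain; returns lists of point indices."""
    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    def cluster_dist(c1, c2):
        pairs = [dist(i, j) for i in c1 for j in c2]
        if linkage == "single":
            return min(pairs)                 # closest pair of members
        if linkage == "complete":
            return max(pairs)                 # farthest pair of members
        return sum(pairs) / len(pairs)        # average linkage

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: cluster_dist(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]   # merge b into a (b > a always holds)
        del clusters[b]
    return clusters

# Hypothetical points: two tight groups and one outlier
points = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85), (0.9, 0.1), (0.8, 0.2), (5.0, 5.0)]
result = agglomerative(points, n_clusters=3)
print(result)
```

Recording the order and distance of each merge is what a dendrogram plots; cutting the tree at a chosen height gives the final clusters.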
Capstone 5
28. K-Means Clustering: Interpreting the Clustering
By the RFM criteria, we should choose the customer clusters with a lower recency and a higher
frequency and amount. From the K-means clustering results, we can see that
customers with Cluster_Id=0 best fit the criteria.
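One simple way to apply these criteria is to rank the clusters on each metric and pick the one with the best combined rank. The sketch below uses hypothetical cluster means, not the notebook's values, and the rank-sum rule is an assumption chosen for illustration rather than the method used in the study.

```python
# Hypothetical per-cluster means of (recency, frequency, amount); not the notebook's values
cluster_means = {
    0: {"recency": 12.0, "frequency": 40.0, "amount": 900.0},
    1: {"recency": 95.0, "frequency": 8.0, "amount": 150.0},
    2: {"recency": 40.0, "frequency": 15.0, "amount": 300.0},
}

def rank(values, reverse):
    """Map each cluster id to its rank (0 = best) for one metric."""
    ordered = sorted(values, key=values.get, reverse=reverse)
    return {cid: pos for pos, cid in enumerate(ordered)}

def best_cluster(means):
    """Pick the cluster with the best combined rank: low recency, high frequency, high amount."""
    r = rank({c: m["recency"] for c, m in means.items()}, reverse=False)
    f = rank({c: m["frequency"] for c, m in means.items()}, reverse=True)
    a = rank({c: m["amount"] for c, m in means.items()}, reverse=True)
    return min(means, key=lambda c: r[c] + f[c] + a[c])

target = best_cluster(cluster_means)
print(target)
```

When no single cluster wins on all three metrics, the ranking rule has to be chosen deliberately, since different weightings can select different clusters.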
30. Hierarchical Clustering: Visualize the dendrogram (tree)
These dendrograms visualize the cluster tree for each linkage method:
single linkage, complete linkage, and average linkage.
31. Hierarchical Clustering: Visualize and Interpret the Result
By the RFM criteria, we should choose the customer clusters with a lower recency and a higher
frequency and amount. From the hierarchical clustering results, we can see that customers with
Cluster_Labels=2 best fit the criteria of low recency and high frequency, whereas cluster 1 fits the
high amount.
32. Hierarchical Clustering: Interpreting the Clustering
We can see that hierarchical
clustering returns 2 target
customers for customer cluster 2,
which is a much smaller group than
the one that K-Means clustering
returns.
We can see that hierarchical
clustering also returns 2 target
customers for customer cluster 1.