2. Self-Introduction
Hi, this is Roxie Zhang. I like rock music and stand-up
comedy, and I’m currently considering forming a rock
band. I would like to pursue a career in brand
marketing, where it is increasingly necessary to use data
analysis tools or learn how to work with data analysts.
It was such a pleasure to study with my
classmates this semester. Here is my
presentation reviewing this course, Statistical
Measurement, Analysis & Research.
GitHub Repo link:
https://colab.research.google.com/github/kexinez/NYU_Integrated_Marketing
Kaggle Notebook link:
https://www.kaggle.com/kexinezhang/women-clothing-ecommerce-analysis
LinkedIn URL: www.linkedin.com/kexin-zhang-972823166
3. Lessons
What I’ve learned:
• I think hypothesis testing will be useful for finding out
whether two items tend to co-occur in a customer’s
basket. Regression models can be used to predict the
number of items sold or other figures. More importantly,
I’m happy to have learned how to find the most valuable customer
segment. I can even visualize the analysis results for others
now.
• I think the most valuable thing I gained from this class
is that I’m no longer so unconfident about data analysis. I used
to freak out at the prospect of having to deal
with data processing. Now I am willing to get
hands-on and try to make progress, or learn by
searching the Internet.
4. Research Design
• Dataset: Individual medical costs billed by health insurance for over 1,330
beneficiaries in the US, along with their basic information.
• URL: https://www.kaggle.com/artaseyedian/predicting-health-insurance-charge-with-tidymodels
• Key variables: age of the insured, sex, BMI, number of children,
smoker or non-smoker, region, and individual medical costs billed by health insurance
• Research Design: To explore whether medical costs are correlated
with the number of children and BMI, I will fit linear
regression models, with the code kept on GitHub. The research will help marketers identify
customers with lower medical costs, so that the insurance company can
target this group and keep costs low.
5. Data Preparation
• Sample: 1,338 insured customers of a health insurer
• I conducted a preliminary data inspection that grouped
customers by their region in the US. The
southeast region not only accounts for the largest share of customers, it also has
the highest proportion of smokers relative to non-smokers.
• https://datastudio.google.com/reporting/4c2b0b69-782f-4690-ab78-42b944d1fa27
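The region and smoker breakdown described on this slide could be computed roughly as follows. This is only a minimal sketch: the dictionary keys `region` and `smoker`, the "yes"/"no" coding, and the sample rows are assumptions made up for illustration and are not taken from the dataset.

```python
from collections import Counter

def smoker_share_by_region(rows):
    """Return {region: (customer_count, smoker_fraction)}."""
    totals = Counter()
    smokers = Counter()
    for row in rows:
        totals[row["region"]] += 1
        if row["smoker"] == "yes":
            smokers[row["region"]] += 1
    return {r: (totals[r], smokers[r] / totals[r]) for r in totals}

# Made-up rows for illustration only (not real records).
sample_rows = [
    {"region": "southeast", "smoker": "yes"},
    {"region": "southeast", "smoker": "no"},
    {"region": "southeast", "smoker": "yes"},
    {"region": "northwest", "smoker": "no"},
    {"region": "northwest", "smoker": "no"},
]

summary = smoker_share_by_region(sample_rows)
for region, (count, share) in sorted(summary.items()):
    print(f"{region}: {count} customers, {share:.0%} smokers")
```

Keeping the count and the smoker share together makes it easy to see both claims at once: which region has the most customers and which has the highest proportion of smokers.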
6. Reproduce
Regression analysis
The 1st graph shows there is no clear
linear relationship between bmi and
charges.
The 2nd graph shows there is no linear
relationship between age and bmi.
7. • According to the regression results, the p-value of the variable children is
0.015 (<0.05), so we can reject the null hypothesis that charges are not
correlated with children.
• The p-value of the variable bmi rounds to 0 (<0.05), so we can reject the null
hypothesis that charges are not correlated with bmi.
• Both children and bmi appear to be positively correlated with charges.
• children has the larger effect on charges, but it also has a higher standard error.
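To make the coefficient, standard error and p-value columns of a regression table more concrete, here is a minimal sketch of ordinary least squares with two predictors, written with the Python standard library only. It is not the code used for the results above. The helper names `solve` and `ols` are hypothetical, the data are made up, and the p-values use a normal approximation to the t distribution, which is only reasonable for large samples.

```python
from statistics import NormalDist

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(xs, y):
    """Fit y = b0 + b1*x1 + b2*x2 + ... by least squares.

    xs is a list of predictor rows. Returns (coefs, std_errors, p_values),
    with the intercept first.
    """
    X = [[1.0] + list(row) for row in xs]
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    std_errors = []
    for i in range(k):
        # i-th diagonal entry of (X'X)^-1, obtained by solving against a unit vector.
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        std_errors.append((sigma2 * solve(xtx, unit)[i]) ** 0.5)
    # Two-sided p-values under a normal approximation (assumption: large n).
    p_values = [2 * (1 - NormalDist().cdf(abs(b / s)))
                for b, s in zip(beta, std_errors)]
    return beta, std_errors, p_values

# Made-up data for illustration only: charges rise with children and bmi.
children = [0, 1, 2, 3] * 3
bmi = [20, 22, 24, 26, 30, 28, 26, 24, 35, 21, 33, 27]
noise = [50, -50] * 6
charges = [1000 + 500 * c + 300 * b + e
           for c, b, e in zip(children, bmi, noise)]

beta, se, p = ols(list(zip(children, bmi)), charges)
for name, b, s, pv in zip(("const", "children", "bmi"), beta, se, p):
    print(f"{name}: coef={b:.1f} se={s:.1f} p={pv:.3f}")
```

A p-value below 0.05 for a coefficient corresponds to rejecting the null hypothesis that the variable has no linear association with charges, which is the reading applied to children and bmi above.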
8. Insights
• At the 95% confidence level, we can say the two
variables (children and bmi) have an influence on
charges.
• This gives us clues about what insured customers with high charges
look like on these two dimensions (more children and
higher BMI). That is to say, for the
insurance company to make a profit, it should
attract customers who have fewer children and lower
BMI.
9. Assumptions Check
Assumptions:
• 1. Satisfied: The error term is
approximately normally distributed.
• 3. Not satisfied: The mean of
the error term is not 0.
10. Assumptions Check
• 2. Satisfied. The means of all these
normal distributions of Y, given X, lie
on a straight line with slope b.
• 4. Not satisfied: The variance of the
error term is not constant.
• 5. Satisfied: The error terms are
uncorrelated. In other words, the
observations are drawn
independently.
• 6. Satisfied: The independent variables in X are not correlated. There is no
issue of multicollinearity.
P-value = 0.631 > 0.05, so at the 0.05 significance level we
cannot reject the null hypothesis that the independent variables are not
correlated.
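Several of the assumption checks above can be approximated with simple numeric diagnostics on the residuals. The sketch below is one possible way to do this, not the procedure behind the slides: the function names are hypothetical, the residuals and fitted values are made up, and the half-split variance ratio is a rough stand-in for a formal test of constant variance.

```python
from statistics import mean, pvariance

def durbin_watson(resid):
    """Durbin-Watson statistic; values near 2 suggest uncorrelated errors."""
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    return num / sum(e * e for e in resid)

def variance_ratio(fitted, resid):
    """Residual variance for the upper half of fitted values divided by the
    lower half; a ratio far from 1 hints at non-constant variance."""
    order = sorted(range(len(fitted)), key=lambda i: fitted[i])
    half = len(order) // 2
    low = [resid[i] for i in order[:half]]
    high = [resid[i] for i in order[half:]]
    return pvariance(high) / pvariance(low)

def pearson(x, y):
    """Pearson correlation, e.g. between two independent variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up residuals and fitted values for illustration only.
resid = [1.0, -1.0, 1.0, -1.0, 2.0, -2.0, 2.0, -2.0]
fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

print("mean of residuals:", mean(resid))                # assumption 3
print("variance ratio:", variance_ratio(fitted, resid))  # assumption 4
print("Durbin-Watson:", durbin_watson(resid))            # assumption 5
print("predictor correlation:",
      pearson([0, 1, 2, 3], [20, 22, 21, 23]))           # assumption 6
```

In this made-up example the residual spread grows with the fitted values, which is the pattern that would lead to marking assumption 4 as not satisfied.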
11. Further Research
• The linear regression analysis did not reveal a clear linear
relationship, but children and bmi do have a positive
effect on the dependent variable (charges). Therefore, my guess is that
other variables relevant to the level of insured customers’
medical bills have not yet been examined.
• The recommendation is to investigate more variables, such
as smoker or non-smoker, diet, sleeping habits and daily
workout, and also to analyze these aspects in combination,
because some diseases do not arise from a single cause.
Instead, they result from a combination of factors. We need
a more accurate statistical model built on these variables in
order to predict customers’ medical bills.
14. Milestone 3
Limitations of the research:
Even though the p-value is under 0.05, the statistical power is too low, so the
estimates can deviate widely.
This research only implies that marital status is correlated with the duration of calls; it did
not quantify the relationship between them. Besides, the relationship of duration
with other dimensions of information is also important for predicting duration and
targeting valuable customers, which needs further research such as regression analysis.
15. Milestone 4: Regression
The plots show the relationship between total international charge and total
international minutes.
Result: There seems to be a linear relationship between x and y, and they are positively correlated.
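The linear relationship between x (international minutes) and y (international charge) can be summarized with a fitted line and a correlation coefficient. The following is a minimal sketch using the closed-form least-squares formulas. The function name `fit_line` is hypothetical and the four data points are made up for illustration, not taken from the milestone data.

```python
def fit_line(x, y):
    """Return (slope, intercept, r) for the least-squares line y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5  # Pearson correlation coefficient
    return slope, intercept, r

# Made-up points for illustration only.
minutes = [5.0, 10.0, 15.0, 20.0]
charge = [1.5, 2.5, 4.5, 5.5]

slope, intercept, r = fit_line(minutes, charge)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
```

A positive slope together with r close to 1 is the numeric counterpart of the visual impression that the two variables are linearly and positively related.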
18. Milestone 5: Clustering
Lowest recency: Clusters 1 & 2. Highest frequency: Cluster 1. Highest amount: Cluster 1.
By the RFM criteria, we should choose the customer cluster with lower recency and
higher frequency and amount. From the K-means clustering results, we can see that
customers with Cluster Labels=1 best fit the criteria.
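The selection rule above (lower recency, higher frequency and amount) can be written down as a simple ranking over the cluster averages. This is a minimal sketch under stated assumptions: the function name `best_rfm_cluster` is hypothetical, the rank-sum scoring is one of several reasonable ways to combine the three criteria, and the cluster averages are made up rather than taken from the K-means output.

```python
def best_rfm_cluster(cluster_means):
    """Pick the cluster that ranks best across recency, frequency and amount.

    cluster_means maps a cluster label to (recency, frequency, amount).
    Lower recency is better; higher frequency and amount are better.
    Each cluster gets a rank per metric (0 = best) and the lowest total wins.
    """
    labels = list(cluster_means)
    score = {label: 0 for label in labels}
    # (metric index, sort descending?) for recency, frequency, amount.
    for idx, descending in ((0, False), (1, True), (2, True)):
        ranked = sorted(labels, key=lambda l: cluster_means[l][idx],
                        reverse=descending)
        for rank, label in enumerate(ranked):
            score[label] += rank
    return min(labels, key=lambda l: score[l])

# Made-up cluster averages for illustration only.
example_means = {
    0: (90.0, 2.0, 300.0),
    1: (15.0, 12.0, 4000.0),
    2: (20.0, 5.0, 900.0),
}

best = best_rfm_cluster(example_means)
print("Best cluster by RFM ranking:", best)
```

With these made-up averages, cluster 1 is best on all three metrics, so it is selected, mirroring the reasoning used for Cluster Labels=1.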