- Random forest modeling showed that history (dollars spent in the last year) and recency (months since last purchase) were the top predictors of whether a customer would visit the website after an email campaign. This intuitively makes sense, as customers who have spent more and shopped more recently are more loyal.
- The email campaign would be most successful in getting visits from these loyal customers with high history and low recency. The company should focus the campaign primarily on these customers with personalized messaging and promotions.
- Understanding which customers are most likely to visit also allows the company to better plan inventory, promotions, and pricing for an expected increase in demand from these predicted visitors. Targeting loyal customers can increase average cart size.
2. Team Adams
Contents
• Premise and Understanding the Data
• Need for Sampling
• Different Models and Their Performance
• Business Insights from Predictions
• Business Recommendation from Predictions
4. The Premise
The purpose of our project is to understand the efficacy of an email
advertising campaign. The efficacy of an email ad campaign is judged
primarily by the increase in the number of customer visits and, more
importantly, by the rise in sales after the campaign. We will attempt to
understand what influences a customer to visit the website.
This will also help the business personalize email content with relevant merchandise based
on each customer's purchase history, such as recency, amount spent, merchandise preference (cost
bracket, men/women, etc.), channel preference, etc.
Further, this will also help other departments like Supply Chain, Pricing and Merchandise
Planners to meet the predicted demand and run appropriate promotions to maximize
margins and volumes.
5. The Data
• This dataset contains a total of 44,800 customers who last purchased within twelve months.
The customers were involved in an email test.
• To keep the study unbiased, the customers in the data set were randomly divided into three groups:
• 1/3 were randomly chosen to receive an email campaign featuring Mens merchandise.
• 1/3 were randomly chosen to receive an email campaign featuring Womens merchandise.
• 1/3 were randomly chosen to not receive an email campaign.
6. Customers’ Visit Data
• Recency: Months since last purchase.
• History_Segment: Categorization of dollars spent in the past year.
• History: Actual dollar value spent in the past year.
• Mens: 1 = customer purchased Mens merchandise in the past year.
• Womens: 1 = customer purchased Womens merchandise in the past year.
• Zip_Code: Classifies zip code as Urban, Suburban, or Rural.
• Newbie: 1 = New customer in the past twelve months.
• Channel: Describes the channels the customer purchased from in the past year.
• Segment: Describes the email campaign the customer received: Mens Email, Womens Email, No Email.
• Visit: 1 = Customer visited website in the two weeks following the email campaign.
7. Understanding the Data
We can see that, as stated in the premise, the number of
customers who received each kind of email is equal.
8. Understanding the Data
•Although suburban customers made the most visits
after the campaign in absolute numbers, rural
customers have the highest visit percentage.
•From the graph to the right, we
can clearly see that customers who
were contacted by email were the
ones who visited in greater
numbers.
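The difference between absolute visit counts and visit percentages per zip code class can be sketched as below. This is a minimal illustration only: the records are invented, and the helper name `visit_summary` is hypothetical, not taken from the project code.

```python
from collections import Counter

# Hypothetical (zip_code, visit) records; the counts are invented for
# illustration and are NOT the figures from the campaign data set.
records = (
    [("Suburban", 1)] * 30 + [("Suburban", 0)] * 170
    + [("Urban", 1)] * 25 + [("Urban", 0)] * 155
    + [("Rural", 1)] * 14 + [("Rural", 0)] * 56
)


def visit_summary(rows):
    """Return {zip_code: (visits, customers, visit_rate)}."""
    totals, visits = Counter(), Counter()
    for zip_code, visit in rows:
        totals[zip_code] += 1
        visits[zip_code] += visit
    return {z: (visits[z], totals[z], visits[z] / totals[z]) for z in totals}


summary = visit_summary(records)
# The group with the most visits need not be the group with the highest rate.
most_visits = max(summary, key=lambda z: summary[z][0])
highest_rate = max(summary, key=lambda z: summary[z][2])
print(most_visits, highest_rate)
```

With these made-up counts, the largest group leads in raw visits while a smaller group leads in visit rate, which is the pattern described in the first bullet.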
9. Understanding the Data
The maximum revenue generated in the last year came from the $200-300 range. We can
use this information to stock more merchandise in this range.
11. Imbalanced Data - Need for Sampling
In business terms, having a visit percentage of 14.6% (with respect to non-visits) is good.
However, when applying machine learning models, it is advisable to have balanced data
of visits and non-visits. Most machine learning algorithms work best when the number of
samples in each class is about equal, because most algorithms are designed to
maximize accuracy and reduce error, which on imbalanced data can be achieved by
simply favoring the majority class.
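To see why imbalance is a problem, consider a model that predicts "no visit" for every customer. The sketch below assumes the 14.6% figure is the share of customers who visited, and uses a hypothetical round number of 10,000 customers; the function name is illustrative.

```python
def majority_baseline(n_customers, visit_rate):
    """Accuracy and recall for a model that predicts 'no visit' for everyone."""
    positives = round(n_customers * visit_rate)   # customers who actually visit
    negatives = n_customers - positives           # customers who do not visit
    tp, fn, tn = 0, positives, negatives          # every visitor is missed
    accuracy = (tp + tn) / n_customers
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Assumption: 14.6% is the visit share; 10,000 customers is a made-up size.
accuracy, recall = majority_baseline(10_000, 0.146)
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

Such a model scores about 85% accuracy while identifying none of the visitors, so a high accuracy on this data says little by itself.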
12. Imbalanced Data - Need for Sampling
Here is a contrast between model performance before and after sampling the data.
Please note that we employed up-sampling (oversampling the minority class).
As can be observed, the performance
scores have improved overall for
both Logistic Regression and Random Forest.
It is interesting to note the high ‘accuracy’ of
Logistic Regression before sampling.
We’ll take a closer look at what it means, and
why it is misleading, in the next slide.
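Up-sampling as described above can be sketched in plain Python as follows. This is a minimal sketch, not the project's actual code: the row layout, the `visit` label key and the function name are assumptions. It is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.

```python
import random


def upsample_minority(rows, label_key="visit", seed=0):
    """Oversample the minority class with replacement until classes are balanced."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        raise ValueError("cannot upsample: one class has no examples")
    # Draw extra minority rows (with replacement) to match the majority size.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical training rows: 10 visitors and 60 non-visitors.
train = [{"recency": i % 12 + 1, "visit": 1 if i % 7 == 0 else 0} for i in range(70)]
balanced = upsample_minority(train)
print(len(balanced), sum(r["visit"] for r in balanced))
```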
13. Logistic Regression before sampling
The accuracy and precision seem to be good. However, the
f1 score and recall_score are very poor.
Accuracy is the fraction of all predictions (visits and
non-visits) that are correct. Precision is the fraction of
customers predicted to visit who actually visited.
Both can look good on imbalanced data: accuracy is
dominated by the many correctly predicted non-visits, and
precision is computed only over the few customers the
model flags as visitors.
The recall_score measures sensitivity: the fraction of
customers who actually visited that the model identified.
A low score implies the model has identified very few of the
customers who actually visited the website post
campaign, which is a very important metric for the
business.
14. Logistic Regression before sampling
The model has correctly predicted 7658 customers who did not visit the
website. However, it has correctly predicted just 1 customer who
actually visited the website! (The poor f1 and recall scores reflect the same.)
From a business point of view, this is a very serious concern. We are
basically missing out on all those customers who are interested in
visiting our website, because our predictions fail to recognize them.
Business decisions based on these predictions will therefore offer those
customers no benefits, and the company may eventually lose them.
From an inventory point of view, the predictions mislead the merchandise
planners by understating demand. What if most of the customers
who visit the website go ahead and buy the merchandise? This will
quickly lead to out-of-stock and supply chain issues, which hurt
revenues and raise supply chain expenditures.
Understanding the Confusion Matrix
[Figure: confusion matrix with cells labeled TP, FN, TN, FP]
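The four confusion-matrix cells are enough to reproduce the scores discussed on this slide. In the sketch below, TP = 1 and TN = 7658 are the counts quoted above; the FN and FP counts are not given in the slides and are assumed purely for illustration, as is the function name.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# TP and TN come from the slide; FN=1300 and FP=0 are ASSUMED values,
# chosen only to illustrate the pattern, not taken from the source.
metrics = classification_metrics(tp=1, fn=1300, fp=0, tn=7658)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, accuracy and precision look strong while recall and F1 are close to zero, which matches the pattern described for the model before sampling.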
16. Logistic Regression after sampling
As can be seen, the model has significantly improved. Though the accuracy has decreased, the other
metrics have increased.
Please note that in this business use case, higher accuracy alone (Accuracy = (TP + TN) / (TP + FP + FN + TN)) is
NOT a great measure of performance. True Positives are more
important than True Negatives here: correctly predicting customers who will not visit the website is of
little use if the model fails to predict the customers who actually visit.
The company loses much more (in terms of investment) by missing customers who actually visit than by
wrongly predicting a visit from customers who do not visit.
18. Different Models and Their Performance
*Please note: We also tried the SVC classifier, but the code took too long to run.
1. Random Forest
2. Logistic Regression
3. KNeighbors Classifier
4. Gaussian Naive Bayes
5. AdaBoost Classifier
6. Support Vector Classifier (SVC)*
Based on the performance
metrics, we have chosen Random
Forest for further evaluation.
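A comparison of the models listed above could look roughly like the following scikit-learn sketch. It runs on synthetic data generated with `make_classification`, not on the campaign data, and the hyperparameters are defaults chosen for illustration; SVC is left out since the slide notes it was too slow to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the campaign data (roughly 15% positives).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Fit each model on the same split and score it on recall and F1,
# the metrics that matter most when visitors are the minority class.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

for name, s in scores.items():
    print(f"{name}: recall={s['recall']:.3f}, f1={s['f1']:.3f}")
```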
19. Drawing insights from Random Forest
Of all the predictors, History and Recency are the top two influences on whether the
customer is going to visit the website.
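Ranking predictors with a random forest can be sketched as below. The data is synthetic: the column names mirror the data set, but the values and the relationship between them and `visit` are assumptions made up for this example, so the resulting ranking says nothing about the real campaign data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 3000

# Synthetic columns named after the data set's fields (values are invented).
history = rng.gamma(2.0, 120.0, n)      # dollars spent in the past year
recency = rng.integers(1, 13, n)        # months since last purchase
newbie = rng.integers(0, 2, n)          # 1 = new customer

# ASSUMED toy relationship: more spend and fewer months raise visit odds.
logit = -2.0 + 0.004 * history - 0.15 * recency
visit = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

names = ["history", "recency", "newbie"]
X = np.column_stack([history, recency, newbie])

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, visit)

# Impurity-based importances sum to 1; sort from most to least important.
ranked = sorted(zip(names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances tend to favor continuous or high-cardinality features, so a permutation-based check is a common cross-check before drawing business conclusions.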
21. Business Insights from Predictions
1. History is the actual dollar value spent in the last year
2. Recency is the months since the last purchase
Intuitively, this makes sense. In general, a customer who likes the merchandise prefers to shop big
and more often.
According to the prediction, the email campaign will be more successful for customers with
a high ‘history’ purchase value and low ‘recency’. In other words, the company should focus more on the customers
who fall under this bracket.
Another angle to look at it is that these are the ‘loyal’ customers (who buy big and shop quite often),
who will visit the website and are more likely to purchase again.
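Selecting the bracket of customers described above can be sketched as a simple filter. The customer rows and both thresholds below are hypothetical; in practice the cut-offs would be chosen from the data or from the model's predicted visit probabilities.

```python
# Hypothetical customers; the values are invented for illustration.
customers = [
    {"id": 1, "history": 520.0, "recency": 1},
    {"id": 2, "history": 45.0, "recency": 2},
    {"id": 3, "history": 310.0, "recency": 3},
    {"id": 4, "history": 700.0, "recency": 11},
    {"id": 5, "history": 29.99, "recency": 9},
]


def loyal_segment(rows, min_history, max_recency):
    """Keep customers with high spend (history) and a recent purchase (low recency)."""
    return [r for r in rows
            if r["history"] >= min_history and r["recency"] <= max_recency]


# ASSUMED thresholds: at least $300 spent and a purchase within 3 months.
targets = loyal_segment(customers, min_history=300.0, max_recency=3)
print([r["id"] for r in targets])
```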
22. Business Recommendation from Predictions
Following are the areas on which the company can focus to make more revenue from this campaign:
• Loyalty/Personalization: This email campaign can be focused primarily on the ‘loyal’ customers
with a more personalized approach. Based on factors like past purchase activity, customer
buying patterns, likes and dislikes, demographic influence, etc., we can recommend the right products in
the email content. Giving them redeemable loyalty points will also attract these customers.
• Up-Selling/Cross-Selling: Since we know this specific set of customers (with high history and
low recency values) is more likely to visit the website, the company can run promotions (based on past
purchase activity) like Buy One Get One (BOGO), % off on products, etc. This will increase their
shopping cart size and thereby result in more sales.
• Inventory/Stock Availability: Since we know what this set of loyal customers prefers, it is safe to hold the
required stock in the relevant geographic area (if we know where the customer generally ships to), such as a DC or
store, which will also reduce delivery time and supply chain costs.
Please note, the above recommendations are for the specific set of customers from the given dataset who
have high ‘history’ and low ‘recency’ (whom we call the more loyal customers).