An Exploration of Sephora's Winning Formula
Ke Li, Yuyan Wang, Xinyue Yan
November 30, 2018
Abstract
When it comes to makeup, Sephora is the largest multinational chain of personal-care and beauty
e-commerce, and lip products are among the hottest items that cosmetics users are passionate
about. Whenever we log in to Sephora, there is always a section of recommended products.
Good recommendations provide customers with highly relevant, personalized service, which brings
Sephora profit and keeps customers coming back. We therefore set our first predictive task as
recommending lip make-up products that users are likely to purchase. For this task we build a
model that recommends the most relevant products to each user, applying user-based Collaborative
Filtering to finalize the model.
Looking at the purchase process, rather than checking out with whatever has been recommended,
customers usually take a second look at the reviews, which help them learn more about the product.
Since each product receives roughly 400 reviews on average, Sephora orders reviews by a
'helpfulness' score between 0 and 1, listing higher-scoring reviews first. The score is determined
by the number of "Helpful" and "Not Helpful" clicks from readers of the review. Helpful reviews
can make a product appealing, while uninformative reviews may lead users to close the window.
Exploring the data set, we found that at least 50 percent of reviews have never been clicked, even
though they might actually be helpful enough to stimulate a purchase. We therefore built models
with TF-IDF, LDA, and linear regression to predict the helpfulness of a review from its text, so
that reviews can further assist recommendations and attract users.
Keywords: Machine Learning, Text Mining, Natural Language Processing, Collaborative Filtering, TF-IDF, LDA
1 Dataset
Since no suitable data set existed, we began
our analysis by crawling the product informa-
tion and reviews for all lip make-up products
on Sephora's website
(https://www.sephora.com/shop/lips-makeup).
We then conducted exploratory analysis of the
data in order to better understand the charac-
teristics of users, products, and reviews, which
informs the design of our models in the follow-
ing sections.
1.1 Data Format
Our final data set includes 252,317 reviews from
175,434 users, covering 5,318 lip make-up products.
Reviews and product information are crawled as
JSON files organized by page and record, and each
record has two parts: (1) product attributes en-
capsulated in Includes, including the id, name,
brand, image URL, color id, the number of com-
ments the product has received, and other distinct
information for identification; (2) user attributes
encapsulated in Results, which contain the re-
viewer id, nickname, review submission time, per-
sonal features, and the review text with detailed
information. To simplify the analysis and maxi-
mize efficiency, we extract a subset of the features
for further exploration, listed in Table 1.
Table 1: Data format
name description
products_id id of each product; reviews with the same id belong to the same product
color_id id of the lipstick's colors; one product has at least one color id
category_id id of each category; one product belongs to exactly one category
description description of the product
review_statistics statistical values related to reviews (see below)
_recommended_count number of reviews in which users recommended the product
_average_overall_rating average overall rating given by users
_total_review_count total number of reviews for the product
_not_recommended_count number of "not recommended" reviews the product received
_helpful_vote_count number of helpful votes across all reviews of the product
brand_id & name id and name of each brand
author_id unique id of the user/reviewer
results list of reviews
rating rating given by the user (from 1 to 5)
review_text text of the review
context_data_values attributes of the reviewer, including age, skin type, skin tone, hair color, eye color
helpfulness feedback from other users (1 denotes helpful, 0 denotes unhelpful)
user_nickname name of the user who submitted the review
Among all these features, the most valuable are
those related to the review text, for two main
reasons. First, customers' preferences are ex-
pressed directly through the comments they sub-
mit, with either a positive or negative attitude
toward the product. Second, beyond feedback
from existing users, reviews serve as a critical
reference for new users making purchase deci-
sions. It is therefore reasonable to pay particu-
lar attention to these attributes to improve the
accuracy of the recommender system.
1.2 Exploratory Data Analysis
1.2.1 Description of user data
Based on the user data we collected from cus-
tomers who have submitted reviews, we conduct
descriptive analysis of basic user features, includ-
ing age, skin type, skin tone, eye color, and hair
color, to gain a general understanding of user
characteristics.
From the figure below, we can see that the tar-
get users of Sephora lipstick products range from
18 to 54 years old, accounting for 89.1% of all
records. Given the age group of a newly regis-
tered user, we could recommend items purchased
by other users in the same age group and adjust
the frequency of advertisements accordingly.
Figure 1: Age distribution of users
Furthermore, to enhance the performance of the
recommender system, it is necessary to take a
closer look at users' appearance features as a ba-
sis for user clustering and filtering. The statisti-
cal results are shown below.
Figure 2: Eye Colors of users
Figure 3: Skin Types of users
Figure 4: Hair Colors of users
1.2.2 Description of product data
Based on the same records, we also conduct de-
scriptive analysis of the products, to facilitate
the later text mining and LDA modeling. As a
first step, we divide the 620 products into 7
groups based on the number of reviews attached
to each. Note that we identify products by their
distinct product_id rather than color_id, be-
cause it is difficult to relate color ids to review
contents when the data lack an explanation or
description of the colors (for example, ids
1983931 and 2012706). With each lipstick prod-
uct receiving 479 reviews on average, we draw
the bar plot of the distribution.
Figure 5: Distribution of Number of Reviews
Besides the number of reviews, the popularity of
brands should also be considered as a critical
feature when creating the recommendation model.
For all brands in the dataset, we create a word
cloud, measuring each brand's popularity by the
number of reviews users submitted.
Figure 6: Popularity of Brands
Furthermore, we examine the relationship be-
tween review length and review helpfulness, for
the helpfulness-prediction task. The results, how-
ever, do not indicate a significant positive rela-
tionship. Although helpfulness increases with re-
view length, the variability of the data, measured
by the standard deviation, increases as well (as
shown in Figure 7), which requires further ad-
justment if the length feature is used in the
model.
Figure 7: Relationship between Review Length
and Helpfulness
We also explore the relationship between users
and items, using reviewers as the connection.
The figure below depicts the distribution of cus-
tomers submitting comments on products across
the 18 categories.
Figure 8: Distribution of Reviews in Categories
Given that there are 620 products in total, we
can conclude that the products reviewed by cus-
tomers overlap, which provides the foundation
for our similarity calculation and recommenda-
tion metrics.
2 Recommend Products
Sephora differentiates itself from other beauty re-
tailers by looking for what is most important and
relevant to a customer, and it continually recom-
mends products through sliding windows on its
website. For this task we build a recommenda-
tion model that recommends lipsticks to users.
We split the whole data set of users' purchase
histories into a 60% training set, a 20% valida-
tion set for tuning parameters, and a 20% test
set for evaluating model performance.
2.1 Model Baseline
To get an idea of what to expect from the sys-
tem, we try the following baseline method: pop-
ular items. In the cosmetics industry, customers
naturally follow trends and buy popular items.
This baseline recommends the most popular lip-
sticks, i.e., the ones with the largest sales, so all
users receive exactly the same set of recommen-
dations.
Figure 9: Prediction accuracy vs number of items
recommended on baseline model
From the figure above we can see that as we in-
crease the number of recommended items, the
accuracy of purchase prediction increases; when
we recommend the 600 most popular items to
each user, 90% of users in the test data buy a
product we recommend. But this does not mean
the baseline is a good recommendation strategy:
increasing the number of recommended items
brings extra cost, which we discuss in detail in
the evaluation section later.
2.2 User-based Collaborative Filtering
In this model we recommend products to a user
based on the fact that those products have been
liked by similar users. For example, if users A
and B like the same lipsticks, and a new lipstick
comes out that A likes, then we can recommend
that lipstick to B, because A and B seem to like
the same products.
2.2.1 Utility Matrix
Our recommendation system mainly involves two
entities: users and items. We record whether a
user bought a certain item. The data are repre-
sented as a utility matrix that gives, for each
user-item pair, a value of 1 if the user bought
the item and 0 if not. The matrix is sparse: for
many pairs we have no explicit information about
the user's purchase behavior. The goal of our
recommendation system is to predict whether
each "?" entry should be 1 or 0.
R =
⎡ 1 0 ⋯ ⎤
⎢ 0 ? ⋯ ⎥
⎢ ⋮ ⋱ ⋮ ⎥
⎣ 1 ? ⋯ ⎦
(rows index users, columns index items)
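In code, such a 0/1 utility matrix can be assembled as a sparse matrix. A minimal sketch in Python follows; the user and product ids are invented placeholders, not values from the crawled data:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical purchase log: (user, product) pairs stand in for the
# crawled Sephora records.
purchases = [("u1", "p1"), ("u1", "p3"), ("u2", "p1"), ("u3", "p2")]

users = sorted({u for u, _ in purchases})
items = sorted({p for _, p in purchases})
u_idx = {u: i for i, u in enumerate(users)}
p_idx = {p: j for j, p in enumerate(items)}

# R[i, j] = 1 iff user i bought item j; unobserved pairs stay at 0.
rows = [u_idx[u] for u, _ in purchases]
cols = [p_idx[p] for _, p in purchases]
R = csr_matrix((np.ones(len(purchases), dtype=np.int8), (rows, cols)),
               shape=(len(users), len(items)))
```

The CSR representation stores only the observed 1-entries, which matches the sparsity assumption above.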
2.2.2 Jaccard Similarity
Jaccard similarity takes into account the number
of preferences two users have in common: two
users are more similar when they share more re-
lated items.
Jaccard(Ui, Uj) = |Ui ∩ Uj| / |Ui ∪ Uj|
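The definition above translates directly into a small helper over the users' purchase sets (treating two empty histories as having similarity 0, which is our convention):

```python
def jaccard(items_i, items_j):
    """Jaccard similarity between two users' sets of purchased items."""
    items_i, items_j = set(items_i), set(items_j)
    union = items_i | items_j
    if not union:  # two empty purchase histories: define similarity as 0
        return 0.0
    return len(items_i & items_j) / len(union)
```

For example, two users sharing one of three distinct items get a similarity of 1/3.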
Figure 10: Distribution of Jaccard coefficient
From the distribution of Jaccard similarity values
above, the coefficients are mostly quite large;
many even lie above 0.9. This means that in our
data users have very similar purchase behavior
on lipsticks: people tend to like products that
others like as well. Our recommendation system,
based on Jaccard similarity plus popular items,
can therefore find the most relevant products for
each user. The system first recommends items
bought by the users with the highest Jaccard
similarity to the target user, and then recom-
mends some popular items.
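The two-stage procedure just described might be sketched as follows; the function name, the tiny example, and the tie-breaking by sorted item order are our own choices, not the paper's code:

```python
def recommend(target, baskets, popular, k=10):
    """Recommend up to k items: first items bought by the users most
    similar to `target` (by Jaccard similarity of purchase sets),
    then fall back to globally popular items."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    owned = baskets[target]
    # Rank the other users by similarity to the target user.
    peers = sorted((u for u in baskets if u != target),
                   key=lambda u: jaccard(owned, baskets[u]), reverse=True)

    recs = []
    for peer in peers:
        for item in sorted(baskets[peer] - owned):
            if item not in recs:
                recs.append(item)
            if len(recs) == k:
                return recs
    # Pad with popular items the user has not bought yet.
    for item in popular:
        if len(recs) == k:
            break
        if item not in owned and item not in recs:
            recs.append(item)
    return recs
```

With real data, the similarity ranking would be precomputed from the utility matrix rather than recomputed per call.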
2.3 Model Evaluation
As mentioned earlier, if we keep adding products
to the recommendation list, users will very likely
buy some item we recommend, even under the
trivial baseline model. But recommending many
products increases cost and decreases system ef-
ficiency: people may grow tired of reading so
many recommendations, and our recommenda-
tions may not contribute to their purchase be-
havior at all.
We therefore use precision to measure how accu-
rately the system predicts which items users will
purchase, and recall to measure how efficiently
the system recommends items that users will
like.
precision = |{returned items} ∩ {relevant items}| / |{returned items}|

recall = |{returned items} ∩ {relevant items}| / |{relevant items}|
To weight precision and recall equally in our
evaluation, we use the F1 metric:

F1 = 2 · (precision · recall) / (precision + recall)
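These three set-based metrics can be computed per user with a few lines; the zero-denominator conventions (empty returned or relevant sets score 0) are our assumptions:

```python
def precision_recall_f1(returned, relevant):
    """Set-based precision, recall and F1 for one user's recommendations."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Averaging these values over all users in the validation set gives the curves in the figure below.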
On the validation set, we ran our model recom-
mending from 1 to 40 items to each user. From
the plot below, recall keeps increasing with the
number of recommended items, as expected:
each user has a fixed set of relevant items, so as
we keep recommending, some of them are bound
to be covered. Precision keeps decreasing be-
cause its denominator grows, and the rate at
which we find relevant items is much slower than
the rate at which returned items increase. The
F1 score, influenced by both recall and precision,
first increases rapidly and then decreases gradu-
ally.
Figure 11: F1, Precision, Recall vs Number of
items we recommend
To make our model efficient and to balance pre-
cision against recall, we return 10 items per rec-
ommendation, which gives the highest F1 score
on validation. The F1 score of the final model
on test data is 0.278, improving the baseline F1
score by 15% and the baseline recall by 5%.
3 Predict Helpfulness
This task aims to generate an automatic mech-
anism for scoring review helpfulness, in order to
present more helpful reviews to future customers.
We extracted two attributes from the original
data set: the review text and the helpfulness
score. As stated in the abstract, the majority of
reviews do not carry a helpfulness score, so we
first filter out records with a "null" helpfulness
value. Afterwards, more than 100,000 records re-
main, which is still enough data for further pro-
cessing.
We split the filtered data set into a 50% training
set, a 20% validation set for tuning parameters,
and a 30% test set. MSE is used as the measure-
ment for model evaluation.
3.1 Model Baseline
Since we observed a scattered but still positive
trend between review length and helpfulness
score, the feature used in the baseline is the
length of the review text. We used the valida-
tion set to find the optimal threshold; the result
is shown in the following figure:
Figure 12: Find Optimal Threshold for Baseline
The optimal threshold for our baseline model is
0.004, so the baseline expression is written as:

ReviewHelpfulness = ReviewLength × 0.004    (1)
This model serves as the reference for evaluating
the performance of the models developed in the
following sections.
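Equation (1) amounts to a one-line predictor. In the sketch below, clipping at 1.0 is our own addition (the paper's baseline expression does not mention it), included only because the helpfulness score is bounded by 1:

```python
def baseline_helpfulness(review_text, slope=0.004):
    """Length-only baseline of Eq. (1): predicted helpfulness grows
    linearly with review length. Clipping at 1.0 is our assumption,
    since the score is bounded by 1."""
    return min(1.0, slope * len(review_text))
```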
3.2 TF-IDF Model
3.2.1 Model Design
In our setting, reviews containing words that are
more important across all reviews may have
higher helpfulness scores, so we quantify 'help-
fulness' by how 'important' the words in a re-
view are, using TF-IDF. TF-IDF is a numerical
statistic widely used in information retrieval.
Generally speaking, it is composed of two ele-
ments: Term Frequency (TF), which measures
how frequently a word appears in a document,
and Inverse Document Frequency (IDF), which
offsets term frequency by measuring in how
many documents the word appears. The expres-
sions we use in the model are:
TF = (# of occurrences of the word in the review) / (total # of words in the review)    (2)

IDF = log_e( # of reviews / |{review ∈ reviews : word ∈ review}| )    (3)
Before creating the frequency matrix, we
dropped stop words, stemmed the remaining
words, and transformed them into the TF-IDF
representation. For implementation, we used
TfidfVectorizer from the sklearn library for fea-
ture extraction. Note that we also constrain
max_features, so that only the top max_features
terms, ordered by term frequency across the
training set, are considered. Too many features
could lead to overfitting, while too few could fail
to differentiate the reviews, resulting in under-
fitting; we therefore tune the number of features
for a better model.
3.2.2 Training and Validation
The training stage is straightforward: we build a
sparse matrix from the TF-IDF features of the
training set and fit a linear regression model to
helpfulness. Two things require care: 1. The
predicted helpfulness score must lie within
[0, 1.0]; we use the following expression for pre-
diction:

Score = max(min(1, linear_regression_output), 0)    (4)

2. The same pre-processing steps (dropping stop
words, stemming, vectorizing) must be applied
to the validation and test sets before applying
the linear regression.
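This pipeline might look as follows with sklearn; the four stand-in reviews and scores are invented placeholders, not the paper's data, and the stemming step is omitted for brevity:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Invented stand-in reviews and helpfulness scores; the real model is
# fit on the crawled Sephora review text.
train_text = ["great color lasts all day",
              "too sticky and dries my lips",
              "great pigment great price",
              "color faded after an hour"]
train_help = [0.9, 0.6, 0.8, 0.4]

# max_features caps the vocabulary; this is the parameter tuned on the
# validation set (about 800 in the paper).
vectorizer = TfidfVectorizer(stop_words="english", max_features=800)
X_train = vectorizer.fit_transform(train_text)

reg = LinearRegression().fit(X_train, train_help)

# The same fitted vectorizer transforms unseen text; Eq. (4) clips the
# regression output into the valid [0, 1] score range.
X_new = vectorizer.transform(["great color"])
pred = float(np.clip(reg.predict(X_new), 0.0, 1.0)[0])
```

Reusing the fitted vectorizer on validation and test text is exactly the pre-processing consistency required by point 2 above.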
Once we have the linear model, we validate it on
the validation set and find the optimal number
of features to use on the test data. The follow-
ing figure shows how the MSE varies with the
max_features value:
Figure 13: Find Optimal Number of Features for
TF-IDF
We see that with approximately 800 features
(words), the TF-IDF model performs best on the
validation set.
3.2.3 Evaluation
The performance of the TF-IDF model and the
baseline model on the validation data set is
given in Table 2:
Table 2: Results of TF-IDF on Validation Set
Model MSE
Baseline Model 0.15002893
TF-IDF + Linear_Regression 0.11272753
The model using TF-IDF and linear regression
reduces the baseline MSE by 24.8%.
3.3 LDA Model
A second way to view the helpfulness of a review
is through the topic the review discusses. Here
we use Latent Dirichlet Allocation (LDA) to re-
trieve the latent topics behind the review text.
LDA is a Bayesian learning method: it assumes
a latent random variable decides the review's
topic with distribution P(topic), and that given
the topic, the words in the review are drawn ac-
cording to the conditional probability
P(word|topic). The task of the LDA model is to
learn the prior probability P(topic) and the con-
ditional probability table P(word|topic) from the
review text.
After establishing the model structure, we ex-
plored whether the topic matrix has better pre-
dictive performance on helpfulness than both the
baseline and the TF-IDF model. Compared with
TF-IDF, the LDA model's features capture the
topics discussed in the text. Intuitively, the topic
may directly affect how helpful customers find
the review: a review that mostly discusses how
wonderful the chat with a beauty adviser was,
rather than how much the reviewer likes the lip-
stick's moisturizing effect, will be considered less
helpful.
Similarly, before creating the topic matrix, we
tokenized the review text into words and used
CountVectorizer to transform them into a matrix
representation.
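A minimal sketch of this stage with sklearn follows; the four example reviews are invented, and the toy topic count of 2 stands in for the tuned value discussed below:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented stand-in reviews; the paper fits on the full review corpus.
texts = ["love the color and the shade",
         "balm keeps lips soft all winter",
         "color is perfect nude shade",
         "dry lips need this balm daily"]

# Bag-of-words counts are the input to LDA.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(texts)

# n_components is the topic count; the paper grid-searches 15 to 25
# and settles on 20. Two topics suffice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts)  # one topic-proportion row per review
```

The `topics` matrix, with one row of topic proportions (summing to 1) per review, then serves as the feature matrix for the linear regression onto helpfulness.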
3.3.1 Training and Validation
We started with a 15-topic model and computed
the average helpfulness for each topic:
Figure 14: Relationship between topic and help-
fulness
From the figure we observe that not all topics
are equally helpful: if the dominant topic of a
given review is topic 12, the review is likely to
have a lower helpfulness score than one domi-
nated by topic 2 or topic 19. A model that makes
topics differ most in helpfulness is therefore what
we are looking for.
The number of topics in an LDA model is key to
whether the topics make sense. We therefore
performed a grid search from 15 to 25 topics in
steps of 1, training LDA on the transformed ma-
trix. Given the LDA output, we then trained a
linear regression model from the topic propor-
tions to the helpfulness score. Table 3 shows the
parameters of the different LDA models and
their MSE on the validation set.
Table 3: MSE of predicting helpfulness using different numbers of topics
Number of Topics MSE on Validation Set Log Likelihood Model Perplexity
15 0.131865 -5839527.734060916 423.2268901699496
16 0.129842 -5862523.743103024 433.42771242741054
17 0.127493 -5871409.85324313 437.43504911060467
18 0.122465 -5871373.935690109 437.4187771560674
19 0.113974 -5874825.186300991 438.9850874537416
20 0.110873 -5891637.942804745 446.6959467805993
21 0.110842 -5912014.697301019 456.22314533000224
22 0.110797 -5911136.048998294 457.80816999900003
23 0.110731 -5936159.312419161 459.43154819404225
24 0.110727 -59371523.23856239 467.7375391299828
25 0.110706 -5976159.232312414 477.34532429124632
Three quantities matter when choosing an LDA
model: we prefer a model with minimal MSE,
but also a high log likelihood and low perplexity.
Taking all of these into account on the valida-
tion set, the model with 20 topics offers the best
trade-off, because the MSE decreases little be-
yond 20 topics while the perplexity keeps rising.
We therefore used 20 topics in our final LDA
model.
Table 4 lists the top 10 words for each of the 20
topics:
Table 4: Top 10 Words in the 20-Topic Model
Top 10 Words in LDA Topic Model
0 pencil touch lasted ago box month incredibly wore play drugstore
1 love colour lip absolutely power staying color look doe beautiful
2 product price worth size packaging great high totally scrub good
3 lipstick color formula liquid shade dry like drying doe transfer
4 wa sephora try store did color went bought tried looked
5 gloss lip sticky color love like just look shine glossy
6 review brand wish sephora cute kat soon von does better
7 lip product use dry balm day like time using work
8 long lip lasting liner recommend pigmented highly product love creamy
9 color perfect love shade lipstick nude skin tone wear red
10 packaging texture small warm case great smooth purse hand color
11 lip color stain doe apply dry hour like just need
12 natural look tint happy drinking looking lip eating color getting
13 sheer pigment hydrating rich color shimmer expected summer able job
14 lip dry soft year help winter skin literally greasy having
15 brand sephora store went buy kat cute bought von multiple
16 wa color did try looked just like got really thought
17 review high price product 10 did application minute slight opinion
18 lip balm product use ve used day using time tried
19 love color stay lipstick lip doe pencil great day long
In the 20-topic model, the top words in each
topic make sense when put together. For exam-
ple, in topic 5 the word "sticky" is highly
weighted, suggesting this topic describes the tex-
ture of lip gloss; likewise, topic 17 discusses the
review and the rating of the product.
3.3.2 Evaluation
The model using LDA and linear regression has
a validation MSE of 0.110842, which reduces the
baseline MSE by 26.1%.
3.4 Model Optimization
Since we project both the TF-IDF and the LDA
transformed matrices onto helpfulness using lin-
ear regression, we want to regularize the regres-
sion output.
Table 5: Influence of Review Length on Helpfulness
Review Length Score Mean Score Std
0 to 200 0.238 0.0340
200 to 400 0.207 0.0244
400 to 600 0.189 0.0395
600 to 800 0.183 0.0719
800 to 1000 0.176 0.104
1000 to 1500 0.174 0.119
> 1500 0.186 0.176
Notice that as review length increases, the stan-
dard deviation of helpfulness also increases. We
therefore design a regularization/penalty term
based on length:
Reg = −λ × 10⁻⁵ × log(Review_Length)    (5)
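Equation (5) is straightforward to implement; note that how the penalty enters the final prediction (added to the regression output, then clipped back into [0, 1]) is our reading of the paper, not something it states explicitly:

```python
import math

def length_penalty(review_length, lam):
    """Penalty term of Eq. (5): Reg = -lam * 1e-5 * log(review_length)."""
    return -lam * 1e-5 * math.log(review_length)

def regularized_score(raw_score, review_length, lam):
    # Assumed combination: add the penalty to the regression output,
    # then clip back into the valid [0, 1] score range.
    return max(0.0, min(1.0, raw_score + length_penalty(review_length, lam)))
```

The tuning below searches over λ (`lam`) on the validation set for each model.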
We then applied this regularization term to both
models; the results are shown below:
Figure 15: Optimization of TF-IDF Model
Figure 16: Optimization of LDA Model
The optimal lambda for the TF-IDF model is 7,
and the optimal lambda for the LDA model is 4.
3.5 Model Evaluation
Performance of all models on validation set and
test set are listed below:
Table 6: Model Comparison
Model Validation Set Test Set
Baseline 0.15002893 0.19213458
TF-IDF 0.11148997 0.13033192
LDA 0.11061232 0.12134558
4 Related Studies
With a similar objective, the author of the paper
Collaborative Embeddings for Lipstick Recom-
mendations applied the GloVe algorithm to build
a recommender system, framing the problem as
matrix factorisation. Taking users' browsing ses-
sions with product contexts as input data, the
author decomposed the log of the product co-
occurrence matrix into an embedding matrix E
and a bias vector b. Learned by mini-batch sto-
chastic gradient descent, the parameters E and
b reveal latent relationships between products,
supporting a fully collaborative item-based rec-
ommender system that predicts users' purchase
behavior. In particular, the author proposed em-
bedding algebra to avoid possible bias caused by
brand attributes and to promote better product
discovery.
In the paper Improved Collaborative Filtering
Algorithm Based on Multifactor Fusion and
User Features Clustering, the authors put for-
ward an improved algorithm based on multifac-
tor fusion and user-feature clustering, computing
user similarity from rating similarity and item-
category preference. Meanwhile, Marko Bala-
banović and Yoav Shoham compared pure col-
laborative with pure content-based systems and
further discussed the cold-start problem in
Content-Based, Collaborative Recommendation,
informing our selection of the similarity func-
tion.
Also, in Margaret Fu's Recommendation System
for E-commerce Services, the author combines
classification and collaborative filtering to pre-
dict the categories of movies that users will like.
The system is based on a user community graph
built from past watching history; the model de-
termines similarity between users via the edges
between them, assigns different weights to a
user's neighbours, and values opinions from
highly weighted neighbours more than those
from low-weighted ones. Using this methodol-
ogy, they build a naive Bayes learning algorithm
to estimate the probability that each user will
like a certain movie genre. All of these methods
are adaptable, and the framework is easy to ex-
tend, facilitating our user-based recommendation
for Sephora.
5 Summary
In this project, we first crawled data from
Sephora's website. Since the site uses dynamic
loading techniques, retrieving all the reviews in
JSON format was quite challenging. After col-
lecting the data, we explored the properties of
the overall data set.
Next, we built two models: one recommending
products to users based on their past activities,
and one predicting the helpfulness of reviews in
order to present more helpful reviews to future
users. After iterating on features and models,
we arrive at our final conclusions:
• To recommend lip make-up products, we
built a recommendation system that pre-
dicts users' purchase behavior using user-
based collaborative filtering. The model an-
alyzes users' past purchases and compares
purchases between users to predict future
behavior and thereby generate recommen-
dations.
• To predict review helpfulness, we extracted
the topics behind review text using LDA,
then projected the output onto the helpful-
ness score with a linear model penalized by
review length. Compared to the baseline,
the final model successfully decreased MSE
by 36.8%.