Customer reviews are an important feature on Amazon’s vast array of products. Many customers rely heavily on the honest reviews of past users when making purchasing decisions. Currently, the only way to regulate the quality of these reviews is for other users to voluntarily vote a review up or down as ‘helpful’ or ‘not helpful’. It is in the best interest of Amazon (and potential customers) to show the most helpful reviews first and de-prioritize (or flag) useless ones. Thus, we wanted to create a model that could successfully predict whether customers would find a product review helpful. With such a model, Amazon would be able to better prioritize the user reviews displayed on product pages from the moment a review is posted.
Predicting Helpfulness of User-Generated Product Reviews Through Analytical Models
1. Predicting Helpfulness of Amazon’s User-Generated Product Reviews
Ankita Kaul & Nicholas Baladis
MIT Sloan – Spring 2015
2. Project Motivation
Amazon prioritizes product reviews that customers deem ‘helpful’, only after customers have voluntarily voted so.
Customers can voluntarily vote here
3.
4. …Amazon could predict which reviews are helpful, the moment they are posted?
[Slide graphic labels: Product Rating, Helpfulness score]
5. Data Galore*
Our data consisted of Amazon user-generated product reviews, spanning all product categories and a period of 18 years. Each ‘observation’ is a customer’s review.
• Reviewer ID
• Helpfulness Rating
• Product ID
• Product Price
• Timestamp of review
• Review Prose
• Score
Data structure – we had to downsize (a filtering sketch follows the data source note below):
~35M reviews, all categories → downsize → ~1.2M reviews, Electronics category → downsize → ~18K reviews, only those with >10 votes
*Data procured from Stanford University: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
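A minimal sketch of this downsizing step, assuming the raw review dump has been loaded into a pandas DataFrame; the file name and column names (category, helpful_yes, helpful_total, reviewText, price, score) are hypothetical stand-ins, not taken from the original analysis.

```python
import pandas as pd

# Hypothetical columns: 'category', 'helpful_yes' (helpful votes),
# 'helpful_total' (total votes), 'reviewText', 'price', 'score', ...
reviews = pd.read_csv("amazon_reviews.csv")                  # ~35M reviews, all categories

electronics = reviews[reviews["category"] == "Electronics"]  # ~1.2M reviews

# Keep only reviews that received more than 10 helpfulness votes (~18K reviews)
voted = electronics[electronics["helpful_total"] > 10].copy()
```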
6. Analysis Approach
The Setup
Dependent variable – is a review helpful or not?
• ‘Yes’ if >75% of voters agree
• Binary variable
Independent variables
Pre-existing from data set:
• Product Price
• Overall product rating
Newly calculated:
• Word count of review prose
• Readability grade-level score
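Continuing the sketch above, a rough illustration of how the dependent variable and the two newly calculated features might be derived. The column names remain hypothetical, and the syllable counter is a crude heuristic, not the project's exact readability implementation; the underlying formula is the Flesch-Kincaid grade level shown at the end of this slide.

```python
import re

def flesch_kincaid_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level; syllables are counted
    as vowel groups, which is only a rough heuristic."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w))) for w in words)
    return 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59

# Dependent variable: 'helpful' if >75% of voters agreed the review was helpful
voted["helpful"] = (voted["helpful_yes"] / voted["helpful_total"] > 0.75).astype(int)

# Newly calculated independent variables
voted["wordcount"] = voted["reviewText"].str.split().str.len()
voted["grade_score"] = voted["reviewText"].apply(flesch_kincaid_grade)
```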
The Methodology
On unclustered data set:
• Linear Regression
• Logistic Regression
• CART
• Cross-Validated CART
• Random Forest
• Bag of Words
On clustered data set:
• Logistic Regression
• CART
• Cross-Validated CART
• Random Forest
• Bag of Words
Flesch-Kincaid method:
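The formula itself did not survive extraction; for reference, the standard Flesch-Kincaid grade-level formula is:

```latex
\text{Grade level} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right)
                   + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59
```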
7. Predictions on Unclustered Data Set
Our predictive models look promising:

Methodology              Accuracy
Baseline                 74.95%
Linear Regression        R² = 0.273
Logistic Regression      81.44%
CART                     80.88%
Cross-V CART             81.84%
Random Forest            81.94%
BoW & Logistic Reg       81.08%
BoW & CART               79.80%
BoW & Cross-V CART       78.16%
BoW & Random Forest      82.08%

[BoW & CART tree diagram: splits on score >= 2.5, price < 210, ‘work’ >= 0.5, score >= 1.5, price < 30]
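Continuing the earlier sketches, a minimal illustration of the best-performing unclustered pipeline (bag of words + random forest), using scikit-learn as a stand-in for whatever tooling the project actually used. The vocabulary cutoff, tree count, and column names are illustrative assumptions, not the settings behind the 82.08% figure.

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

train, test = train_test_split(voted, test_size=0.3, random_state=1)

# Bag of words on the review prose: keep terms appearing in at least 1% of training reviews
vec = CountVectorizer(min_df=0.01, stop_words="english")
bow_train = vec.fit_transform(train["reviewText"])
bow_test = vec.transform(test["reviewText"])

# Append the structured variables (price, product score, word count, grade score),
# assumed numeric and non-missing for this sketch
num_cols = ["price", "score", "wordcount", "grade_score"]
X_train = hstack([bow_train, csr_matrix(train[num_cols].values)])
X_test = hstack([bow_test, csr_matrix(test[num_cols].values)])

rf = RandomForestClassifier(n_estimators=200, random_state=1)
rf.fit(X_train, train["helpful"])
print("accuracy:", accuracy_score(test["helpful"], rf.predict(X_test)))
```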
8. Clustering the Data Set
Cluster 1 – Eloquent & wordy
• Highest word count
• Highest grade score
Cluster 2 – Cheap products & less wordy
• Lowest price
• Low word count
Cluster 3 – Worse products & shortest reviews
• Lowest word count
• Lowest product score
Cluster 4 – The ‘average’ group
• Average in all variables
Cluster 5 – Expensive products & least articulate reviews
• Highest price
• Low grade score
[Cluster size breakdown: 15%, 35%, 31%, 14%, 5%]
[Figure: cluster dendrogram (height axis)]
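The dendrogram suggests hierarchical clustering; below is a sketch of how five clusters could be produced with Ward's method on the normalized independent variables, continuing the earlier sketches. The algorithm choice, feature set, and normalization are assumptions inferred from the slide, not stated specifications.

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Normalize the independent variables so no single scale dominates the distances
feature_cols = ["price", "score", "wordcount", "grade_score"]   # hypothetical names
features = voted[feature_cols].astype(float)
features = (features - features.mean()) / features.std()

Z = linkage(features.values, method="ward")                  # hierarchical (Ward) clustering
voted["cluster"] = fcluster(Z, t=5, criterion="maxclust")    # cut the tree into 5 clusters
print(voted["cluster"].value_counts(normalize=True))         # cluster size breakdown
```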
9. Clustered Data Set Results
Clustering provided us mixed results on our models:

Cluster     Baseline Accuracy   Best Performing Accuracy   Best Performing Methodology
Cluster 1   90.52%              90.52%                     Baseline (no improvement through modeling)
Cluster 2   85.24%              86.08%                     Random Forest
Cluster 3   65.31%              76.74%                     Bag of Words & Random Forest
Cluster 4   68.63%              82.24%                     Bag of Words & Cross-Validated CART
Cluster 5   70.31%              84.34%                     Logistic Regression (+14% improvement)

Cluster-then-predict total accuracy = 76.81%
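A sketch of how the cluster-then-predict total accuracy can be computed: fit one model per cluster and pool the predictions over the full test set. Logistic regression is used here only as a placeholder; the actual analysis used a different (and varying) best model per cluster, as listed above, and assigned test points to clusters more carefully than this continuation of the earlier sketches does.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

feature_cols = ["price", "score", "wordcount", "grade_score"]   # hypothetical names
train, test = train_test_split(voted, test_size=0.3, random_state=1)

pooled_pred = pd.Series(0, index=test.index)     # predictions pooled across clusters
for c in sorted(train["cluster"].unique()):
    tr, te = train[train["cluster"] == c], test[test["cluster"] == c]
    model = LogisticRegression(max_iter=1000)    # placeholder per-cluster model
    model.fit(tr[feature_cols], tr["helpful"])
    if len(te):
        pooled_pred.loc[te.index] = model.predict(te[feature_cols])

print("cluster-then-predict accuracy:",
      accuracy_score(test["helpful"], pooled_pred))
```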
10. Bag of Words Text Analytics + CART Examples on Clustered Set
[Two CART tree diagrams, for Cluster 4 and Cluster 5: splits on score, word count, readability grade score, and stemmed review words such as ‘epson’, ‘might’, ‘keep’, ‘pretti’, ‘speaker’, ‘fine’, ‘chang’, ‘window’, ‘issu’, ‘real’]
12. Conclusions
Our best performer was Bag of Words + Random Forests on the complete data set: 74.95% (baseline) → 82.08% (BoW + RF).
The cluster-then-predict methodology did not beat modeling the entire set: 74.95% (baseline) → 76.81% (cluster-then-predict).
However, clustering gave us other interesting results:
• Clusters 1, 2, 4, and 5 beat even the best models we developed on the entire data set
• Cluster 1 had such a high baseline (90.52%) that no model is needed
• Cluster 5 had a +14% improvement, higher than any other model
Amazon can predict the helpfulness of reviews the moment they are posted with reasonable accuracy using a two-step model: (1) cluster, (2) predict by cluster. By applying such analytics, Amazon can potentially flag unhelpful reviews at the time of posting and help create a better decision-making experience for customers.