Data mining for causal inference: Effect of recommendations on Amazon.com

Data mining for
causal inference
AMIT SHARMA
Postdoctoral Researcher, Microsoft Research
(Joint work with JAKE HOFMAN and DUNCAN
WATTS, Microsoft Research)
http://www.amitsharma.in
@amt_shrma
1

My research
Analyzing the effect of online systems
◦ Recommender systems [WWW ’13, EC ’15, CSCW ‘15]
◦ Social news feeds [CSCW ‘16]
◦ Web search
Methodological
◦ Threats to large-scale observational studies [WWW ’16b]
◦ Mining for natural experiments [EC ‘15]
◦ New identification strategies suited for fine-grained data
◦ Testing assumptions for validity of an instrumental variable
◦ Gaps between prediction and understanding [WWW ’16a, ICWSM ‘16]
2

How much do they
change user behavior?
4

Naively, up to 30% of traffic
comes from recommendations
5

Naively, up to 30% of traffic
comes from recommendations
“Burton Snowboard, a sports retailer, reported
that personalized product recommendations
have driven nearly 25% of total sales since it
began offering them in 2008. Prior to this,
Burton’s customer recommendations consisted
of items from its list of top-selling products.”
6

Example: product browsing on
Amazon.com

Counterfactual browsing: no
recommendations

Problem: Correlated demand may
drive page visits, even without
recommendations

The problem of correlated
demand
Demand
for winter
accessories
Visits to
winter hat
Rec. visits
to winter
gloves
14

Goal: Estimate the causal
effect
Causal
Convenience
OBSERVED CLICK-THROUGHS WITHOUT RECOMMENDER
Convenience
?
15

Ideal experiment: A/B Test
Treatment (A) Control (B)
But, experiments:
may be costly
hamper user experience
require full access to the system
16

Using natural variations to
simulate an experiment
18

Studying sudden spikes,
“shocks” to demand for a book
[Carmi et al. 2012]
19

The same author’s recommended
book may also have a shock
20

Past work
Uses statistical models to control for confounds
Carmi et al. [2012], Oestreicher and Sundararajan [2012] and Lin [2013]
construct “complementary sets” of similar, non-recommended
products.
Garfinkel et. al. [2006] and Broder et al. [2015] compare to model-
predicted clicks without recommendations.
But,
1. These assumptions are hard to verify.
2. Finding examples of valid shocks requires ingenuity
and restricts researchers to very specific categories
21

This talk: Using data mining for
natural experiments
I. Data-driven instrumental variables
“Shock-IV” method: Mining for sudden spikes (“shocks”) in data
II. General data-driven identification strategy for
time series data
“Split-door” criterion: Generalizing the idea of shocks
Throughout, we will use Amazon’s recommendation system as an
example.
22

I. Shock-IV: Mining
for valid natural
experiments
23

Distinguishing between
recommendation and direct traffic
All visits to a
product
Recommender
visits
Direct visits
Search visits
Direct
browsing
Proxy for unobserved demand
24

The Shock-IV strategy:
Searching for valid shocks
? ?
25

The Shock-IV strategy: Filtering
out invalid shocks
26

Why does it work? Shock as an
instrumental variable
Demand
Focal
visits (X)
Rec.
visits (Y)
Sudden
Shock
Direct
visits (Y)

Computing the causal
estimate
Increase in
recommendation
clicks ( )
Causal CTR (
*Same as Wald estimator
for instrumental variables
Increase in
visits to focal
product ( )

Application to Amazon.com,
using Bing toolbar logs
•
•
•
Sept 2013-May 2014

Recreating sequence of page
visits by a user

visits by a user
Timestamp URL
2014-01-20
09:04:10
http://www.amazon.com/s/ref=nb_sb_nos
s_1?field-keywords=George%20saunders
2014-01-20
09:04:15
http://www.amazon.com/dp/0812984250/
ref=sr_1_1
2014-01-20
09:05:01
ref=pd_sim_b_2

visits by a user
Timestamp URL
2014-01-20
09:04:10
http://www.amazon.com/s/ref=nb_sb_no
ss_1?field-keywords=George%20saunders
2014-01-20
09:04:15
ref=sr_1_1
2014-01-20
09:05:01
ref=pd_sim_b_2
User searches for
George Saunders
User clicks on the first
search result
User clicks on the
second recommendation

I. Weekly and seasonal patterns in
traffic, nearly tripling in holidays

II. 30% of all pageviews come
through recommendations

III. Books and eBooks are the
most popular categories by far

IV. Apparel and shoes see a
substantially higher fraction of
visits through recommendations

Shock-IV: Finding shocks in
user visit data
We look for focal products with large and sudden
increases in views relative to typical traffic.
Size of shock exceeds:
◦ 5 times median traffic
◦ Shock exceeds 5 times the previous day's traffic and 5 times the
mean of the last 7 days.
Shocked product has:
◦ Visits from at least 10 unique users during the shock
◦ Non-zero visits for at least five out of seven days before and after
the shock
38

Shock-IV: Ensuring exclusion
restriction
Recommended product (Y) should have constant
direct visits during the time of the shock.
(1-β): Ratio of maximum 14-day variation in visits to a
recommended product to the size of the shock for the focal
product.
Direct traffic to Y is
stable relative to
the shock to the
focal product.
β = 1 Direct traffic to Y is
no less varying
than the shock to
focal product.
β = 0
39

How to choose 𝛽?
Accept
RejectSelect 𝛽 = 0.7

Using the method, obtain
>4000 natural experiments!

Estimating the causal
clickthrough rate (𝜌)

Causal click-through rate by
product category

Estimating fraction of observed
click-throughs that are causal
Compare the number of estimated causal clicks to
all observed recommendation clicks (non-shock
period).
45

Only a quarter of the observed
click-throughs are causal
At β = 0.7, only 25% of
recommendation traffic is
caused by the recommender.

Generalization?
Shocks may be due to
discounts or sales
Lower CTR may be due to
the holiday season
47

Local average treatment effect
(LATE), not fully generalizable
Shocked products are not a representative sample of
all products, nor are the users who participate in them.
• Fortunately, Shock-IV method covers roughly one-fifth of
all products with at least 10 visits on any single day.
• Causal estimates are consistent with experimental
findings (e.g., Belluf et. al. [2012])
48

Summary: Shock-IV method
I. Mining for instruments allows us to study a much larger
sample of natural experiments.
II. Fine-grained data allowed us to test for exclusion
restriction directly.
A simple, scalable method for causal inference.
◦ Can used for improving recommender systems through causal metrics.
◦ Can be applied to other domains, such as online ads.
◦ Can be used for finding potential instruments.
49

II. Generalizing Shock-IV:
“Split-door” criterion
50

Let’s have a look at the model
again
Demand
Focal
visits (X)
Rec.
visits (Y)
Sudden
Shock
Direct
visits (Y)

Focal Product Recommended Product
Accept
Accept
54

The split-door criterion
Instead of searching for shocks,
Check whether direct traffic for Y is
independent of visits to X.
Demand
Focal
visits (X)
Rec.
visits (Y)
Direct
Visits
(YD)
55

More formal: Why does it
work?
Demand
Focal
visits (X)
Rec.
visits (Y)
Direct
Visits
(YD)

Two possibilities, both remove
the effect of common demand
Demand
Focal
visits (X)
Rec.
visits (Y)
Dir. visits
(YD)
Demand
Focal
visits (X)
Rec.
visits (Y)
Dir. visits
(YD)

Sidenote: Split-door criterion
generalizes Shock-IV
By capturing shocks, we were essentially capturing
notion of independence between X and 𝑌𝐷
Split-door will admit all valid shocks, as also other
variations.
58

Applying to logs from Amazon
recommendations
1.
2.

Summary: A general
identification criterion
Split-door criterion admits a broader sample of
natural experiments than shocks.
Automatically tests for valid identification. Can be
used whenever 𝑌𝑑 is separable.
Applications: Evaluate the relationship between
any two timeseries: e.g. social media and news, ads
and search.
61

Conclusion
Majority of traffic from recommendations may be
not causal, simply convenience.
Two data-driven methods:
• Shock-IV: An IV-based method for mining
exclusion-valid instruments from observational
data
• Split-door: A general identification strategy for
time series data.
62

More generally, data mining can
augment causal inference
methods
Hypothesize about a
natural variation
Argue why it resembles a
randomized experiment
Compute causal effect
Develop tests for
validity of natural
variation
Mine for such valid
variations in
observational data
Compute causal
effect
63

Thank you!
AMIT SHARMA
MICROSOFT RESEARCH
@amt_shrma http://www.amitsharma.in
Hypothesize about a
natural variation
Argue why it resembles a
randomized experiment
Develop tests for validity of
natural variation
Mine for such valid variations
in observational data
Sharma, A., Hofman, J. M., & Watts, D. J. (2015). Estimating the causal impact of
recommendation systems from observational data. In Proceedings of the Sixteenth ACM
Conference on Economics and Computation.
64

Data mining for causal inference: Effect of recommendations on Amazon.com

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (15)

Similar to Data mining for causal inference: Effect of recommendations on Amazon.com

Similar to Data mining for causal inference: Effect of recommendations on Amazon.com (20)

More from Amit Sharma

More from Amit Sharma (6)

Recently uploaded

Recently uploaded (20)

Data mining for causal inference: Effect of recommendations on Amazon.com