MapReduce:
Beyond Word Count
Jeff Patti
https://github.com/jepatti/mrjob_recipes
What is MapReduce?
“MapReduce is a programming model for processing large
data sets with a parallel, distributed algorithm on a cluster.”
- Wikipedia
Map - given a line of a file, yield key: value pairs
Reduce - given a key and all values with that key from the
prior map phase, yield key: value pairs
Word Count
Problem: count frequencies of words in
documents
Word Count Using mrjob
def mapper(self, key, line):
    for word in line.split():
        yield word, 1

def reducer(self, word, occurrences):
    yield word, sum(occurrences)
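For context, here is how that fragment fits into a complete runnable job (a minimal sketch; the class and file names are mine, not from the repo):

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, key, line):
        # key is None for plain-text input; each call receives one line
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        # occurrences is an iterator of all the 1s emitted for this word
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

Run it locally with `python word_count.py input.txt`; the same script runs on Hadoop or EMR via mrjob's -r runner flag.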
Sample Output
"ligula" 4
"ligula." 2
"lorem" 5
"lorem." 4
"luctus" 3
"magna" 5
"magna," 3
"magnis" 1
Monetate Background
● Core products are merchandising,
personalization, testing, etc.
● A/B & Multivariate testing to determine
impact of experiments
● Involved with >20% of ecommerce spend
each holiday season for the past 2 years
running
Monetate Stack
● Distributed across multiple availability zones
and regions for redundancy, scaling, and
lower round trip times
● Real-time decision engine using MySQL
● Nightly processing of each day's data via
Hadoop using mrjob, a Python library for
writing MapReduce jobs
Beyond Word Count
● Activity stream sessionization
● Product recommendations
● User behavior statistics
Activity Stream Sessionization
Goal: collate user activity, splitting it into
separate sessions whenever the user is inactive
for more than 5 minutes
Input format: timestamp, user_id
Collate user activity
def mapper(self, key, line):
    timestamp, user_id = line.split()
    yield user_id, timestamp

def reducer(self, uid, timestamps):
    # epoch timestamps all have the same width, so a string sort is safe here
    yield uid, sorted(timestamps)
Sample Output
"998" ["1384389407", "1384389417", "1384389422",
"1384389425", "1384390407", "1384390417",
"1384391416", "1384392410", "1384392416",
"1384395420", "1384396405"]
"999" ["1384388414", "1384388425", "1384389419",
"1384389420", "1384390420", "1384391415",
"1384391418", "1384393413", "1384393425",
"1384394426", "1384395416", "1384396415",
"1384396422"]
Segment into Sessions
MAX_SESSION_INACTIVITY = 60 * 5  # 5 minutes, in seconds
...
def reducer(self, uid, timestamps):
    # timestamps arrive as strings; convert before doing arithmetic on them
    timestamps = sorted(int(t) for t in timestamps)
    start_index = 0
    for index, timestamp in enumerate(timestamps):
        if index > 0:
            if timestamp - timestamps[index - 1] > MAX_SESSION_INACTIVITY:
                # gap exceeded: close out the current session
                yield uid, timestamps[start_index:index]
                start_index = index
    yield uid, timestamps[start_index:]
Sample Output
"999"[1384388414, 1384388425]
"999"[1384389419, 1384389420]
"999"[1384390420]
"999"[1384391415, 1384391418]
"999"[1384393413, 1384393425]
"999"[1384394426]
"999"[1384395416]
"999"[1384396415, 1384396422]
Product Recommendations
Goal: for each product a client sells, generate
a 'people who bought this also bought that'
recommendation
Input: product_id_1, product_id_2, ...
Coincident Purchase Frequency
from itertools import permutations

def mapper(self, key, line):
    # dedupe with a set so a product repeated in one order isn't double counted
    purchases = set(line.split(','))
    for p1, p2 in permutations(purchases, 2):
        yield (p1, p2), 1

def reducer(self, pair, occurrences):
    p1, p2 = pair
    yield p1, (p2, sum(occurrences))
Sample Output
"8" ["5", 11]
"8" ["6", 19]
"8" ["7", 14]
"8" ["9", 11]
"9" ["1", 20]
"9" ["10", 22]
"9" ["11", 21]
"9" ["12", 13]
Top Recommendations
def reducer(self, purchase_pair, occurrences):
    p1, p2 = purchase_pair
    # put the count first so sorting ranks pairs by co-purchase frequency
    yield p1, (sum(occurrences), p2)

def reducer_find_best_recos(self, p1, p2_occurrences):
    top_products = sorted(p2_occurrences, reverse=True)[:5]
    top_products = [p2 for occurrences, p2 in top_products]
    yield p1, top_products

def steps(self):
    return [self.mr(mapper=self.mapper, reducer=self.reducer),
            self.mr(reducer=self.reducer_find_best_recos)]
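Note: self.mr(...) is the step-construction API from the mrjob version current at the time of this talk; in newer mrjob releases the same two-step pipeline is declared with MRStep (a sketch of the equivalent):

from mrjob.step import MRStep

def steps(self):
    return [MRStep(mapper=self.mapper, reducer=self.reducer),
            MRStep(reducer=self.reducer_find_best_recos)]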
Sample Output
"7"
"8"
"9"

["15", "18", "17", "16", "3"]
["14", "15", "20", "6", "3"]
["15", "17", "19", "6", "3"]
Top Recommendations
Multi-Account
def mapper(self, key, line):
    # input: account_id, then a comma-separated purchase list
    account_id, purchases = line.split()
    purchases = set(purchases.split(','))
    for p1, p2 in permutations(purchases, 2):
        yield (account_id, p1, p2), 1

def reducer(self, purchase_pair, occurrences):
    account_id, p1, p2 = purchase_pair
    yield (account_id, p1), (sum(occurrences), p2)

The 2nd-step reducer is unchanged: it now simply keys on (account_id, p1).
Sample Output
["9", "20"]
["9", "3"]
["9", "4"]
["9", "5"]
["9", "6"]
["9", "7"]
["9", "8"]
["9", "9"]

["8", "14", "13", "10", "1"]
["2", "4", "16", "11", "17"]
["3", "18", "11", "16", "15"]
["2", "1", "7", "18", "17"]
["12", "3", "2", "17", "16"]
["18", "5", "17", "1", "9"]
["20", "14", "13", "10", "4"]
["18", "7", "6", "5", "4"]
User Behavior Statistics
Goal: efficiently compute statistics about user
behavior (conversion rate & time on site) by
account and by experiment
Input:
account_id, campaigns_viewed, user_id, purchased?,
session_start_time, session_end_time
Statistics Primer
With the sample count, mean, and variance for
each side of an experiment, we can compute all
the statistics our analytics package displays
Statistics Primer (cont.)
y = a session's metric value, e.g. time on site
● Sample count: the number of sessions that
viewed the experiment
○ sum(y^0)

● Mean: sum of the metric / sample count
○ sum(y^1)/sum(y^0)
Statistics Primer (cont.)
● Variance:

○ Variance = mean of square minus square of mean
○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0)) ^ 2

For each side of an experiment we only need to
accumulate three values: sum(y^0), sum(y^1), sum(y^2)
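The rollup jobs themselves live in the repo rather than on the slides, but the primer maps directly onto a mapper/reducer pair. A minimal sketch of the by-account rollup, assuming the input field order from the earlier slide, one campaign per row, and my own variable names:

def mapper(self, key, line):
    # account_id, campaigns_viewed, user_id, purchased?, start, end
    account_id, campaigns_viewed, user_id, purchased, start, end = line.split(',')
    session_length = int(end) - int(start)
    for metric, y in [('average session length', session_length),
                      ('conversion rate', int(purchased))]:
        # emit the three power sums y^0, y^1, y^2 for this session
        yield (account_id, metric), (1, y, y * y)

def reducer(self, key, triples):
    # component-wise totals give sum(y^0), sum(y^1), sum(y^2)
    n = s = s2 = 0
    for y0, y1, y2 in triples:
        n += y0
        s += y1
        s2 += y2
    yield key, (n, s, s2)

The by-experiment variant would also add each viewed campaign to the key. As a sanity check against the sample output that follows: for account "8", mean session length = 24463/99 ≈ 247 seconds, and variance = 7968891/99 - (24463/99)^2 ≈ 19435.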
Statistics by Account
statistic_rollup/statistic_summarize.py
Sample Output
["8", "average session length"] [99, 24463, 7968891]
["8", "conversion rate"] [99, 45, 45]
["9", "average session length"] [115, 29515, 10071591]
["9", "conversion rate"] [115, 55, 55]
Statistics by Experiment
statistic_rollup_by_experiment/statistic_summarize.py
Sample Output
["9", 0, "average session length"] [32, 8405, 3031009]
["9", 0, "conversion rate"] [32, 20, 20]
["9", 1, "average session length"] [23, 5405, 1770785]
["9", 1, "conversion rate"] [23, 14, 14]
["9", 2, "average session length"] [39, 9481, 2965651]
["9", 2, "conversion rate"] [39, 20, 20]
["9", 3, "average session length"] [25, 6276, 2151014]
["9", 3, "conversion rate"] [25, 13, 13]
["9", 4, "average session length"] [27, 5721, 1797715]
["9", 4, "conversion rate"] [27, 16, 16]
Questions?
