Map Reduce: Beyond Word Count

This talk was prepared for the November 2013 DataPhilly Meetup: Data in Practice (http://www.meetup.com/DataPhilly/events/149515412/).

Map Reduce: Beyond Word Count by Jeff Patti

Have you ever wondered what MapReduce can be used for beyond the word count example you see in all the introductory articles? Using Python and mrjob, this talk covers a few simple MapReduce algorithms that, in part, power Monetate's information pipeline.

Bio: Jeff Patti is a backend engineer at Monetate with a passion for algorithms, big data, and long walks on the beach. Prior to working at Monetate he performed software R&D for Lockheed Martin, where he worked on projects ranging from social network analysis to robotics.

  1. MapReduce: Beyond Word Count (Jeff Patti, https://github.com/jepatti/mrjob_recipes)
  2. What is MapReduce? “MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.” - Wikipedia
     Map: given a line of a file, yield key: value pairs.
     Reduce: given a key and all values with that key from the prior map phase, yield key: value pairs.
  3. Word Count. Problem: count the frequencies of words in documents.
  4. Word Count Using mrjob

         def mapper(self, key, line):
             for word in line.split():
                 yield word, 1

         def reducer(self, word, occurrences):
             yield word, sum(occurrences)
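     For context, the two methods above live inside an mrjob job class. A minimal runnable sketch with the imports and boilerplate the slide omits (the class name MRWordCount is ours, not the slide's):

         from mrjob.job import MRJob

         class MRWordCount(MRJob):

             def mapper(self, key, line):
                 # key is None for raw text input; emit (word, 1) per token
                 for word in line.split():
                     yield word, 1

             def reducer(self, word, occurrences):
                 # occurrences iterates over every 1 emitted for this word
                 yield word, sum(occurrences)

         if __name__ == '__main__':
             MRWordCount.run()

     Run locally with, e.g., python word_count.py input.txt; mrjob splits the input across mappers and routes all counts for a given word to a single reducer call.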
  5. Sample Output

         "ligula" 4
         "ligula." 2
         "lorem" 5
         "lorem." 4
         "luctus" 3
         "magna" 5
         "magna," 3
         "magnis" 1

     (Note that split() does not strip punctuation, so "lorem" and "lorem." are counted separately.)
  6. Monetate Background
     ● Core products are merchandising, personalization, testing, etc.
     ● A/B and multivariate testing to determine the impact of experiments
     ● Involved with >20% of ecommerce spend each holiday season for the past 2 years running
  7. Monetate Stack
     ● Distributed across multiple availability zones and regions for redundancy, scaling, and lower round-trip times
     ● Real-time decision engine using MySQL
     ● Nightly processing of each day's data via Hadoop, using mrjob, a Python library for writing MapReduce jobs
  8. Beyond Word Count
     ● Activity stream sessionization
     ● Product recommendations
     ● User behavior statistics
  9. Activity Stream Sessionization
     Goal: collate user activity, splitting it into separate sessions whenever the user is inactive for more than 5 minutes.
     Input format: timestamp, user_id
  10. Collate user activity

         def mapper(self, key, line):
             timestamp, user_id = line.split()
             yield user_id, timestamp

         def reducer(self, uid, timestamps):
             yield uid, sorted(timestamps)
  11. Sample Output

         "998" ["1384389407", "1384389417", "1384389422", "1384389425", "1384390407", "1384390417", "1384391416", "1384392410", "1384392416", "1384395420", "1384396405"]
         "999" ["1384388414", "1384388425", "1384389419", "1384389420", "1384390420", "1384391415", "1384391418", "1384393413", "1384393425", "1384394426", "1384395416", "1384396415", "1384396422"]
  12. Segment into Sessions

         MAX_SESSION_INACTIVITY = 60 * 5
         ...
         def reducer(self, uid, timestamps):
             # cast to int so the gap arithmetic below works
             timestamps = sorted(int(t) for t in timestamps)
             start_index = 0
             for index, timestamp in enumerate(timestamps):
                 if index > 0:
                     if timestamp - timestamps[index - 1] > MAX_SESSION_INACTIVITY:
                         yield uid, timestamps[start_index:index]
                         start_index = index
             yield uid, timestamps[start_index:]
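     Putting slides 10 and 12 together, a complete job might look like the following sketch (class name ours; timestamps are cast to int in the mapper so the gap arithmetic works):

         from mrjob.job import MRJob

         MAX_SESSION_INACTIVITY = 60 * 5  # five minutes, in seconds

         class MRSessionize(MRJob):

             def mapper(self, key, line):
                 # key on user_id so all of a user's activity meets in one reducer
                 timestamp, user_id = line.split()
                 yield user_id, int(timestamp)

             def reducer(self, uid, timestamps):
                 timestamps = sorted(timestamps)
                 start_index = 0
                 for index, timestamp in enumerate(timestamps):
                     # a gap of more than 5 minutes closes the current session
                     if index > 0 and timestamp - timestamps[index - 1] > MAX_SESSION_INACTIVITY:
                         yield uid, timestamps[start_index:index]
                         start_index = index
                 # flush the final session
                 yield uid, timestamps[start_index:]

         if __name__ == '__main__':
             MRSessionize.run()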
  13. Sample Output

         "999" [1384388414, 1384388425]
         "999" [1384389419, 1384389420]
         "999" [1384390420]
         "999" [1384391415, 1384391418]
         "999" [1384393413, 1384393425]
         "999" [1384394426]
         "999" [1384395416]
         "999" [1384396415, 1384396422]
  14. Product Recommendations
     Goal: for each product a client sells, generate a ‘people who bought this also bought that’ recommendation.
     Input: product_id_1, product_id_2, ...
  15. Coincident Purchase Frequency

         from itertools import permutations

         def mapper(self, key, line):
             purchases = set(line.split(','))
             for p1, p2 in permutations(purchases, 2):
                 yield (p1, p2), 1

         def reducer(self, pair, occurrences):
             p1, p2 = pair
             yield p1, (p2, sum(occurrences))
  16. Sample output

         "8" ["5", 11]
         "8" ["6", 19]
         "8" ["7", 14]
         "8" ["9", 11]
         "9" ["1", 20]
         "9" ["10", 22]
         "9" ["11", 21]
         "9" ["12", 13]
  17. Top Recommendations

         def reducer(self, purchase_pair, occurrences):
             p1, p2 = purchase_pair
             yield p1, (sum(occurrences), p2)

         def reducer_find_best_recos(self, p1, p2_occurrences):
             top_products = sorted(p2_occurrences, reverse=True)[:5]
             top_products = [p2 for occurrences, p2 in top_products]
             yield p1, top_products

         def steps(self):
             return [self.mr(mapper=self.mapper, reducer=self.reducer),
                     self.mr(reducer=self.reducer_find_best_recos)]
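     Assembled into one job, the two steps above chain a frequency-counting phase into a top-5 selection phase. A sketch using MRStep, the spelling newer mrjob versions use in place of self.mr (class name ours):

         from itertools import permutations

         from mrjob.job import MRJob
         from mrjob.step import MRStep

         class MRTopRecommendations(MRJob):

             def mapper(self, key, line):
                 # every ordered pair of products purchased together
                 purchases = set(line.split(','))
                 for p1, p2 in permutations(purchases, 2):
                     yield (p1, p2), 1

             def reducer(self, purchase_pair, occurrences):
                 # put the count first so the next step can sort on it directly
                 p1, p2 = purchase_pair
                 yield p1, (sum(occurrences), p2)

             def reducer_find_best_recos(self, p1, p2_occurrences):
                 # keep the five most frequent co-purchases
                 top_products = sorted(p2_occurrences, reverse=True)[:5]
                 yield p1, [p2 for occurrences, p2 in top_products]

             def steps(self):
                 return [MRStep(mapper=self.mapper, reducer=self.reducer),
                         MRStep(reducer=self.reducer_find_best_recos)]

         if __name__ == '__main__':
             MRTopRecommendations.run()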
  18. Sample Output

         "7" ["15", "18", "17", "16", "3"]
         "8" ["14", "15", "20", "6", "3"]
         "9" ["15", "17", "19", "6", "3"]
  19. Top Recommendations, Multi-Account

         def mapper(self, key, line):
             account_id, purchases = line.split()
             purchases = set(purchases.split(','))
             for p1, p2 in permutations(purchases, 2):
                 yield (account_id, p1, p2), 1

         def reducer(self, purchase_pair, occurrences):
             account_id, p1, p2 = purchase_pair
             yield (account_id, p1), (sum(occurrences), p2)

     The 2nd-step reducer is unchanged.
  20. Sample Output

         ["9", "20"] ["8", "14", "13", "10", "1"]
         ["9", "3"] ["2", "4", "16", "11", "17"]
         ["9", "4"] ["3", "18", "11", "16", "15"]
         ["9", "5"] ["2", "1", "7", "18", "17"]
         ["9", "6"] ["12", "3", "2", "17", "16"]
         ["9", "7"] ["18", "5", "17", "1", "9"]
         ["9", "8"] ["20", "14", "13", "10", "4"]
         ["9", "9"] ["18", "7", "6", "5", "4"]
  21. User Behavior Statistics
     Goal: compute statistics about user behavior (conversion rate and time on site) by account and experiment in an efficient manner.
     Input: account_id, campaigns_viewed, user_id, purchased?, session_start_time, session_end_time
  22. Statistics Primer. With the sample count, mean, and variance for each side of an experiment, we can compute all the statistics our analytics package displays.
  23. Statistics Primer (cont.)
     Let y be a session's metric value, e.g. time on site.
     ● Sample count: the number of sessions that viewed the experiment
       ○ sum(y^0)
     ● Mean: the metric total divided by the sample count
       ○ sum(y^1) / sum(y^0)
  24. Statistics Primer (cont.)
     ● Variance: the mean of the squares minus the square of the mean
       ○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0))^2
     For each side of an experiment we only need to generate three sums: sum(y^0), sum(y^1), sum(y^2).
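     Concretely, those three running sums are all a summarize step needs. A small helper illustrating the recovery of mean and variance (the function name is ours):

         def summarize(y0, y1, y2):
             # y0 = sample count, y1 = sum of values, y2 = sum of squares
             mean = y1 / y0
             variance = y2 / y0 - mean ** 2
             return y0, mean, variance

     For example, summarize(99, 24463, 7968891) recovers the session-length mean and variance for account "8" from the sample output on slide 26.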
  25. Statistics by account: statistic_rollup/statistic_summarize.py
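     The file above lives in the talk's recipes repo and is not reproduced on the slide. A hypothetical rollup sketch consistent with the sample output below, assuming the slide-21 input columns and a 0/1 purchased flag (both assumptions; class name ours):

         from mrjob.job import MRJob

         class MRStatsByAccount(MRJob):

             def mapper(self, key, line):
                 account_id, campaigns, user_id, purchased, start, end = line.split(',')
                 y = int(end) - int(start)  # time on site, in seconds
                 yield (account_id, 'average session length'), (1, y, y * y)
                 c = int(purchased)  # 0 or 1, so sum(c) == sum(c^2)
                 yield (account_id, 'conversion rate'), (1, c, c * c)

             def reducer(self, key, triples):
                 # element-wise sum of (sum(y^0), sum(y^1), sum(y^2))
                 count = total = total_sq = 0
                 for y0, y1, y2 in triples:
                     count += y0
                     total += y1
                     total_sq += y2
                 yield key, (count, total, total_sq)

         if __name__ == '__main__':
             MRStatsByAccount.run()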
  26. Sample Output

         ["8", "average session length"] [99, 24463, 7968891]
         ["8", "conversion rate"] [99, 45, 45]
         ["9", "average session length"] [115, 29515, 10071591]
         ["9", "conversion rate"] [115, 55, 55]
  27. Statistics by experiment: statistic_rollup_by_experiment/statistic_summarize.py
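     Again the repo file is not shown on the slide; judging by the sample output below, the change from the per-account job is a wider key that includes the experiment. A hypothetical mapper sketch, assuming campaigns_viewed is a delimited list of experiment ids (the delimiter is an assumption):

         def mapper(self, key, line):
             account_id, campaigns, user_id, purchased, start, end = line.split(',')
             y = int(end) - int(start)
             c = int(purchased)
             for experiment in campaigns.split(';'):  # delimiter is hypothetical
                 yield (account_id, int(experiment), 'average session length'), (1, y, y * y)
                 yield (account_id, int(experiment), 'conversion rate'), (1, c, c * c)

     The reducer would be unchanged from the per-account version.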
  28. Sample Output

         ["9", 0, "average session length"] [32, 8405, 3031009]
         ["9", 0, "conversion rate"] [32, 20, 20]
         ["9", 1, "average session length"] [23, 5405, 1770785]
         ["9", 1, "conversion rate"] [23, 14, 14]
         ["9", 2, "average session length"] [39, 9481, 2965651]
         ["9", 2, "conversion rate"] [39, 20, 20]
         ["9", 3, "average session length"] [25, 6276, 2151014]
         ["9", 3, "conversion rate"] [25, 13, 13]
         ["9", 4, "average session length"] [27, 5721, 1797715]
         ["9", 4, "conversion rate"] [27, 16, 16]
  29. Questions?
