Map Reduce: Beyond Word Count by Jeff Patti

Have you ever wondered what MapReduce can be used for beyond the word-count example you see in all the introductory articles? Using Python and mrjob, this talk covers a few simple MapReduce algorithms that in part power Monetate's information pipeline.

Bio: Jeff Patti is a backend engineer at Monetate with a passion for algorithms, big data, and long walks on the beach. Prior to working at Monetate he performed software R&D for Lockheed Martin, where he worked on projects ranging from social network analysis to robotics.


- 1. MapReduce: Beyond Word Count Jeff Patti https://github.com/jepatti/mrjob_recipes
- 2. What is MapReduce?
  "MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster." - Wikipedia
  Map: given a line of a file, yield key: value pairs
  Reduce: given a key and all values with that key from the prior map phase, yield key: value pairs
- 3. Word Count Problem: count frequencies of words in documents
- 4. Word Count Using mrjob
  def mapper(self, key, line):
      for word in line.split():
          yield word, 1
  def reducer(self, word, occurrences):
      yield word, sum(occurrences)
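The mrjob mapper and reducer above can be sketched as a plain-Python simulation that needs no Hadoop or mrjob at all; the `shuffle` helper here is an assumption standing in for the framework's group-by-key step between the two phases:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every whitespace-separated word.
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    # Stand-in for the framework's shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, occurrences):
    # Reduce phase: sum the counts emitted for each word.
    yield word, sum(occurrences)

def word_count(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    return {word: count
            for key, values in shuffle(mapped).items()
            for word, count in reducer(key, values)}

counts = word_count(["lorem ipsum lorem", "ipsum lorem"])
# counts == {"lorem": 3, "ipsum": 2}
```

The same mapper/shuffle/reducer skeleton carries through every later example in the deck; only the two yielded pair shapes change.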
- 5. Sample Output "ligula" 4 "ligula." 2 "lorem" 5 "lorem." 4 "luctus" 3 "magna" 5 "magna," 3 "magnis" 1
- 6. Monetate Background ● Core products are merchandising, personalization, testing, etc. ● A/B & Multivariate testing to determine impact of experiments ● Involved with >20% of ecommerce spend each holiday season for the past 2 years running
- 7. Monetate Stack ● Distributed across multiple availability zones and regions for redundancy, scaling, and lower round trip times ● Real time decision engine using MySQL ● Nightly processing of each day's data via Hadoop using mrjob, a Python library for writing MapReduce jobs
- 8. Beyond Word Count ● Activity stream sessionization ● Product recommendations ● User behavior statistics
- 9. Activity Stream Sessionization Goal: collate user activity, splitting into different sessions if user inactive for more than 5 minutes Input format: timestamp, user_id
- 10. Collate user activity
  def mapper(self, key, line):
      timestamp, user_id = line.split()
      yield user_id, timestamp
  def reducer(self, uid, timestamps):
      yield uid, sorted(timestamps)
- 11. Sample Output "998" ["1384389407", "1384389417", "1384389422", "1384389425", "1384390407", "1384390417", "1384391416", "1384392410", "1384392416", "1384395420", "1384396405"] "999" ["1384388414", "1384388425", "1384389419", "1384389420", "1384390420", "1384391415", "1384391418", "1384393413", "1384393425", "1384394426", "1384395416", "1384396415", "1384396422"]
- 12. Segment into Sessions
  MAX_SESSION_INACTIVITY = 60 * 5
  ...
  def reducer(self, uid, timestamps):
      # Timestamps arrive as strings from the mapper; convert before differencing.
      timestamps = sorted(int(t) for t in timestamps)
      start_index = 0
      for index, timestamp in enumerate(timestamps):
          if index > 0:
              if timestamp - timestamps[index - 1] > MAX_SESSION_INACTIVITY:
                  yield uid, timestamps[start_index:index]
                  start_index = index
      yield uid, timestamps[start_index:]
- 13. Sample Output "999"[1384388414, 1384388425] "999"[1384389419, 1384389420] "999"[1384390420] "999"[1384391415, 1384391418] "999"[1384393413, 1384393425] "999"[1384394426] "999"[1384395416] "999"[1384396415, 1384396422]
- 14. Product Recommendations Goal: For each product a client sells, generate a ‘people who bought this also bought this’ recommendation Input: product_id_1, product_id_2, ...
- 15. Coincident Purchase Frequency
  def mapper(self, key, line):
      purchases = set(line.split(','))
      for p1, p2 in permutations(purchases, 2):
          yield (p1, p2), 1
  def reducer(self, pair, occurrences):
      p1, p2 = pair
      yield p1, (p2, sum(occurrences))
- 16. Sample output "8" ["5", 11] "8" ["6", 19] "8" ["7", 14] "8" ["9", 11] "9" ["1", 20] "9" ["10", 22] "9" ["11", 21] "9" ["12", 13]
- 17. Top Recommendations
  def reducer(self, purchase_pair, occurrences):
      p1, p2 = purchase_pair
      yield p1, (sum(occurrences), p2)
  def reducer_find_best_recos(self, p1, p2_occurrences):
      top_products = sorted(p2_occurrences, reverse=True)[:5]
      top_products = [p2 for occurrences, p2 in top_products]
      yield p1, top_products
  def steps(self):
      return [self.mr(mapper=self.mapper, reducer=self.reducer),
              self.mr(reducer=self.reducer_find_best_recos)]
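The second reduce step can be sketched locally like this, taking the pair counts from the previous step as a dict (an assumed in-memory stand-in for the shuffled job output). Sorting `(occurrences, product)` tuples in reverse ranks by count first, exactly as the slide's one-liner does:

```python
from collections import defaultdict

def top_recommendations(pair_counts, n=5):
    # pair_counts maps (product, co_product) -> co-purchase count.
    # For each product, keep the n most frequent co-purchases.
    by_product = defaultdict(list)
    for (p1, p2), occurrences in pair_counts.items():
        by_product[p1].append((occurrences, p2))
    return {p1: [p2 for _, p2 in sorted(pairs, reverse=True)[:n]]
            for p1, pairs in by_product.items()}

pair_counts = {("8", "5"): 11, ("8", "6"): 19,
               ("8", "7"): 14, ("8", "9"): 11}
recos = top_recommendations(pair_counts, n=2)
# recos == {"8": ["6", "7"]}
```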
- 18. Sample Output
  "7" ["15", "18", "17", "16", "3"]
  "8" ["14", "15", "20", "6", "3"]
  "9" ["15", "17", "19", "6", "3"]
- 19. Top Recommendations Multi Account
  def mapper(self, key, line):
      account_id, purchases = line.split()
      purchases = set(purchases.split(','))
      for p1, p2 in permutations(purchases, 2):
          yield (account_id, p1, p2), 1
  def reducer(self, purchase_pair, occurrences):
      account_id, p1, p2 = purchase_pair
      yield (account_id, p1), (sum(occurrences), p2)
  2nd step reducer unchanged
- 20. Sample Output
  ["9", "20"] ["8", "14", "13", "10", "1"]
  ["9", "3"] ["2", "4", "16", "11", "17"]
  ["9", "4"] ["3", "18", "11", "16", "15"]
  ["9", "5"] ["2", "1", "7", "18", "17"]
  ["9", "6"] ["12", "3", "2", "17", "16"]
  ["9", "7"] ["18", "5", "17", "1", "9"]
  ["9", "8"] ["20", "14", "13", "10", "4"]
  ["9", "9"] ["18", "7", "6", "5", "4"]
- 21. User Behavior Statistics Goal: compute statistics about user behavior (conversion rate & time on site) by account and experiment in an efficient manner Input: account_id, campaigns_viewed, user_id, purchased?, session_start_time, session_end_time
- 22. Statistics Primer With sample count, mean, and variance for each side of an experiment we can compute all the statistics our analytics package displays
- 23. Statistics Primer (cont.)
  y = a session's metric value, e.g. time on site
  ● Sample count: count the number of sessions that viewed the experiment
    ○ sum(y^0)
  ● Mean: sum of the metric / sample count
    ○ sum(y^1)/sum(y^0)
- 24. Statistics Primer (cont.)
  ● Variance: mean of square minus square of mean
    ○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0))^2
  For each side of an experiment we only need to generate: sum(y^0), sum(y^1), sum(y^2)
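The identity can be checked numerically with a small sketch: accumulate the three power sums the slides describe, then derive the mean and (population) variance from them alone. This matches the three-number-per-side rollup format in the sample output that follows:

```python
def summarize(values):
    # The only per-side state the job needs to carry: three power sums.
    s0 = sum(v ** 0 for v in values)  # sample count
    s1 = sum(v ** 1 for v in values)  # sum of the metric
    s2 = sum(v ** 2 for v in values)  # sum of squares
    return s0, s1, s2

def mean_and_variance(s0, s1, s2):
    # Mean of square minus square of mean, per the slide.
    mean = s1 / s0
    variance = s2 / s0 - mean ** 2
    return mean, variance

s0, s1, s2 = summarize([2, 4, 6])
# (s0, s1, s2) == (3, 12, 56); mean == 4.0; variance == 8/3
```

Because sums are associative, each reducer can emit partial `(s0, s1, s2)` triples that later stages simply add elementwise, which is what makes this formulation MapReduce-friendly.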
- 25. Statistics by account statistic_rollup/statistic_summarize.py
- 26. Sample Output ["8", "average session length"] [99, 24463, 7968891] ["8", "conversion rate"] [99, 45, 45] ["9", "average session length"] [115, 29515, 10071591] ["9", "conversion rate"] [115, 55, 55]
- 27. Statistics by experiment statistic_rollup_by_experiment/statistic_summarize.py
- 28. Sample Output ["9", 0, "average session length"] [32, 8405, 3031009] ["9", 0, "conversion rate"] [32, 20, 20] ["9", 1, "average session length"] [23, 5405, 1770785] ["9", 1, "conversion rate"] [23, 14, 14] ["9", 2, "average session length"] [39, 9481, 2965651] ["9", 2, "conversion rate"] [39, 20, 20] ["9", 3, "average session length"] [25, 6276, 2151014] ["9", 3, "conversion rate"] [25, 13, 13] ["9", 4, "average session length"] [27, 5721, 1797715] ["9", 4, "conversion rate"] [27, 16, 16]
- 29. Questions?
