# Computational Social Science, Lecture 04: Counting at Scale, Part II

Published in: Education

1. Counting @ Scale, Part II. Sharad Goel, Columbia University. Computational Social Science: Lecture 4. February 15, 2013.
2. Descriptive statistics (as opposed to inferential statistics) is about counting: contingency tables; means, variances, quantiles; summaries of conditional distributions.
3. Counting @ scale: conceptually easy, computationally hard.
4. MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI, 2004.
5. Map: assign each input line to one or more groups. Shuffle: aggregate groups. Reduce: operate on grouped data.
6. Map: assign each input line to one or more groups, v → [(k1, v1), …, (km, vm)]. Shuffle: aggregate groups. Reduce: operate on grouped data, (k, [v1, …, vn]) → [w1, …, wp].
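
The map, shuffle, and reduce signatures above can be sketched in a single process. This is only an illustration under stated assumptions: `map_reduce` is a hypothetical helper, not an API from the MapReduce paper or from Hadoop, and the toy `views` list borrows the (user, movie) shape used later in the lecture.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable


def map_reduce(
    records: Iterable,
    mapper: Callable[[object], Iterable[tuple[Hashable, object]]],
    reducer: Callable[[Hashable, list], Iterable],
) -> list:
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        # Map: each input v yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key.
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        # Reduce: each (k, [v1, ..., vn]) group yields zero or more outputs.
        output.extend(reducer(key, values))
    return output


# Hypothetical toy input: (user, movie) views.
views = [(23, 829), (789, 24), (234, 5678), (7, 24)]

# Count views per movie: key := movie, value := 1, reduce := sum.
view_counts = map_reduce(
    views,
    mapper=lambda view: [(view[1], 1)],
    reducer=lambda movie, ones: [(movie, sum(ones))],
)
print(sorted(view_counts))
```

In a real cluster the shuffle moves data between machines; here it is just a dictionary of lists, which is enough to show the shape of the computation.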
7. Group Average. Input: views (user, movie, rating). Output: average (mean & median) by movie.
8. Group Average. Map: identity (key := movie). Reduce: movie group → average.
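
A minimal sketch of the group average, assuming the ratings fit in memory. The `ratings` records and the name `group_average` are hypothetical; the slides give no sample data for this step.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (user, movie, rating) records.
ratings = [
    (1, "A", 5), (2, "A", 3), (3, "A", 4),
    (1, "B", 2), (2, "B", 4),
]


def group_average(records):
    # Map + shuffle: identity map keyed by movie, ratings grouped per movie.
    by_movie = defaultdict(list)
    for _user, movie, rating in records:
        by_movie[movie].append(rating)
    # Reduce: one (mean, median) summary per movie group.
    return {movie: (mean(rs), median(rs)) for movie, rs in by_movie.items()}


averages = group_average(ratings)
print(averages)
```

The mean could also be computed from running sums without holding the group in memory; the median generally needs the whole group, which is why grouping first makes it easy.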
9. The Insight of MapReduce. One can efficiently group identical items. Many tasks are computationally easier on grouped data.
10. Filter. Input: arbitrary data & filter condition. Output: subset of input data satisfying condition.
11. Filter. Map: input → input if condition(input) else pass. Reduce: identity.
12. Distinct. Input: set of items. Output: subset of distinct items.
13. Distinct. Map: identity. Reduce: grouped items → single item from group.
14. Sample. Input: set of items & sample probability p. Output: random subset of items.
15. Sample. Map: input → input if rand(0,1) < p else pass. Reduce: identity.
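
The filter, distinct, and sample patterns above can be sketched as plain functions. The function names and the `movies` list are hypothetical, and the random generator is passed in explicitly so that runs can be seeded.

```python
import random
from collections import defaultdict


def filter_items(items, condition):
    # Map-only: emit the input if it satisfies the condition, else nothing.
    return [item for item in items if condition(item)]


def distinct(items):
    # Map: identity, keyed by the item itself. Shuffle: group equal items.
    groups = defaultdict(list)
    for item in items:
        groups[item].append(item)
    # Reduce: emit a single representative from each group.
    return [group[0] for group in groups.values()]


def sample(items, p, rng):
    # Map-only: keep each input independently with probability p.
    return [item for item in items if rng.random() < p]


# Hypothetical input: a list of movie ids with repeats.
movies = [24, 829, 24, 5678, 24, 829]
print(filter_items(movies, lambda m: m > 100))
print(distinct(movies))
print(sample(movies, 0.5, random.Random(0)))
```

Filter and sample never need to compare records with each other, so they need no shuffle; distinct does, because equal items may start out on different machines.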
16. Sort. Input: set of items (and a key). Output: ordered set of items.
17. Sort. Map: identity, with all data assigned to the same key. Reduce: identity. *All the work happens in the shuffle.
18. Sort. Map: identity, with key := first letter of line. Reduce: identity. *All the work happens in the shuffle.
19. Sort. Sample: generate a small sample of the data (with MapReduce). Determine breakpoints: sort the sample and compute percentiles.
20. Sort. Map: identity, with key determined by breakpoints. Reduce: identity. *Most of the work happens in the shuffle.
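
The breakpoint-based sort can be sketched as follows. This is a single-process illustration under assumptions: `sample_sort`, its parameters, and the evenly spaced quantile rule for choosing breakpoints are illustrative choices, not details given in the slides.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def sample_sort(items, num_partitions, sample_prob, rng):
    # Step 1: draw a small random sample (a map-only job in MapReduce).
    sampled = sorted(x for x in items if rng.random() < sample_prob)
    # Step 2: pick breakpoints at evenly spaced quantiles of the sample.
    breakpoints = []
    if sampled:
        for i in range(1, num_partitions):
            breakpoints.append(sampled[i * len(sampled) // num_partitions])
    # Step 3 (map): key each item by the partition its value falls in.
    partitions = defaultdict(list)
    for x in items:
        partitions[bisect_right(breakpoints, x)].append(x)
    # Step 4 (shuffle/reduce): sort each partition locally, then read the
    # partitions in key order to get a globally sorted result.
    result = []
    for key in sorted(partitions):
        result.extend(sorted(partitions[key]))
    return result


data = random.Random(1).sample(range(1000), 200)
sorted_data = sample_sort(data, num_partitions=4, sample_prob=0.1,
                          rng=random.Random(2))
print(sorted_data[:5])
```

Because the breakpoints come from a sample of the data itself, the partitions tend to be of similar size, which avoids the skew that a fixed key such as the first letter of the line can produce.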
21. Combining data. Example: for each user, want to compute the average popularity of the movies they watch. Problem: one file contains views (user, movie); another file contains popularity (movie, rank).
22. Joins. Views (User, Movie): (23, 829), (789, 24), (234, 5678), (7, 24). Popularity (Movie, Rank): (5678, 4), (24, 100), (829, 34). Joined (User, Movie, Rank): (23, 829, 34), (789, 24, 100), (234, 5678, 4), (7, 24, 100).
23. Nested-Loop Joins. For each user in users: for each movie in movies: if user.movie_id == movie.id: output user.id, movie.id, movie.rank.
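
A runnable sketch of the nested-loop join, using the views and popularity tables from the joins example. The tuple layout and the name `nested_loop_join` are assumptions made for illustration.

```python
views = [(23, 829), (789, 24), (234, 5678), (7, 24)]   # (user, movie)
popularity = [(5678, 4), (24, 100), (829, 34)]          # (movie, rank)


def nested_loop_join(views, popularity):
    joined = []
    for user, viewed_movie in views:          # outer loop: every view
        for movie, rank in popularity:        # inner loop: every movie
            if viewed_movie == movie:
                joined.append((user, movie, rank))
    return joined


print(nested_loop_join(views, popularity))
```

The cost is the product of the two table sizes, which is what makes this approach impractical at scale.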
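24. Sort-Merge Joins. Views sorted by movie (User, Movie): (789, 24), (7, 24), (23, 829), (234, 5678). Popularity sorted by movie (Movie, Rank): (24, 100), (829, 34), (5678, 4). Joined (User, Movie, Rank): (789, 24, 100), (7, 24, 100), (23, 829, 34), (234, 5678, 4).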
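
A sketch of a sort-merge join on the same two tables. It assumes each movie appears at most once in the popularity table, as it does in the example; the name `sort_merge_join` is hypothetical.

```python
views = [(23, 829), (789, 24), (234, 5678), (7, 24)]   # (user, movie)
popularity = [(5678, 4), (24, 100), (829, 34)]          # (movie, rank)


def sort_merge_join(views, popularity):
    # Sort both tables on the join key (movie).
    left = sorted(views, key=lambda v: v[1])
    right = sorted(popularity, key=lambda p: p[0])
    joined, i, j = [], 0, 0
    # Merge: advance whichever side has the smaller key.
    while i < len(left) and j < len(right):
        user, viewed_movie = left[i]
        movie, rank = right[j]
        if viewed_movie < movie:
            i += 1
        elif viewed_movie > movie:
            j += 1
        else:
            joined.append((user, movie, rank))
            i += 1  # keep j in place: the next view may be the same movie
    return joined


print(sort_merge_join(views, popularity))
```

After sorting, the merge is a single pass over both tables, so the total cost is dominated by the two sorts.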
25. Hash Joins. Views (User, Movie): (23, 829), (789, 24), (234, 5678), (7, 24). Popularity (Movie, Rank): (5678, 4), (24, 100), (829, 34).
26. Distributed Joins. Map: reduce key := hash(join key). Reduce: local (sort-merge) join. *Also need to keep track of which table is the left and which is the right.
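
The distributed join can be simulated in one process. In this sketch the "L" and "R" tags, the `num_reducers` parameter, and the name `distributed_join` are assumptions; the local join inside each reducer is a simple per-key pairing rather than the sort-merge join the slide mentions.

```python
from collections import defaultdict

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]   # (user, movie)
popularity = [(5678, 4), (24, 100), (829, 34)]          # (movie, rank)


def distributed_join(views, popularity, num_reducers=3):
    # Map: tag each record with its source table, route it by hash(join key).
    reducers = defaultdict(list)
    for user, movie in views:
        reducers[hash(movie) % num_reducers].append((movie, "L", user))
    for movie, rank in popularity:
        reducers[hash(movie) % num_reducers].append((movie, "R", rank))

    joined = []
    for records in reducers.values():
        # Reduce: a reducer holds every record for its keys, from both tables.
        left, right = defaultdict(list), defaultdict(list)
        for movie, tag, value in records:
            (left if tag == "L" else right)[movie].append(value)
        # Local join: pair each left value with each right value per key.
        for movie, users in left.items():
            for user in users:
                for rank in right.get(movie, []):
                    joined.append((user, movie, rank))
    return joined


print(sorted(distributed_join(views, popularity)))
```

Routing by the hash of the join key guarantees that matching rows from both tables land on the same reducer, and the tag is what lets the reducer tell a user id from a rank.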
27. Joins: { inner, left, right, outer }.
28. Table (User, Sex): (23, male), (789, female), (234, female), (7, male), (26, male), (567, female), (2, female). Table (User, Movie): (23, 829), (789, 24), (234, 5678), (7, 24), (789, 90), (23, 758), (23, 39), (2, 782).
29. Table (User, Sex): (23, male), (789, female), (234, female), (7, male), (26, male), (567, female), (2, female). Table (User, Activity): (23, 3), (789, 2), (234, 1), (7, 1), (2, 1).
30. Left Join. The same (User, Sex) and (User, Activity) tables, joined on User, give (User, Sex, Activity): (23, male, 3), (789, female, 2), (234, female, 1), (7, male, 1), (26, male, empty), (567, female, empty), (2, female, 1).
31. Inner Join. The same two tables, joined on User, give (User, Sex, Activity): (23, male, 3), (789, female, 2), (234, female, 1), (7, male, 1), (2, female, 1).
32. Inner join: returns pairs of rows in tables A & B that match the join condition. Left (outer) join: returns all rows from an inner join plus rows in the left table that do not match the right table. Full (outer) join: returns all rows from an inner join plus rows in either table that do not match the other.
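
The three join types can be compared on the (User, Sex) and (User, Activity) example. This sketch assumes each user appears at most once per table, so a dictionary keyed by user is enough; user 999 is a hypothetical addition so that the full join has an unmatched right-hand row to show, and `None` stands in for a missing value.

```python
sex = {23: "male", 789: "female", 234: "female", 7: "male",
       26: "male", 567: "female", 2: "female"}
# User 999 is hypothetical: it has activity but no row in the left table.
activity = {23: 3, 789: 2, 234: 1, 7: 1, 2: 1, 999: 5}


def inner_join(left, right):
    # Only keys present in both tables.
    return [(k, left[k], right[k]) for k in left if k in right]


def left_join(left, right):
    # Every left row; None where the right table has no match.
    return [(k, v, right.get(k)) for k, v in left.items()]


def full_join(left, right):
    # Left join plus right rows that have no match in the left table.
    rows = left_join(left, right)
    rows += [(k, None, v) for k, v in right.items() if k not in left]
    return rows


print(inner_join(sex, activity))
print(left_join(sex, activity))
print(full_join(sex, activity))
```

A right join is the mirror image of the left join: `left_join(right, left)` with the value columns swapped back.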
33. Map-side Joins. Map: load the (smaller) table into memory, stream through the (larger) table and find matches. Reduce: identity.
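
A sketch of the map-side join, assuming the popularity table is the smaller one and fits in memory. The generator name `map_side_join` and the extra view of movie 999 are hypothetical; the latter is there to show that unmatched rows are dropped.

```python
# (user, movie); the last view is hypothetical and has no popularity row.
views = [(23, 829), (789, 24), (234, 5678), (7, 24), (5, 999)]
popularity = [(5678, 4), (24, 100), (829, 34)]          # (movie, rank)


def map_side_join(large_views, small_popularity):
    # Load the smaller table into memory once, as a movie -> rank lookup.
    rank_by_movie = dict(small_popularity)
    # Stream through the larger table, emitting a row for each match.
    for user, movie in large_views:
        if movie in rank_by_movie:
            yield (user, movie, rank_by_movie[movie])


print(list(map_side_join(views, popularity)))
```

Since every mapper holds its own copy of the small table, no shuffle is needed, which is why this counts as a map-only operation.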
34. MapReduce Ops. Map-only: filter, sample, map-side joins. Map & Reduce: groupby, distinct, sort, join.
35. The long tail. Input: (user, movie) views. Output: for each user, average popularity of movies they watch.
36. Step 1. Compute movie popularity: group views by movie & count.
37. Step 2. Rank movies: sort by popularity.
38. Step 3. Merge view and ranking data: join views and movie popularity tables.
39. Step 4. Compute eccentricity: group views/ranking by user and compute eccentricity.
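
The four steps above can be strung together in a short sketch. It reuses the (User, Movie) pairs from the earlier join example and makes two assumptions the slides do not spell out: eccentricity is taken to be the mean popularity rank of a user's movies, following the stated output of the long-tail task, and ties in view count are broken by movie id.

```python
from collections import Counter, defaultdict
from statistics import mean

views = [(23, 829), (789, 24), (234, 5678), (7, 24),
         (789, 90), (23, 758), (23, 39), (2, 782)]      # (user, movie)


def eccentricity_by_user(views):
    # Step 1: movie popularity = number of views per movie.
    popularity = Counter(movie for _user, movie in views)
    # Step 2: rank movies, 1 = most viewed (assumed tie-break: movie id).
    ordered = sorted(popularity, key=lambda m: (-popularity[m], m))
    rank = {movie: i for i, movie in enumerate(ordered, start=1)}
    # Step 3: join each view with the rank of its movie.
    ranked_views = [(user, rank[movie]) for user, movie in views]
    # Step 4: group by user and average the ranks (assumed eccentricity).
    by_user = defaultdict(list)
    for user, movie_rank in ranked_views:
        by_user[user].append(movie_rank)
    return {user: mean(ranks) for user, ranks in by_user.items()}


print(eccentricity_by_user(views))
```

Each step corresponds to one of the earlier building blocks: a group-and-count, a sort, a join, and a group average.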
40. Pig Latin: A Not-So-Foreign Language for Data Processing. Olston, Reed, Srivastava, Kumar, and Tomkins. SIGMOD, 2008.