Computational Social Science, Lecture 04: Counting at Scale, Part II

  1. Counting @ Scale, Part II. Sharad Goel, Columbia University. Computational Social Science: Lecture 4, February 15, 2013.
  2. Descriptive statistics (as opposed to inferential statistics) is about counting: contingency tables; means, variances, quantiles; summaries of conditional distributions.
  3. Counting @ scale: conceptually easy, computationally hard.
  4. MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004.
  5. Map: assign each input line to one or more groups. Shuffle: aggregate groups. Reduce: operate on grouped data.
  6. Map: assign each input line to one or more groups, v → [(k1, v1), …, (km, vm)]. Shuffle: aggregate groups. Reduce: operate on grouped data, (k, [v1, …, vn]) → [w1, …, wp].
  7. Group Average. Input: views (user, movie, rating). Output: average (mean & median) by movie.
  8. Group Average. Map: identity (key := movie). Reduce: movie group → average.
  9. The Insight of MapReduce. One can efficiently group identical items. Many tasks are computationally easier on grouped data.
  10. Filter. Input: arbitrary data & filter condition. Output: subset of input data satisfying condition.
  11. Filter. Map: input → input if condition(input) else pass. Reduce: identity.
  12. Distinct. Input: set of items. Output: subset of distinct items.
  13. Distinct. Map: identity. Reduce: grouped items → single item from group.
  14. Sample. Input: set of items & sample probability p. Output: random subset of items.
  15. Sample. Map: input → input if rand(0,1) < p else pass. Reduce: identity.
  16. Sort. Input: set of items (and a key). Output: ordered set of items.
  17. Sort. Map: identity, with all data assigned to the same key. Reduce: identity. *All the work happens in the shuffle.
  18. Sort. Map: identity, with key := first letter of line. Reduce: identity. *All the work happens in the shuffle.
  19. Sort. Sample: generate a small sample of the data (with MapReduce). Determine breakpoints: sort the sample and compute percentiles.
  20. Sort. Map: identity, with key determined by breakpoints. Reduce: identity. *Most of the work happens in the shuffle.
  21. Combining data. Example: for each user, want to compute the average popularity of the movies they watch. Problem: one file contains views (user, movie); another file contains popularity (movie, rank).
  22. Joins. Inputs (left two tables) and joined result (right table):
        User  Movie      Movie  Rank      User  Movie  Rank
        23    829        5678   4         23    829    34
        789   24         24     100       789   24     100
        234   5678       829    34        234   5678   4
        7     24                          7     24     100
  23. Nested-Loop Joins
        for each user in users:
          for each movie in movies:
            if user.movie_id == movie.id:
              output user.id, movie.id, movie.rank
  24. Sort-Merge Joins. Inputs sorted by movie (left two tables) and joined result (right table):
        User  Movie      Movie  Rank      User  Movie  Rank
        789   24         24     100       789   24     100
        7     24         829    34        7     24     100
        23    829        5678   4         23    829    34
        234   5678                        234   5678   4
  25. Hash Joins
        User  Movie      Movie  Rank
        23    829        5678   4
        789   24         24     100
        234   5678       829    34
        7     24
  26. Distributed Joins. Map: reduce key := hash(join key). Reduce: local (sort-merge) join. *Also need to keep track of which table is the left and which is the right.
  27. Joins: { inner, left, right, outer }
  28. User/Movie table and User/Sex table:
        User  Movie      User  Sex
        23    829        23    male
        789   24         789   female
        234   5678       234   female
        7     24         7     male
        789   90         26    male
        23    758        567   female
        23    39         2     female
        2     782
  29. User/Sex table and User/Activity table:
        User  Sex        User  Activity
        23    male       23    3
        789   female     789   2
        234   female     234   1
        7     male       7     1
        26    male       789   90
        567   female     2     1
        2     female
  30. Left Join. Inputs (left two tables) and result (right table):
        User  Sex        User  Activity      User  Sex     Activity
        23    male       23    3             23    male    3
        789   female     789   2             789   female  2
        234   female     234   1             234   female  1
        7     male       7     1             7     male    1
        26    male       789   90            26    male
        567   female     2     1             567   female
        2     female                         2     female  1
  31. Inner Join. Inputs (left two tables) and result (right table):
        User  Sex        User  Activity      User  Sex     Activity
        23    male       23    3             23    male    3
        789   female     789   2             789   female  2
        234   female     234   1             234   female  1
        7     male       7     1             7     male    1
        26    male       789   90            2     female  1
        567   female     2     1
        2     female
  32. Inner join: returns pairs of rows in tables A & B that match the join condition. Left (outer) join: returns all rows from an inner join plus rows in the left table that do not match the right table. Full (outer) join: returns all rows from an inner join plus rows in either table that do not match the other.
  33. Map-side Joins. Map: load (smaller) table into memory; stream through (larger) table and find matches. Reduce: identity.
  34. MapReduce Ops. Map-only: filter, sample, map-side joins. Map & Reduce: groupby, distinct, sort, join.
  35. The long tail. Input: (user, movie) views. Output: for each user, average popularity of movies they watch.
  36. Step 1. Compute movie popularity: group views by movie & count.
  37. Step 2. Rank movies: sort by popularity.
  38. Step 3. Merge view and ranking data: join views and movie popularity tables.
  39. Step 4. Compute eccentricity: group views/ranking by user and compute eccentricity.
  40. Pig Latin: A Not-So-Foreign Language for Data Processing. Olston, Reed, Srivastava, Kumar, and Tomkins, SIGMOD, 2008.
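
The map, shuffle, reduce pattern of slides 5-6 and the group-average job of slides 7-8 can be sketched in a single process. This is only a toy illustration, not Hadoop or any real framework: the `map_reduce` helper, the function names, and the sample `views` are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median


def map_reduce(records, mapper, reducer):
    """Toy single-process MapReduce: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map: v -> [(k1, v1), ...]
            groups[key].append(value)       # shuffle: collect values per key
    output = []
    for key in sorted(groups):              # reduce: (k, [v1, ...]) -> [w1, ...]
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical (user, movie, rating) views.
views = [(1, "A", 4), (2, "A", 2), (3, "A", 3), (1, "B", 5), (2, "B", 1)]


def by_movie(view):
    """Map: identity on the data, with key := movie."""
    user, movie, rating = view
    return [(movie, rating)]


def average(movie, ratings):
    """Reduce: movie group -> (movie, mean, median)."""
    return [(movie, mean(ratings), median(ratings))]


movie_averages = map_reduce(views, by_movie, average)
```

In a real cluster the shuffle is the distributed step; here it is just a dictionary of lists.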
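
Filter, distinct, and sample (slides 10-15) fit the same mold. The sketch below assumes a toy in-memory setting; `map_only`, `map_reduce`, and the input list are hypothetical names and data, and the random generator is seeded only so the example is repeatable.

```python
import random
from collections import defaultdict


def map_only(records, mapper):
    """Map-only job: there is no shuffle, and the identity reduce is omitted."""
    return [out for record in records for out in mapper(record)]


def map_reduce(records, mapper, reducer):
    """Toy map, shuffle, reduce over an in-memory list."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


items = [3, 1, 3, 2, 1, 3]  # hypothetical input

# Filter: emit the input only if it satisfies the condition (here: is odd).
odds = map_only(items, lambda x: [x] if x % 2 == 1 else [])

# Distinct: key each item by itself, then keep a single item from each group.
distinct = map_reduce(items, lambda x: [(x, x)], lambda key, group: [group[0]])

# Sample: keep each item independently with probability p.
rng = random.Random(0)
p = 0.5
sample = map_only(items, lambda x: [x] if rng.random() < p else [])
```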
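
The sampled-breakpoint sort of slides 19-20 might look roughly like this. The data, the sample size, and the number of reducers are assumptions, and a fixed-size `rng.sample` stands in for the MapReduce sampling job the slide mentions.

```python
import random
from bisect import bisect_right
from collections import defaultdict

rng = random.Random(0)
data = [rng.randrange(1000) for _ in range(200)]  # hypothetical input
num_reducers = 4                                  # hypothetical cluster size

# Sample the data, sort the sample, and read off evenly spaced percentiles.
sample = sorted(rng.sample(data, 40))
breakpoints = [sample[len(sample) * i // num_reducers]
               for i in range(1, num_reducers)]

# Map: identity, with key := index of the breakpoint range the item falls in.
partitions = defaultdict(list)
for x in data:
    partitions[bisect_right(breakpoints, x)].append(x)

# Each partition is sorted independently (the shuffle does this in MapReduce);
# concatenating the partitions in key order yields a total order.
result = [x for key in sorted(partitions) for x in sorted(partitions[key])]
```

Because the breakpoints come from percentiles of the sample, the partitions should be of roughly equal size, which is what the first-letter key on slide 18 does not guarantee.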
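
A sort-merge join (slide 24) sorts both tables on the join key and then walks them in step. The sketch below uses the tables from slide 22; the function name is hypothetical, and it assumes each movie appears at most once in the rank table.

```python
def sort_merge_join(views, ranks):
    """Join (user, movie) rows with (movie, rank) rows on movie."""
    views = sorted(views, key=lambda row: row[1])  # sort views by movie
    ranks = sorted(ranks, key=lambda row: row[0])  # sort ranks by movie
    joined, j = [], 0
    for user, movie in views:
        # Advance the rank cursor past movies smaller than the current one.
        while j < len(ranks) and ranks[j][0] < movie:
            j += 1
        # Assumption: movie ids are unique in the rank table.
        if j < len(ranks) and ranks[j][0] == movie:
            joined.append((user, movie, ranks[j][1]))
    return joined


views = [(23, 829), (789, 24), (234, 5678), (7, 24)]
ranks = [(5678, 4), (24, 100), (829, 34)]
joined = sort_merge_join(views, ranks)
```

After the two sorts, the merge is a single pass over each table, in contrast to the nested loop on slide 23, which compares every pair of rows.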
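
A hash join (slide 25) builds a hash table on one input and probes it with the other. The sketch below also illustrates the inner, left, and full outer variants described on slide 32. The function name, the `how` parameter, and the small tables (including user 99) are hypothetical and chosen only for the example.

```python
def hash_join(left, right, how="inner"):
    """Join (key, a) rows with (key, b) rows; missing sides become None."""
    index = {}
    for key, b in right:              # build: hash the right table by key
        index.setdefault(key, []).append(b)
    out, matched = [], set()
    for key, a in left:               # probe: look up each left row
        if key in index:
            matched.add(key)
            out.extend((key, a, b) for b in index[key])
        elif how in ("left", "full"):
            out.append((key, a, None))
    if how == "full":                 # right rows that never matched
        out.extend((key, None, b) for key, bs in index.items()
                   if key not in matched for b in bs)
    return out


sex = [(23, "male"), (26, "male"), (2, "female")]  # hypothetical rows
activity = [(23, 3), (2, 1), (99, 7)]              # hypothetical rows

inner = hash_join(sex, activity, how="inner")
left = hash_join(sex, activity, how="left")
full = hash_join(sex, activity, how="full")
```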
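
The distributed join of slide 26 can be imitated by routing rows from both tables to a reducer chosen by hashing the join key, with a tag recording which table each row came from. In this sketch the number of reducers is an assumption, and the local join uses a dictionary lookup rather than the sort-merge join the slide names.

```python
from collections import defaultdict

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]  # left: (user, movie)
ranks = [(5678, 4), (24, 100), (829, 34)]             # right: (movie, rank)
num_reducers = 3                                      # hypothetical cluster size

# Map: reduce key := hash(join key); tag each row with its table of origin.
reducers = defaultdict(list)
for user, movie in views:
    reducers[hash(movie) % num_reducers].append(("L", movie, user))
for movie, rank in ranks:
    reducers[hash(movie) % num_reducers].append(("R", movie, rank))

# Reduce: each reducer joins only the rows that were routed to it.
joined = []
for rows in reducers.values():
    rank_of = {movie: rank for tag, movie, rank in rows if tag == "R"}
    joined.extend((user, movie, rank_of[movie])
                  for tag, movie, user in rows
                  if tag == "L" and movie in rank_of)
```

Rows with the same movie always hash to the same reducer, so no match can be missed even though no reducer sees the full tables.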
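
A map-side join (slide 33) avoids the shuffle entirely when one table is small enough to hold in memory. A minimal sketch, assuming the rank table from slide 22 fits in a dictionary; the function name is hypothetical.

```python
def map_side_join(view_stream, rank_of):
    """Stream (user, movie) rows and look each movie up in the in-memory table."""
    for user, movie in view_stream:
        if movie in rank_of:
            yield user, movie, rank_of[movie]


# Smaller table, loaded into memory once per mapper.
rank_of = {5678: 4, 24: 100, 829: 34}

# Larger table, consumed one row at a time.
views = iter([(23, 829), (789, 24), (234, 5678), (7, 24)])

joined = list(map_side_join(views, rank_of))
```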
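
The four steps of the long-tail analysis (slides 36-39) chain the operations above. The slides do not define eccentricity, so this sketch assumes it is the mean popularity rank of the movies a user watched, following the output described on slide 35. The views and the alphabetical tie-breaking rule are also assumptions.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical (user, movie) views.
views = [("u1", "A"), ("u2", "A"), ("u3", "A"),
         ("u1", "B"), ("u2", "B"), ("u3", "C")]

# Step 1: compute movie popularity (group views by movie and count).
popularity = Counter(movie for _, movie in views)

# Step 2: rank movies by popularity; rank 1 is the most popular.
# Ties are broken alphabetically, which is an arbitrary choice.
ordered = sorted(popularity, key=lambda m: (-popularity[m], m))
rank = {movie: i for i, movie in enumerate(ordered, start=1)}

# Step 3: join views with the ranking table.
ranked_views = [(user, movie, rank[movie]) for user, movie in views]

# Step 4: group by user and average the ranks (the assumed eccentricity).
ranks_by_user = defaultdict(list)
for user, _, r in ranked_views:
    ranks_by_user[user].append(r)
eccentricity = {user: mean(rs) for user, rs in ranks_by_user.items()}
```

Under this definition a higher value means the user watches less popular movies on average.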
