Computational Social Science, Lecture 04: Counting at Scale, Part II
1. Counting @ Scale
Part II
Sharad Goel
Columbia University
Computational Social Science: Lecture 4
February 15, 2013
2. Descriptive statistics
(as opposed to inferential statistics)
is about counting
contingency tables
means, variances, quantiles
summaries of conditional distributions
5. Map
assign each input line to one or more groups
Shuffle
aggregate groups
Reduce
operate on grouped data
6. Map
assign each input line to one or more groups
v → [(k1, v1), …, (km, vm)]
Shuffle
aggregate groups
Reduce
operate on grouped data
(k, [v1, …, vn]) → [w1, …, wp]
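The three stages above can be sketched on a single machine as follows. This is a minimal in-memory illustration, not a distributed implementation; the names `map_reduce`, `view_mapper`, and `count_reducer` are hypothetical, and counting views per movie is chosen only as a simple example.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in memory (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        # Map: one input value yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: values that share a key end up in the same group.
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        # Reduce: each (k, [v1, ..., vn]) yields a list of outputs.
        output.extend(reducer(key, values))
    return output

def view_mapper(view):
    user, movie = view
    return [(movie, 1)]

def count_reducer(movie, ones):
    return [(movie, sum(ones))]

# Hypothetical (user, movie) view records.
views = [(23, 829), (789, 24), (234, 5678), (7, 24)]
view_counts = dict(map_reduce(views, view_mapper, count_reducer))
```

In a real cluster the shuffle moves data between machines; here a dictionary of lists stands in for it.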
7. Group Average
Input
views (user, movie, rating)
Output
average (mean & median) by movie
8. Group Average
Map
identity (key := movie)
Reduce
movie group → average
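A minimal sketch of the group average, assuming the views fit in memory and using Python's `statistics` module for the mean and median. The function name `movie_averages` and the sample ratings are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median

def movie_averages(views):
    """Return {movie: (mean rating, median rating)}."""
    groups = defaultdict(list)
    # Map (identity, key := movie) followed by the shuffle.
    for user, movie, rating in views:
        groups[movie].append(rating)
    # Reduce: summarize each movie's group of ratings.
    return {movie: (mean(r), median(r)) for movie, r in groups.items()}

# Hypothetical (user, movie, rating) records.
ratings = [
    (1, "A", 5), (2, "A", 4), (3, "A", 3),
    (1, "B", 1), (2, "B", 2), (3, "B", 5), (4, "B", 5),
]
averages = movie_averages(ratings)
```

Note that the median needs the whole group at once, whereas the mean could also be computed from running sums and counts.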
9. The Insight of MapReduce
One can efficiently group identical items
Many tasks are computationally easier on grouped data
10. Filter
Input
arbitrary data & filter condition
Output
subset of input data satisfying condition
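One plausible way to express a filter in this model is as a map-only job: the map step emits a record only when the condition holds, and the reduce step is the identity. The sketch below assumes that reading; `mapreduce_filter` is a hypothetical name.

```python
def mapreduce_filter(records, condition):
    """Keep the records for which condition(record) is true."""
    # Map: emit the record if it satisfies the condition, else nothing.
    mapped = [record for record in records if condition(record)]
    # Reduce: identity, so the mapped records are the output.
    return mapped

# Hypothetical (user, movie, rating) records; keep ratings of 4 or more.
sample_views = [(1, "A", 5), (2, "A", 2), (3, "B", 4)]
high_ratings = mapreduce_filter(sample_views, lambda view: view[2] >= 4)
```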
12. Distinct
Input
set of items
Output
subset of distinct items
13. Distinct
Map
identity
Reduce
grouped items → single item from group
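A small in-memory sketch of distinct, assuming the items are hashable so they can serve as their own keys. The function name `distinct` is illustrative.

```python
from collections import defaultdict

def distinct(items):
    """Return one representative of each group of identical items."""
    groups = defaultdict(list)
    # Map: identity, with the item itself as the key; the shuffle
    # then places identical items in the same group.
    for item in items:
        groups[item].append(item)
    # Reduce: emit a single item from each group.
    return [group[0] for group in groups.values()]

unique_movies = distinct([829, 24, 5678, 24, 829])
```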
14. Sample
Input
set of items & sample probability p
Output
random subset of items
15. Sample
Map
input → input if rand(0,1) < p else pass
Reduce
identity
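The sampling rule can be sketched as below. The `seed` parameter is an addition for reproducibility and is not part of the slide; `sample` is an illustrative name.

```python
import random

def sample(items, p, seed=None):
    """Keep each item independently with probability p."""
    rng = random.Random(seed)
    # Map: emit the item if a uniform draw in [0, 1) falls below p.
    # Reduce: identity, so nothing further is needed.
    return [item for item in items if rng.random() < p]

subset = sample(range(1000), p=0.1, seed=42)
```

The size of the result is random: each item is kept independently, so the expected size is p times the input size rather than an exact count.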
16. Sort
Input
set of items (and a key)
Output
ordered set of items
17. Sort
Map
identity, with all data assigned to the same key
Reduce
identity
*all the work happens in the shuffle
18. Sort
Map
identity, with key := first letter of line
Reduce
identity
*all the work happens in the shuffle
19. Sort
Sample
generate a small sample of the data (with MapReduce)
Determine breakpoints
sort the sample and compute percentiles
20. Sort
Map
identity, with key determined by breakpoints
Reduce
identity
*most of the work happens in the shuffle
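The sample-then-partition sort can be sketched on one machine as follows. This is an illustration under stated assumptions: partitions are plain lists, `sorted` stands in for the ordering done by the shuffle, and the names `breakpoints_from_sample` and `partitioned_sort` and their parameters are hypothetical.

```python
import bisect
import random

def breakpoints_from_sample(items, n_partitions, sample_p, seed=0):
    """Sort a random sample and read off evenly spaced percentiles."""
    rng = random.Random(seed)
    sampled = sorted(x for x in items if rng.random() < sample_p)
    if not sampled:
        return []
    return [sampled[len(sampled) * i // n_partitions]
            for i in range(1, n_partitions)]

def partitioned_sort(items, n_partitions=4, sample_p=0.1, seed=0):
    cuts = breakpoints_from_sample(items, n_partitions, sample_p, seed)
    partitions = [[] for _ in range(len(cuts) + 1)]
    # Map: identity, with the key determined by the breakpoints.
    for x in items:
        partitions[bisect.bisect_right(cuts, x)].append(x)
    # Shuffle: each partition is ordered independently (in parallel on
    # a cluster). Reduce: identity. Partitions are already ordered
    # relative to one another, so concatenating them gives a full sort.
    result = []
    for part in partitions:
        result.extend(sorted(part))
    return result

rng = random.Random(1)
data = [rng.randint(0, 1000) for _ in range(500)]
sorted_data = partitioned_sort(data)
```

Breakpoints taken from a sample tend to give partitions of similar size, which is the weakness of the fixed first-letter key: some letters are far more common than others.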
21. Combining data
Example
for each user, want to compute the
average popularity of the movies they watch
Problem
one file contains views (user, movie);
another file contains popularity (movie, rank)
22. Joins
Views
User  Movie
23    829
789   24
234   5678
7     24
Popularity
Movie  Rank
5678   4
24     100
829    34
Joined
User  Movie  Rank
23    829    34
789   24     100
234   5678   4
7     24     100
23. Nested-Loop Joins
for each user in users:
    for each movie in movies:
        if user.movie_id == movie.id:
            output user.id, movie.id, movie.rank
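A runnable version of the nested-loop join, assuming both tables are lists of tuples. The data is the views and popularity example from the Joins slide; `nested_loop_join` is an illustrative name.

```python
def nested_loop_join(views, popularity):
    """Compare every view with every popularity row: O(n * m) work."""
    joined = []
    for user, movie in views:
        for pop_movie, rank in popularity:
            if movie == pop_movie:
                joined.append((user, movie, rank))
    return joined

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]   # (user, movie)
popularity = [(5678, 4), (24, 100), (829, 34)]         # (movie, rank)
nested_result = nested_loop_join(views, popularity)
```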
24. Sort-Merge Joins
Views (sorted by movie)
User  Movie
789   24
7     24
23    829
234   5678
Popularity (sorted by movie)
Movie  Rank
24     100
829    34
5678   4
Joined
User  Movie  Rank
789   24     100
7     24     100
23    829    34
234   5678   4
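A sketch of the sort-merge join on the same example data. It assumes each movie appears at most once in the popularity table, so only the views pointer advances on a match; `sort_merge_join` is an illustrative name.

```python
def sort_merge_join(views, popularity):
    """Sort both tables on the join key, then merge in one pass."""
    left = sorted(views, key=lambda row: row[1])        # by movie
    right = sorted(popularity, key=lambda row: row[0])  # by movie
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        user, movie = left[i]
        pop_movie, rank = right[j]
        if movie < pop_movie:
            i += 1      # this view has no matching popularity row
        elif movie > pop_movie:
            j += 1      # move on to the next popularity row
        else:
            joined.append((user, movie, rank))
            i += 1      # assumes movies are unique in popularity
    return joined

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]   # (user, movie)
popularity = [(5678, 4), (24, 100), (829, 34)]         # (movie, rank)
merge_result = sort_merge_join(views, popularity)
```

After sorting, the merge touches each row once, so the cost is dominated by the two sorts.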
25. Hash Joins
User Movie
23 829
789 24
234 5678
7 24
Movie Rank
5678 4
24 100
829 34
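A sketch of a hash join on the same tables: build a hash table on the popularity table, then probe it once per view. It assumes the popularity table fits in memory and has one rank per movie; `hash_join` is an illustrative name.

```python
def hash_join(views, popularity):
    """Build a hash table on one table and probe it with the other."""
    # Build phase: movie -> rank lookup.
    rank_by_movie = dict(popularity)
    # Probe phase: one constant-time lookup per view.
    return [(user, movie, rank_by_movie[movie])
            for user, movie in views
            if movie in rank_by_movie]

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]   # (user, movie)
popularity = [(5678, 4), (24, 100), (829, 34)]         # (movie, rank)
hash_result = hash_join(views, popularity)
```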
26. Distributed Joins
Map
reduce key := hash(join key)
Reduce
local (sort-merge) join
*also need to keep track of which table
is the left and which is the right
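The distributed join can be simulated in memory as below. Each row is tagged with its table ("left" or "right") and routed to a reducer by the hash of the join key. For brevity the local join inside each reducer uses a dictionary lookup rather than the sort-merge join named on the slide; `distributed_join` and `n_reducers` are hypothetical names.

```python
from collections import defaultdict

def distributed_join(views, popularity, n_reducers=3):
    """Simulate a reduce-side join keyed on movie."""
    reducers = defaultdict(list)
    # Map: reduce key := hash(join key); tag each row with its table.
    for user, movie in views:
        reducers[hash(movie) % n_reducers].append(("left", movie, user))
    for movie, rank in popularity:
        reducers[hash(movie) % n_reducers].append(("right", movie, rank))

    joined = []
    # Reduce: each reducer holds every row for its share of the keys,
    # so it can join them locally without seeing the other reducers.
    for rows in reducers.values():
        ranks = {movie: value for side, movie, value in rows
                 if side == "right"}
        for side, movie, value in rows:
            if side == "left" and movie in ranks:
                joined.append((value, movie, ranks[movie]))
    return joined

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]   # (user, movie)
popularity = [(5678, 4), (24, 100), (829, 34)]         # (movie, rank)
distributed_result = distributed_join(views, popularity)
```

The output order depends on how keys fall across reducers, so compare results as sets or after sorting.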
28. Tables: (User, Movie) and (User, Sex)
User  Movie
23    829
789   24
234   5678
7     24
789   90
23    758
23    39
2     782
User  Sex
23    male
789   female
234   female
7     male
26    male
567   female
2     female
29. Tables: (User, Sex) and (User, Activity)
User  Sex
23    male
789   female
234   female
7     male
26    male
567   female
2     female
User  Activity
23    3
789   2
234   1
7     1
2     1
30. Left Join
User  Sex
23    male
789   female
234   female
7     male
26    male
567   female
2     female
User  Activity
23    3
789   2
234   1
7     1
2     1
Left Join
User  Sex     Activity
23    male    3
789   female  2
234   female  1
7     male    1
26    male
567   female
2     female  1
31. Inner Join
User  Sex
23    male
789   female
234   female
7     male
26    male
567   female
2     female
User  Activity
23    3
789   2
234   1
7     1
2     1
Inner Join
User  Sex     Activity
23    male    3
789   female  2
234   female  1
7     male    1
2     female  1
32. Inner join
returns pairs of rows in tables A & B
that match join condition
Left (outer) join
returns all rows from an inner join plus
rows in the left table that do not match to the right table
Full (outer) join
returns all rows from an inner join plus
rows in either table that do not match to the other
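The three join types can be compared with a small sketch. It assumes each table holds at most one row per user and represents a missing value as `None`; the function `join` and its `how` argument are hypothetical names. The data follows the sex and activity values shown in the join results above.

```python
def join(left, right, how="inner"):
    """Join two {key: value} tables; unmatched sides become None."""
    if how == "inner":
        keys = [k for k in left if k in right]
    elif how == "left":
        keys = list(left)
    elif how == "full":
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, left.get(k), right.get(k)) for k in keys]

sex = {23: "male", 789: "female", 234: "female", 7: "male",
       26: "male", 567: "female", 2: "female"}
activity = {23: 3, 789: 2, 234: 1, 7: 1, 2: 1}

inner_rows = join(sex, activity, how="inner")
left_rows = join(sex, activity, how="left")
full_rows = join(sex, activity, how="full")
```

Here the full join returns the same rows as the left join, because every user in the activity table also appears in the sex table.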
33. Map-side Joins
Map
load (smaller) table into memory
stream through (larger) table and find matches
Reduce
identity
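A sketch of the map-side join, assuming the smaller table fits in the memory of every mapper. The larger table is streamed one row at a time through a generator, and no shuffle is needed; `map_side_join` is an illustrative name.

```python
def map_side_join(small_table, large_table_rows):
    """Yield joined rows while streaming through the larger table."""
    # Load the (smaller) table into memory as a lookup.
    rank_by_movie = dict(small_table)
    # Stream through the (larger) table and emit matches as found.
    for user, movie in large_table_rows:
        if movie in rank_by_movie:
            yield user, movie, rank_by_movie[movie]

popularity = [(5678, 4), (24, 100), (829, 34)]           # smaller table
views = iter([(23, 829), (789, 24), (234, 5678), (7, 24)])  # streamed
map_side_result = list(map_side_join(popularity, views))
```

This is a hash join performed inside each mapper, which is why the reduce step is the identity.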