Computational Social Science, Lecture 04: Counting at Scale, Part II
Presentation Transcript

  • Counting @ Scale, Part II. Sharad Goel, Columbia University. Computational Social Science: Lecture 4, February 15, 2013
  • Descriptive statistics (as opposed to inferential statistics) is about counting: contingency tables; means, variances, quantiles; summaries of conditional distributions
  • Counting @ scale: conceptually easy, computationally hard
  • MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI, 2004
  • Map: assign each input line to one or more groups. Shuffle: aggregate groups. Reduce: operate on grouped data.
  • Map: assign each input line to one or more groups, v → [(k1, v1), …, (km, vm)]. Shuffle: aggregate groups. Reduce: operate on grouped data, (k, [v1, …, vn]) → [w1, …, wp].
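The three phases can be imitated on a single machine to make the signatures concrete. The following is only a toy sketch, not how a real cluster framework is implemented; `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names introduced here, and word count is used as the example input.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory MapReduce: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map: each input value yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key.
            groups[key].append(value)
    # Reduce: each (key, [v1, ..., vn]) yields zero or more outputs.
    for key, values in groups.items():
        yield from reducer(key, values)


def wc_map(line):
    # One (word, 1) pair per word on the line.
    return [(word, 1) for word in line.split()]


def wc_reduce(word, counts):
    # One (word, total) pair per group.
    return [(word, sum(counts))]


word_counts = dict(run_mapreduce(["a b a", "b c"], wc_map, wc_reduce))
```

On a real cluster the shuffle moves data between machines; here it is just a dictionary of lists.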
  • Group Average. Input: views (user, movie, rating). Output: average (mean & median) by movie.
  • Group Average. Map: identity (key := movie). Reduce: movie group → average.
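As a rough illustration of the group average, the sketch below keys each view by movie and averages the grouped ratings. The `views` rows and the function names `map_view` and `reduce_movie` are made up for this example, and the shuffle is simulated with a dictionary.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (user, movie, rating) rows.
views = [("u1", "m1", 4), ("u2", "m1", 2), ("u3", "m1", 3), ("u1", "m2", 5)]


def map_view(view):
    # Map: identity on the record, with key := movie.
    user, movie, rating = view
    yield movie, rating


def reduce_movie(movie, ratings):
    # Reduce: one summary row per movie group.
    yield movie, mean(ratings), median(ratings)


# Shuffle: group ratings by movie.
grouped = defaultdict(list)
for view in views:
    for movie, rating in map_view(view):
        grouped[movie].append(rating)

averages = [row for movie, ratings in grouped.items()
            for row in reduce_movie(movie, ratings)]
```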
  • The Insight of MapReduce: One can efficiently group identical items. Many tasks are computationally easier on grouped data.
  • Filter. Input: arbitrary data & filter condition. Output: subset of input data satisfying condition.
  • Filter. Map: input → input if condition(input) else pass. Reduce: identity.
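A filter needs no grouping, so the map step alone does the work. A minimal sketch, assuming records are tuples and the condition is an ordinary Python predicate (`filter_map` and the sample rows are hypothetical):

```python
def filter_map(record, condition):
    """Map step: emit the record if it satisfies the condition, else nothing."""
    if condition(record):
        yield record


# Hypothetical (user, movie, rating) rows; keep ratings of 4 or more.
ratings = [("u1", "m1", 4), ("u2", "m1", 2), ("u3", "m2", 5)]
high = [out for row in ratings
        for out in filter_map(row, lambda r: r[2] >= 4)]
```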
  • Distinct. Input: set of items. Output: subset of distinct items.
  • Distinct. Map: identity. Reduce: grouped items → single item from group.
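For distinct, the item itself serves as the grouping key, so identical items land in the same group and the reducer keeps one of them. A small single-machine sketch (the function name `distinct` and the input list are illustrative):

```python
from collections import defaultdict


def distinct(items):
    groups = defaultdict(list)
    for item in items:
        # Map: identity, keyed by the item itself; shuffle groups duplicates.
        groups[item].append(item)
    # Reduce: emit a single representative from each group.
    return [values[0] for values in groups.values()]


unique_movies = distinct([24, 829, 24, 5678, 829])
```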
  • Sample. Input: set of items & sample probability p. Output: random subset of items.
  • Sample. Map: input → input if rand(0,1) < p else pass. Reduce: identity.
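Sampling is a filter whose condition is a coin flip. The sketch below assumes a seeded `random.Random` instance so the run is repeatable; `sample_map` is a hypothetical name, and the sample size is only approximately p times the input size.

```python
import random


def sample_map(record, p, rng):
    """Map step: keep each record independently with probability p."""
    if rng.random() < p:
        yield record


rng = random.Random(0)  # fixed seed, chosen only for repeatability
items = range(10_000)
sampled = [out for item in items for out in sample_map(item, 0.1, rng)]
```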
  • Sort. Input: set of items (and a key). Output: ordered set of items.
  • Sort. Map: identity, with all data assigned to the same key. Reduce: identity. *All the work happens in the shuffle.
  • Sort. Map: identity, with key := first letter of line. Reduce: identity. *All the work happens in the shuffle.
  • Sort. Sample: generate a small sample of the data (with MapReduce). Determine breakpoints: sort the sample and compute percentiles.
  • Sort. Map: identity, with key determined by breakpoints. Reduce: identity. *Most of the work happens in the shuffle.
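The breakpoint approach can be sketched on one machine as follows. This is an illustration under assumptions, not a description of any particular framework: `choose_breakpoints` and `distributed_sort` are hypothetical names, the number of partitions and sample size are arbitrary, and each partition is sorted with Python's built-in sort to stand in for the shuffle.

```python
import bisect
import random
from collections import defaultdict


def choose_breakpoints(sample, num_partitions):
    """Sort the sample and take evenly spaced quantiles as breakpoints."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // num_partitions]
            for i in range(1, num_partitions)]


def distributed_sort(items, num_partitions=4, sample_size=100, seed=0):
    items = list(items)
    if not items:
        return []
    rng = random.Random(seed)
    sample = rng.sample(items, min(sample_size, len(items)))
    breakpoints = choose_breakpoints(sample, num_partitions)
    # Map: key := index of the range (between breakpoints) holding the item.
    partitions = defaultdict(list)
    for item in items:
        partitions[bisect.bisect_right(breakpoints, item)].append(item)
    # Shuffle sorts within each key; reduce is the identity. Reading the
    # partitions in key order then gives a globally sorted result.
    result = []
    for key in sorted(partitions):
        result.extend(sorted(partitions[key]))
    return result


data_rng = random.Random(1)
data = [data_rng.randrange(1000) for _ in range(500)]
sorted_data = distributed_sort(data)
```

Because the breakpoints come from a sample, the partitions are only roughly equal in size, which is what spreads the sorting work across reducers.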
  • Combining data. Example: for each user, want to compute the average popularity of the movies they watch. Problem: one file contains views (user, movie); another file contains popularity (movie, rank).
  • Joins
      User  Movie      Movie  Rank      User  Movie  Rank
      23    829        5678   4         23    829    34
      789   24         24     100       789   24     100
      234   5678       829    34        234   5678   4
      7     24                          7     24     100
  • Nested-Loop Joins
      For each user in users:
          For each movie in movies:
              if user.movie_id == movie.id:
                  output user.id, movie.id, movie.rating
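A runnable version of the nested-loop join might look like the sketch below. It assumes the rows are plain tuples taken from the Joins slide, and it outputs the movie's rank (the column shown in those tables) rather than a rating; `nested_loop_join` is a hypothetical name.

```python
views = [(23, 829), (789, 24), (234, 5678), (7, 24)]  # (user, movie)
ranks = [(5678, 4), (24, 100), (829, 34)]             # (movie, rank)


def nested_loop_join(views, ranks):
    """Compare every view with every movie: O(len(views) * len(ranks))."""
    joined = []
    for user, viewed_movie in views:
        for movie, rank in ranks:
            if viewed_movie == movie:
                joined.append((user, movie, rank))
    return joined


joined_nested = nested_loop_join(views, ranks)
```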
  • Sort-Merge Joins
      User  Movie      Movie  Rank      User  Movie  Rank
      789   24         24     100       789   24     100
      7     24         829    34        7     24     100
      23    829        5678   4         23    829    34
      234   5678                        234   5678   4
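One way to sketch a sort-merge join: sort both tables by movie, then advance two cursors in step. The version below assumes each movie appears at most once in the rank table (true of the slide's data), which keeps the merge loop short; `sort_merge_join` is a hypothetical name.

```python
def sort_merge_join(views, ranks):
    """Join (user, movie) with (movie, rank); movie ids unique in ranks."""
    views = sorted(views, key=lambda row: row[1])  # sort by movie
    ranks = sorted(ranks, key=lambda row: row[0])  # sort by movie
    joined = []
    i = j = 0
    while i < len(views) and j < len(ranks):
        view_movie, rank_movie = views[i][1], ranks[j][0]
        if view_movie < rank_movie:
            i += 1          # view has no matching movie; skip it
        elif view_movie > rank_movie:
            j += 1          # done with this movie; move to the next rank row
        else:
            joined.append((views[i][0], view_movie, ranks[j][1]))
            i += 1          # stay on the rank row: more views may match it
    return joined


joined_merge = sort_merge_join(
    [(23, 829), (789, 24), (234, 5678), (7, 24)],
    [(5678, 4), (24, 100), (829, 34)],
)
```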
  • Hash Joins
      User  Movie      Movie  Rank
      23    829        5678   4
      789   24         24     100
      234   5678       829    34
      7     24
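A hash join builds a lookup table on one input and probes it with the other. The sketch below uses a Python dict as the hash table and assumes movie ids are unique in the rank table; `hash_join` is a hypothetical name.

```python
def hash_join(views, ranks):
    # Build phase: hash the (movie, rank) table on movie.
    rank_by_movie = dict(ranks)
    # Probe phase: one constant-time lookup per view.
    return [(user, movie, rank_by_movie[movie])
            for user, movie in views
            if movie in rank_by_movie]


joined_hash = hash_join(
    [(23, 829), (789, 24), (234, 5678), (7, 24)],
    [(5678, 4), (24, 100), (829, 34)],
)
```

The build side has to fit in memory, so it is usually the smaller of the two tables.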
  • Distributed Joins. Map: reduce key := hash(join key). Reduce: local (sort-merge) join. *Also need to keep track of which table is the left and which is the right.
  • Joins: { inner, left, right, outer }
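The sketch below simulates a reduce-side join under several assumptions: the reducer count of 3, the `RANKS`/`VIEWS` tags, and all function names are made up for illustration, and each movie is assumed to appear once in the rank table. The tag plays the role of remembering which table a row came from.

```python
from collections import defaultdict

RANKS, VIEWS = 0, 1  # table tags; a movie's rank row sorts before its views


def reduce_join(rows):
    """Local sort-merge style join over tagged (movie, tag, payload) rows."""
    joined = []
    current_movie, current_rank = None, None
    for movie, tag, payload in sorted(rows):
        if tag == RANKS:
            current_movie, current_rank = movie, payload
        elif movie == current_movie:
            joined.append((payload, movie, current_rank))
    return joined


def distributed_join(views, ranks, num_reducers=3):
    # Map: tag each row with its table and route it by hash of the join key.
    shuffled = defaultdict(list)
    for user, movie in views:
        shuffled[hash(movie) % num_reducers].append((movie, VIEWS, user))
    for movie, rank in ranks:
        shuffled[hash(movie) % num_reducers].append((movie, RANKS, rank))
    # Reduce: each reducer joins only the rows routed to it.
    joined = []
    for reducer_id in sorted(shuffled):
        joined.extend(reduce_join(shuffled[reducer_id]))
    return joined


joined_dist = distributed_join(
    [(23, 829), (789, 24), (234, 5678), (7, 24)],
    [(5678, 4), (24, 100), (829, 34)],
)
```

Rows with the same movie always hash to the same reducer, so no match is missed even though no reducer sees the whole input.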
  • User  Movie      User  Sex
    23    829        23    male
    789   24         789   female
    234   5678       234   female
    7     24         7     male
    789   90         26    male
    23    758        567   female
    23    39         2     female
    2     782
  • User  Sex        User  Activity
    23    male       23    3
    789   female     789   2
    234   female     234   1
    7     male       7     1
    26    male       789   90
    567   female     2     1
    2     female
  • User  Sex        User  Activity
    23    male       23    3
    789   female     789   2
    234   female     234   1
    7     male       7     1
    26    male       789   90
    567   female     2     1
    2     female

    Left Join
    User  Sex     Activity
    23    male    3
    789   female  2
    234   female  1
    7     male    1
    26    male
    567   female
    2     female  1
  • User  Sex        User  Activity
    23    male       23    3
    789   female     789   2
    234   female     234   1
    7     male       7     1
    26    male       789   90
    567   female     2     1
    2     female

    Inner Join
    User  Sex     Activity
    23    male    3
    789   female  2
    234   female  1
    7     male    1
    2     female  1
  • Inner join: returns pairs of rows in tables A & B that match the join condition. Left (outer) join: returns all rows from an inner join plus rows in the left table that have no match in the right table. Full (outer) join: returns all rows from an inner join plus rows in either table that have no match in the other.
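The three join types differ only in which keys survive. A minimal sketch, assuming each table is a dict with unique keys and using `None` for a missing value; the function `join` and the small tables are illustrative, and user 999 is invented to show a row that exists only on the right.

```python
def join(left, right, how="inner"):
    """Join two dicts keyed by a unique id; missing values become None."""
    if how == "inner":
        keys = [k for k in left if k in right]
    elif how == "left":
        keys = list(left)
    elif how == "full":
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, left.get(k), right.get(k)) for k in keys]


sex = {23: "male", 789: "female", 26: "male"}
activity = {23: 3, 789: 2, 999: 5}  # user 999 is hypothetical

inner_rows = join(sex, activity, "inner")
left_rows = join(sex, activity, "left")
full_rows = join(sex, activity, "full")
```

A right join is the same as a left join with the two tables swapped.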
  • Map-side Joins. Map: load (smaller) table into memory; stream through (larger) table and find matches. Reduce: identity.
  • MapReduce Ops. Map-only: filter, sample, map-side joins. Map & Reduce: groupby, distinct, sort, join.
  • The long tail. Input: (user, movie) views. Output: for each user, average popularity of movies they watch.
  • Step 1. Compute movie popularity: group views by movie & count.
  • Step 2. Rank movies: sort by popularity.
  • Step 3. Merge view and ranking data: join views and movie popularity tables.
  • Step 4. Compute eccentricity: group views/ranking by user and compute eccentricity.
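The four steps can be strung together in memory as a rough sketch. The slides do not define eccentricity, so this example assumes it is the mean popularity rank of the movies a user watched (rank 1 = most viewed), following the stated output; the view data and the tie-breaking rule are also made up.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical (user, movie) views.
views = [("a", "m1"), ("b", "m1"), ("c", "m1"),
         ("a", "m2"), ("b", "m2"), ("c", "m3")]

# Step 1: group views by movie and count.
popularity = Counter(movie for _, movie in views)

# Step 2: sort by popularity; rank 1 is the most viewed movie.
# Ties are broken by movie id (an arbitrary choice for this sketch).
ordered = sorted(popularity, key=lambda m: (-popularity[m], m))
rank = {movie: position for position, movie in enumerate(ordered, start=1)}

# Step 3: join views with the ranking table.
views_with_rank = [(user, movie, rank[movie]) for user, movie in views]

# Step 4: group by user and average the ranks (assumed eccentricity).
ranks_by_user = defaultdict(list)
for user, _, movie_rank in views_with_rank:
    ranks_by_user[user].append(movie_rank)
eccentricity = {user: mean(r) for user, r in ranks_by_user.items()}
```

Each step maps onto an operation from earlier in the lecture: group-and-count, sort, join, and group-and-average.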
  • Pig Latin: A Not-So-Foreign Language for Data Processing. Olston, Reed, Srivastava, Kumar, and Tomkins. SIGMOD, 2008