Map Reduce: An Example (James Grant at Big Data Brighton)
Who am I?
My name is James Grant (email@example.com).
I'm a developer here at Brandwatch.
For the last three years I've been a Data Engineer at Last.fm and the maintainer of their Hadoop cluster.
Coming up…
● What happens during MapReduce?
● Plays and Reach from music listening data
● The Mapper pseudo code
● The Reducer pseudo code
● The result
● What if…?
What happens during MapReduce?
[Diagram: Input Data is split into Data Fragments; each fragment goes to a Mapper, producing Map Output; the Map Output passes through Sort to become Reduce Input; each Reducer writes an Output Data Fragment.]
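The three phases in the diagram can be imitated in a single process. This is only a minimal sketch for intuition, not how Hadoop is implemented: `run_mapreduce`, `count_mapper` and `count_reducer` are hypothetical names, and a real cluster spreads the fragments across many machines.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Simulate the map, sort and reduce phases in one process (illustrative only)."""
    # Map phase: every input record may emit zero or more (key, value) pairs.
    map_output = [pair for record in records for pair in mapper(record)]
    # Sort phase: order by key so that each key's values sit next to each other.
    map_output.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per key with all of that key's values.
    results = []
    for key, group in groupby(map_output, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(reducer(key, values))
    return results

# Hypothetical mapper/reducer pair: count how often each record occurs.
def count_mapper(record):
    yield record, 1

def count_reducer(key, values):
    yield key, sum(values)

counts = run_mapreduce(["b", "a", "b"], count_mapper, count_reducer)
print(counts)  # [('a', 1), ('b', 2)]
```

The sort step is what guarantees that a reducer sees every value for a key in one call.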
Plays and Reach from music listening data
● Plays - The number of times that song has been played
● Reach - The number of unique listeners to that song
● Similar to hits and uniques for web properties
● Input data has columns for user id and song id (amongst others)
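To make the two definitions concrete, here is a small worked example on made-up data. The `listens` list and its ids are assumptions for illustration, not real listening data.

```python
# Hypothetical sample of (user id, song id) listening events.
listens = [(1, 10), (2, 10), (1, 10), (3, 20)]

song = 10
# Every event for this song, one entry per play.
listeners = [user for user, s in listens if s == song]

plays = len(listeners)       # user 1 played it twice, user 2 once: 3 plays
reach = len(set(listeners))  # only users 1 and 2 listened: reach of 2
print(plays, reach)  # 3 2
```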
The Mapper
function map(Integer user, Integer song):
    emit(song, user);
The Reducer
function reduce(Integer song, Iterator users):
    Integer plays = 0;
    Set uniqueUsers = {};
    foreach user in users:
        increment plays;
        if user not within uniqueUsers:
            uniqueUsers.add(user);
    result.plays = plays;
    result.reach = uniqueUsers.cardinality();
    emit(song, result);
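The mapper and reducer pseudo code translate fairly directly into Python. The sketch below runs them in a single process over made-up data: `map_listen`, `reduce_song` and the sample `listens` are hypothetical names, and the sort and group step stands in for what the framework would do between the two phases.

```python
from itertools import groupby
from operator import itemgetter

def map_listen(user, song):
    # Key each listen by song so that all listeners of a song reach one reducer call.
    yield song, user

def reduce_song(song, users):
    plays = 0
    unique_users = set()
    for user in users:
        plays += 1              # every listen counts as a play
        unique_users.add(user)  # a set ignores users it already holds
    yield song, {"plays": plays, "reach": len(unique_users)}

# Hypothetical (user id, song id) input rows.
listens = [(1, 10), (2, 10), (1, 10), (3, 20)]

# Map, then sort by song id: the stand-in for the framework's sort phase.
pairs = sorted(pair for user, song in listens for pair in map_listen(user, song))

results = {}
for song, group in groupby(pairs, key=itemgetter(0)):
    for key, result in reduce_song(song, (user for _, user in group)):
        results[key] = result

print(results)  # {10: {'plays': 3, 'reach': 2}, 20: {'plays': 1, 'reach': 1}}
```

Note that the reducer holds every distinct user of a song in memory at once, which is worth bearing in mind for very popular songs.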
What if…?
You often hear that for nearly all cases you should use a higher level tool like Pig or Hive to solve problems.
So what does the Pig script look like for this problem?
Using Pig
subs = LOAD 'submissions.tsv' USING PigStorage() AS (user:int, song:int);
songs = GROUP subs BY song;
songs = FOREACH songs GENERATE group AS song, subs.user;
songs = FOREACH songs GENERATE song, COUNT($1.user), COUNT(Distinct($1.user));
STORE songs INTO 'playsreach.tsv';