Recommendations with Hadoop Streaming and Python

Presentation Transcript

  • Recommendations with Python and Hadoop Streaming
    Andrew Look, Senior Engineer, Shopzilla
  • Getting started
    ● Slides
      ○ http://bit.ly/J7vmx7
    ● Python/NumPy installed
      ○ http://bit.ly/JWNWbq
    ● Sample code
      ○ http://aws-hadoop.s3.amazonaws.com/similarity.zip
  • Outline
    ● Problem
    ● Recommendation basics
    ● MapReduce review and conventions
    ● Python + Hadoop Streaming basics
    ● MapReduce jobs (data, code, data-flow)
    ● Recommendation algorithm
  • Problem - Music Recommendations
    ● We want to recommend similar artists
    ● We have data from Last.fm
    ● Which Last.fm users liked which artists?
    ● How can we decide which artists are similar?
    Example artists: Toby Keith, Tupac, De La Soul, Garth Brooks
  • Solution - Find Artist Similarities
    ● We'll follow along with a tutorial from AWS
    ● By Data Wrangling blogger/AWS developer Peter Skomoroch
    ● Uses publicly available data from Last.fm
    ● A user's rating of an artist is the number of plays
  • Solution - Find Artist Similarities
    ● We can look at co-ratings
    ● One user played artist A's songs X times
    ● The same user played artist B's songs Y times
    co-rating = ((A, X), (B, Y))
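The co-rating definition above can be sketched in a few lines of Python. This is only an illustration: the function name `co_ratings` and the dictionary input are assumptions, not part of the talk's sample code.

```python
from itertools import combinations


def co_ratings(user_ratings):
    """Yield ((artist_a, plays_a), (artist_b, plays_b)) for one user.

    `user_ratings` is a hypothetical dict mapping artist -> play count.
    """
    # Sorting first means each unordered artist pair appears exactly once,
    # always in the same order.
    for first, second in combinations(sorted(user_ratings.items()), 2):
        yield first, second


# One user who played three artists yields three co-ratings.
pairs = list(co_ratings({"A": 5, "B": 2, "C": 7}))
```

A user with n rated artists contributes n * (n - 1) / 2 co-ratings, which is why heavy users are expensive to process.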
  • Recommendation Basics
    ● User based
      ○ Given a user, recommend the artists favored by users with similar artist preferences
    ● Item based
      ○ Given an item (artist), recommend the artists most commonly favored by users who also liked the input artist
  • Recommendation Basics
    ● Types of data
      ○ Explicit - user rates a movie on Netflix
      ○ Implicit - user watches a YouTube video
    ● Types of ratings
      ○ Multivalued - bounded, e.g. star rating (1-5)
      ○ Multivalued - unbounded, e.g. number of plays (>0)
      ○ Binary - did a user play a movie or not?
  • Last.fm Recommendations
    ● Data was implicitly collected (as users play songs)
    ● Transform binary data (did the user listen to the artist?) ...
    ● ... into multivalued data (how many times?)
    ● We'll use item-based recommendations
  • Mapper Input
  • Map Output - Reduce Input
  • Chaining MapReduce Jobs
  • Distributed Cache
  • Python Shell and Hadoop Streaming
    Streaming API requires shell commands
    ● Mapper
    ● Reducer
  • Python Shell and Hadoop Streaming
    Streaming API requires shell commands
    ● Mapper
    ● Reducer
    For the mapper / reducer commands, the Streaming API will
    ● Partition the input
    ● Distribute across mappers and reducers
  • Python Shell and Hadoop Streaming
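As a rough sketch of what "shell commands" means here: a streaming mapper or reducer is any program that reads lines on standard input and writes tab-separated key/value lines on standard output. The step names, the `run` dispatcher, and the key layout below are hypothetical; they only mirror the `./similarity.py <step>` call style used later in the slides.

```python
import io


def mapper(stdin, stdout):
    """Read raw 'user artist plays' records, write 'key<TAB>value' lines."""
    for line in stdin:
        user_id, artist_id, plays = line.split()
        stdout.write(f"{user_id}\t{artist_id},{plays}\n")


def reducer(stdin, stdout):
    """Receive lines already sorted by key; this placeholder passes them through."""
    for line in stdin:
        stdout.write(line)


# Hypothetical step table: one script, with the step chosen by name.
STEPS = {"mapper": mapper, "reducer": reducer}


def run(step_name, stdin, stdout):
    """Dispatch to the step named on the command line."""
    STEPS[step_name](stdin, stdout)


# In-memory streams stand in for the pipes Hadoop Streaming would provide.
out = io.StringIO()
run("mapper", io.StringIO("1000020 1001820 20\n"), out)
```

In a real script the last lines would instead call `run(sys.argv[1], sys.stdin, sys.stdout)`, so the same file can serve as both the mapper and the reducer command.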
  • Full Recommendation Job Overview
  • Example - Working Data Set
    ○ Inspect your working data set ...
    ○ Each row is one "rating"
    ○ Each "number of plays" is the "rating value"
    Code: cat input/sample_user_artist_data.txt | head
  • Example - Working Data Set
    User ID   Artist ID   Number of Plays
    1000020   1001820     20
    1000020   1003557     1
    1000021   700         1
    1000029   1001819     1
    1000036   1001820     34
    1000036   1011819     2
    1000036   700         2
    1000040   1001820     1
    1000057   1011819     37
    1000060   700         17
  • Mapper 1 - Count Ratings per Artist
    ○ Prepend LongValueSum:<artist ID>
    ○ More on this later
    ○ Use a value of "1"
    Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper1
  • Mapper 1 - Count Ratings per Artist
    Artist ID              Number of Ratings
    LongValueSum:1001820   1
    LongValueSum:1003557   1
    LongValueSum:700       1
    LongValueSum:1001819   1
    LongValueSum:1001820   1
    LongValueSum:1011819   1
    LongValueSum:700       1
    LongValueSum:1001820   1
    LongValueSum:1011819   1
    LongValueSum:700       1
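A minimal sketch of a mapper producing output of this shape, assuming whitespace-separated input rows as in the working data set. The function body is an illustration, not the code from `similarity.py`.

```python
def mapper1(lines):
    """Emit 'LongValueSum:<artist ID><TAB>1' for every rating row."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed rows rather than fail the whole job
        user_id, artist_id, plays = fields
        # The play count is ignored here: each row counts as one rating.
        yield f"LongValueSum:{artist_id}\t1"


sample = ["1000020 1001820 20", "1000020 1003557 1", "1000021 700 1"]
emitted = list(mapper1(sample))
```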
  • Mapper 1 - Count Ratings per Artist
    ○ We use the sort command locally
    ○ We sort by artist ID
    ○ This emulates the shuffle/sort phase in Hadoop
    Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort
  • Mapper 1 - Count Ratings per Artist
    Artist ID              Number of Ratings
    LongValueSum:1001820   1
    LongValueSum:1001820   1
    LongValueSum:1001820   1
    LongValueSum:1003557   1
    LongValueSum:1011819   1
    LongValueSum:1011819   1
    LongValueSum:1011819   1
    LongValueSum:700       1
    LongValueSum:700       1
    LongValueSum:700       1
  • Reducer 1 - Count Ratings by Artist
    ○ LongValueSum tells the aggregate reducer to
      ○ Group by artist ID
      ○ Sum up the 1s
      ○ Emit artist ID as key, count(ratings) as value
    Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort | ./similarity.py reducer1 > input/artist_playcounts.txt
  • Reducer 1 - Count Ratings by Artist
    Artist ID   Number of Ratings
    1000143     1905
    1000418     184
    1001820     12950
    700         7243
    1003557     2976
    1011819     7601
    1012511     1881
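Locally, the grouping and summing that the aggregate reducer performs for LongValueSum keys can be emulated along the following lines. This is a sketch: `aggregate_sum` is a hypothetical name, and it assumes the input is already sorted by key, as it would be after `sort` or Hadoop's shuffle.

```python
from itertools import groupby

PREFIX = "LongValueSum:"


def aggregate_sum(sorted_lines):
    """Sum the values of each LongValueSum key and drop the prefix."""
    def key_of(line):
        return line.split("\t", 1)[0]

    # groupby only merges adjacent equal keys, hence the sorted input.
    for key, group in groupby(sorted_lines, key=key_of):
        total = sum(int(line.split("\t", 1)[1]) for line in group)
        artist_id = key[len(PREFIX):] if key.startswith(PREFIX) else key
        yield f"{artist_id}\t{total}"


mapped = ["LongValueSum:1001820\t1", "LongValueSum:700\t1", "LongValueSum:1001820\t1"]
counts = list(aggregate_sum(sorted(mapped)))
```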
  • Mapper 2 - User Artist Preferences
    ○ Mapper2 outputs (user ID, artist ID) as the key
    ○ Mapper2 outputs the rating (number of plays) as the value
    Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper2 int
  • Mapper 2 - User Artist Preferences
    User ID, Artist ID   Number of Plays
    1000020,1001820      20
    1000020,1003557      1
    1000021,700          1
    1000029,1011819      1
    1000036,1001820      34
    1000036,1011819      2
    1000036,700          2
    1000040,1001820      1
    1000057,1011819      37
    1000060,700          17
  • Mapper 2 - User Artist Preferences
    ○ Can large counts skew our results?
    ○ Apply a log function to dampen outliers.
    Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort
  • Mapper 2 - Logarithmic Smoothing
    User ID, Artist ID   Smoothing   Smoothed Count
    1000020,1001820      log(20)     3
    1000020,1003557      log(1)      1
    1000021,700          log(1)      1
    1000029,1011819      log(1)      1
    1000036,1001820      log(34)     4
    1000036,1011819      log(2)      1
    1000036,700          log(2)      1
    1000040,1001820      log(1)      1
    1000057,1011819      log(37)     4
    1000060,700          log(17)     3
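The smoothed counts in the table are consistent with taking the integer part of the natural log and adding one. That formula is inferred from the table rather than taken from the slides, so treat the sketch below as one plausible reading; `smooth` is a hypothetical helper name.

```python
import math


def smooth(plays):
    """Dampen large play counts: integer part of ln(plays), plus one.

    Assumption: this reproduces the table values, but the exact
    formula used by similarity.py is not shown in the slides.
    """
    return int(math.log(plays)) + 1


# Play counts taken from the working data set.
smoothed = {plays: smooth(plays) for plays in (1, 2, 17, 20, 34, 37)}
```

With this scaling a user who played an artist 37 times counts four times as much as a single play, not 37 times as much.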
  • Reducer 2 - Aggregate User Prefs
    ○ Reduce for each user
    ○ Key - user ID
    ○ Value is complex
      ○ Count(ratings)
      ○ Sum(rating values)
      ○ Space-delimited list of artist ID, rating value pairs
    Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort | ./similarity.py reducer2
  • Reducer 2 - Aggregated User Prefs
    User ID   Count | Sum | Artist ID,Rating list
    1000020   2 | 4 | 1001820,3 1003557,1
    1000021   1 | 1 | 700,1
    1000029   1 | 1 | 1011819,1
    1000036   3 | 6 | 1001820,4 1011819,1 700,1
    1000040   1 | 1 | 1001820,1
    1000057   1 | 4 | 1011819,4
    1000060   1 | 3 | 700,3
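A possible shape for this reducer, assuming input lines of the form `user,artist<TAB>rating` sorted by key, and a `|` delimiter without surrounding spaces in the value. Both the delimiter and the function body are assumptions made for illustration.

```python
from itertools import groupby


def reducer2(sorted_lines):
    """Collapse all of a user's ratings into one 'count|sum|items' record."""
    def user_of(line):
        return line.split(",", 1)[0]

    for user_id, group in groupby(sorted_lines, key=user_of):
        prefs = []
        for line in group:
            key, rating = line.rstrip("\n").split("\t")
            artist_id = key.split(",", 1)[1]
            prefs.append((artist_id, int(rating)))
        count = len(prefs)
        total = sum(rating for _, rating in prefs)
        items = " ".join(f"{artist},{rating}" for artist, rating in prefs)
        yield f"{user_id}\t{count}|{total}|{items}"


rows = list(reducer2([
    "1000020,1001820\t3",
    "1000020,1003557\t1",
    "1000021,700\t1",
]))
```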
  • Mapper 3 - User Co-Ratings
    ○ Mapper3 culls users via a cutoff
    ○ Drop the user ID, emit pairwise co-ratings
    Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort | ./similarity.py reducer2 | ./similarity.py mapper3 100 input/artist_playcounts.txt | sort
  • Mapper 3 - User Co-Ratings
    Artist ID: X, Y     Rating: X, Y
    1000143 1003577     2 3
    1000143 1011819     2 3
    1001820 700         1 2
    1001820 700         1 3
    1011819 700         3 2
    1011819 700         3 3
    1011819 700         4 2
    1011819 700         4 2
    1011819 700         5 5
    1012511 700         1 1
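The pairwise emission can be sketched as follows. Two assumptions are made loudly here: the cutoff is read as a maximum number of ratings per user, and the artist play-count file passed on the command line is ignored because the slides do not say how it is used. The record format follows the hypothetical `count|sum|items` layout.

```python
from itertools import combinations


def mapper3(lines, cutoff):
    """Emit 'artist_x artist_y<TAB>rating_x rating_y' for each user's pairs."""
    for line in lines:
        user_id, value = line.rstrip("\n").split("\t")
        count, total, items = value.split("|")
        if int(count) > cutoff:
            continue  # assumed meaning of the cutoff: skip very heavy users
        prefs = [item.split(",") for item in items.split()]
        # The user ID is dropped; only the artist pair and ratings survive.
        for (a, rating_a), (b, rating_b) in combinations(sorted(prefs), 2):
            yield f"{a} {b}\t{rating_a} {rating_b}"


record = "1000036\t3|6|1001820,4 1011819,1 700,1"
co_rated = list(mapper3([record], cutoff=100))
```

Users with a single rating produce no output, since they have no artist pairs.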
  • Reducer 3 - Artist Similarities
    ○ Given the number of artists, computes similarities
    ○ Each pair of artists is emitted with its similarity
    Code: cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort | ./similarity.py reducer2 | ./similarity.py mapper3 100 input/artist_playcounts.txt | sort | ./similarity.py reducer3 147160 > artist_similarities.txt
  • Reducer 3 - Artist Similarities
    Artist ID   Similarity        Artist ID   Co-Ratings
    1003557     0.121659425105    1012511     360
    1012511     0.121659425105    1003557     360
    1003557     0.0197107349416   700         212
    700         0.0197107349416   1003557     212
    1011819     0.0128808637553   1012511     259
    1012511     0.0128808637553   1011819     259
    1011819     0.297222927702    700         3050
    700         0.297222927702    1011819     3050
    1012511     0.0426446192482   700         270
    700         0.0426446192482   1012511     270
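One way a reducer can compute a per-pair similarity in a single pass is to keep running sums over the co-ratings and apply the sum form of the Pearson coefficient at the end. The sketch below assumes Pearson is the measure and ignores the numeric argument passed to `reducer3`, since the slides do not show how it enters the calculation.

```python
import math
from itertools import groupby


def pearson_from_sums(n, sx, sy, sxx, syy, sxy):
    """Pearson correlation from running sums; 0.0 when it is undefined."""
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return numerator / denominator if denominator else 0.0


def reducer3(sorted_lines):
    """Yield (artist_x, artist_y, similarity, number_of_co_ratings)."""
    def pair_of(line):
        return line.split("\t", 1)[0]

    for pair, group in groupby(sorted_lines, key=pair_of):
        n = sx = sy = sxx = syy = sxy = 0
        for line in group:
            x, y = map(int, line.rstrip("\n").split("\t")[1].split())
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
        artist_x, artist_y = pair.split()
        yield artist_x, artist_y, pearson_from_sums(n, sx, sy, sxx, syy, sxy), n


rows = list(reducer3([
    "A B\t1 2", "A B\t2 4", "A B\t3 6",   # perfectly correlated ratings
    "A C\t1 3", "A C\t2 2", "A C\t3 1",   # perfectly anti-correlated ratings
]))
```

Keeping only six running sums per pair means the reducer never has to hold all co-ratings for a popular pair in memory.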
  • Mapper 4 - Sort by Artist Correlation
    ○ Emit artist ID and similarity concatenated
    ○ Sort by similarity = recommendation
    Code: cat artist_similarities.txt | ./similarity.py mapper4 20 | sort
  • Mapper 4 - Sort by Artist Correlation
    Artist X-ID, Similarity   Artist Y-ID   Num Co-Ratings
    1012511,0.924219271937    1000143       237
    1012511,0.945653412649    1001820       468
    1012511,0.957355380752    700           270
    1012511,0.961454917198    1000418       50
    1012511,0.987119136245    1011819       259
    700,0.702777072298        1011819       3050
    700,0.898811337303        1001820       2250
    700,0.95212801312         1000143       114
    700,0.957355380752        1012511       270
    700,0.980289265058        1003557       212
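Comparing this table with the Reducer 3 output, the emitted value appears to be 1 minus the similarity (for example 0.297222927702 becomes 0.702777072298), so that an ascending sort lists the most similar artists first. The sketch below follows that reading; treating the argument `20` as a minimum number of co-ratings is an assumption.

```python
def mapper4(lines, min_coratings):
    """Emit 'artist_x,(1 - similarity)<TAB>artist_y<TAB>co_ratings'."""
    for line in lines:
        artist_x, similarity, artist_y, co_ratings = line.split()
        if int(co_ratings) < min_coratings:
            continue  # assumed meaning of the numeric argument
        distance = 1.0 - float(similarity)
        # Fixed-width formatting keeps a plain text sort in numeric order.
        yield f"{artist_x},{distance:.12f}\t{artist_y}\t{co_ratings}"


similarities = [
    "700 0.0426446192482 1012511 270",
    "700 0.297222927702 1011819 3050",
    "700 0.5 42 3",
]
ranked = sorted(mapper4(similarities, min_coratings=20))
```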
  • Reducer 4 - Cosmetic Results
    ○ Reducer attaches artist names
    Code: cat artist_similarities.txt | ./similarity.py mapper4 20 | sort | ./similarity.py reducer4 3 lastfm/artist_data.txt > related_artists.tsv
  • Reducer 4 - Cosmetic Results
    Artist ID   Related Artist ID   Similarity   Number of Co-Ratings   Artist Name
    1000143     1000143             1            0                      Toby Keith
    1000143     1003557             0.2434       809                    Garth Brooks
    1000143     1000418             0.1068       120                    Mark Chestnutt
    1000143     1012511             0.0758       237                    Kenny Rogers
    1000418     1000418             1            0                      Mark Chestnutt
    1000418     1000143             0.1068       120                    Toby Keith
    1000418     1003557             0.056        114                    Garth Brooks
    1000418     1012511             0.0385       50                     Kenny Rogers
  • Pearson Similarity - Visualization
    covariance(A, B) = 2.44
    covariance(C, D) = -2.36
  • Pearson Similarity - Equation
    pearson(x, y) = covariance(x, y) / (stddev(x) * stddev(y))
    pearson(A, B) = 0.772
    pearson(C, D) = -0.746
  • Pearson Similarity - Summary
    ○ Pearson similarity normalizes covariance
    ○ Measures linear dependence between two variables
    ○ Normalized ... -1 <= pearson(x, y) <= 1 (for any x, y)
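For reference, the equation from the slide can be written out over n co-ratings (x_i, y_i). The expanded form is the standard definition; the symbols x_i, y_i, n and the bar notation for means are introduced here and do not appear in the slides.

```latex
\[
\operatorname{pearson}(x, y)
  = \frac{\operatorname{covariance}(x, y)}
         {\operatorname{stddev}(x)\,\operatorname{stddev}(y)}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```

The normalizing factors of the covariance and the two standard deviations cancel, which is why the result stays between -1 and 1 regardless of the rating scale.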
  • Questions?
  • Appendix
    ● Hadoop Streaming
      ○ http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
    ● Explanation of LongValueSum
      ○ http://stackoverflow.com/questions/1946953/availiable-reducers-in-elastic-mapreduce
    ● Pearson Correlation
      ○ http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
    ● Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming
      ○ http://aws.amazon.com/articles/2294
  • Appendix
    ● Anscombe's Quartet
      ○ http://en.wikipedia.org/wiki/Anscombes_quartet
    ● Tau Coefficient
      ○ http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient
    ● Jaccard Index
      ○ http://en.wikipedia.org/wiki/Jaccard_index
    ● Quality of Recommendations
      ○ http://en.wikipedia.org/wiki/Mean_squared_error