
26 Trillion App Recommendations using 100 Lines of Spark Code - Ayman Farahat

Published in: Data & Analytics

  1. 26 Trillion App Recommendations using 100 Lines of Spark Code - Ayman Farahat
  2. Overview
     ● Motivation
     ● Spark Implementation
       ○ Collaborative Filtering
       ○ Data Frames
       ○ BLAS-3
     ● Results and lessons learnt
  3. Motivation
     ● App discovery is a challenging problem due to the exponential growth in the number of apps
     ● Over 1.5 million apps are available through the two marketplaces (iTunes and the Google Play store)
     ● Develop an app recommendation engine using various user behavior signals
       ○ Explicit signal (app rating)
       ○ Implicit signal (frequency/duration of app usage)
  4. Flurry Data and Summary
     ● Data available through the Flurry SDK is rich in both coverage and depth
     ● Collected session lengths for apps used on the iOS platform between Sept 1-15, 2015
     ● Restricted analysis to apps used by 100 or more users
       ○ ~496 million users
       ○ ~53,793 apps
  5. Data Summary
     ● User count: 496,508,312
     ● App count: 153,773
     ● Apps with 100+ users: 53,793
     ● Train time: 52 minutes
     ● Predict time: 8 minutes
  6. Our Approach
     ● Utilize a collaborative-filtering-based app recommendation
     ● Run collaborative filtering at scale to generate:
       ○ Low-dimensional user features
       ○ Low-dimensional app features
       ○ User x app ratings for all possible combinations (26.7 trillion)
     ● Used the Spark framework to train and recommend efficiently
  7. Collaborative Filtering Model
     ● Projects the users and apps (in our case) into a lower-dimensional space
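The slide does not spell out the factorization itself; as a hedged illustration of the idea (not the talk's Spark code), a minimal alternating-least-squares sketch in plain NumPy, with toy matrix sizes and a made-up regularization value, might look like:

```python
import numpy as np

def als_factorize(R, rank=2, reg=0.01, iters=50, seed=0):
    """Factor a fully observed ratings matrix R (users x apps) into
    U @ V.T by alternating ridge-regression solves. Toy sketch only."""
    rng = np.random.default_rng(seed)
    n_users, n_apps = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_apps, rank))
    I = reg * np.eye(rank)
    for _ in range(iters):
        # Fix V, solve for all user factors at once
        U = R @ V @ np.linalg.inv(V.T @ V + I)
        # Fix U, solve for all app factors at once
        V = R.T @ U @ np.linalg.inv(U.T @ U + I)
    return U, V

# Tiny synthetic rank-2 ratings matrix is recovered closely
true_U = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
true_V = np.array([[2.0, 0.0], [0.0, 2.0]])
R = true_U @ true_V.T
U, V = als_factorize(R)
print(np.max(np.abs(U @ V.T - R)))  # small reconstruction error
```

The learned rows of U and V are the low-dimensional user and app features the deck refers to; scoring any user-app pair is then a dot product.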
  8. Model Fitting and Parameter Optimization
     ● Used out-of-sample prediction accuracy on users with 20+ apps
     ● The MSE was minimized with the number of factors fixed at 60
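The slide states only the conclusion (60 factors minimize held-out MSE), not the search. One hedged way to sketch that out-of-sample rank selection locally (the truncated SVD stands in for the factor model, and the data sizes, holdout fraction, and candidate ranks are illustrative, not from the talk) is:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic low-rank "ratings" with noise (sizes are made up)
U_true = rng.normal(size=(60, 4))
V_true = rng.normal(size=(40, 4))
R = U_true @ V_true.T + 0.1 * rng.normal(size=(60, 40))

# Hold out ~20% of entries for out-of-sample evaluation
test_mask = rng.random(R.shape) < 0.2
# Mean-impute held-out cells so the stand-in model never sees them
R_train = np.where(test_mask, R[~test_mask].mean(), R)

def mse_at_rank(rank):
    # Truncated SVD as a stand-in for fitting the factor model at `rank`
    U, s, Vt = np.linalg.svd(R_train, full_matrices=False)
    R_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return np.mean((R_hat[test_mask] - R[test_mask]) ** 2)

errors = {r: mse_at_rank(r) for r in (1, 2, 4, 8, 16)}
best_rank = min(errors, key=errors.get)
```

The same loop over candidate ranks, with ALS in place of the SVD stand-in, is what the MSE-versus-factors comparison on the slide implies.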
  9. Data Frames
     ● Join operations can greatly benefit from caching
     ● A plain RDD join would be: cleandata = allapps.join(cleanapps)
     ● Instead, filter out apps that have fewer than 100 users, then do a replicated (broadcast) join in Spark:

        # only keep the apps that had 100 or more users
        cleanapps = myapps.filter(lambda x: x[1] >= MAXAPPS).map(lambda x: int(x[0]))
        # persist the set of popular app IDs on every executor
        apps = sc.broadcast(set(cleanapps.collect()))
        # filter the full data set: this simulates a replicated join
        cleandata = allapps.filter(lambda x: x[1] in apps.value)
  10. Data Frames
      ● In Spark you can use a DataFrame directly:

        Record = Row("userId", "iuserId", "appId", "value")
        MAXAPPS = 100
        # transform allapps to a DataFrame
        allappsdf = allapps.map(lambda x: Record(*x)).toDF()
        # register the DataFrame and issue SQL queries
        sqlContext.registerDataFrameAsTable(allappsdf, "table1")
        # group by appId to get each app's average rating and user count
        df2 = sqlContext.sql("SELECT appId as appId2, avg(value), count(*) as c2 from table1 group by appId")
        topappsdf = df2.filter(df2.c2 >= MAXAPPS)
        # DataFrame join
        cleandata = allappsdf.join(topappsdf, allappsdf.appId == topappsdf.appId2)
  11. BLAS 3
      ● The number of possible user x app combinations is very large
      ● Default prediction uses predictAll:
        predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
      ● Prediction is simply the matrix multiplication of user "i" and app "j" features
      ● It never completes, and most of the time is spent on reshuffling
      ● The users are not partitioned, so they can be on all nodes
      ● The apps are not partitioned, so they can be on all nodes
      ● Reshuffling is extremely slow
  12. BLAS 3
      ● The key observation is that the number of apps << the number of users
      ● Exploit the low number of apps to optimize the prediction time
  13. BLAS 3
      ● The app feature matrix, being smaller, can be stored in primary memory (BLAS-3)
      ● We broadcast the app features to all executors, which reduces the overall reshuffling of data
      ● Use the highly optimized BLAS-3 matrix multiplication available within NumPy
  14. BLAS 3
      ● Basic Linear Algebra Subprograms, for solving problems of the form D = a*(A*B) + c*C
      ● Highly optimized for matrix multiplication
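As a small illustration (the matrices and scalars below are made up), the GEMM form on this slide maps directly onto NumPy, whose matrix multiply dispatches to an optimized BLAS under the hood:

```python
import numpy as np

# GEMM computes D = a*(A @ B) + c*C in one optimized BLAS-3 call
a, c = 2.0, 0.5
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
C = np.ones((2, 2))

# NumPy's @ operator calls the BLAS gemm routine for float64 matrices
D = a * (A @ B) + c * C
print(D)  # [[ 38.5  44.5]
          #  [ 86.5 100.5]]
```

Because the whole product is computed in one library call over contiguous memory, it is far faster than any element-by-element loop, which is exactly what the deck leans on for the 26.7 trillion scores.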
  15. BLAS 3

        import numpy
        from numpy import *
        myModel = MatrixFactorizationModel.load(sc, "BingBong")
        m1 = myModel.productFeatures()
        m2 = m1.map(lambda (product, feature): feature).collect()
        m3 = matrix(m2).transpose()
        # broadcast the (small) app-feature matrix to all executors
        pf = sc.broadcast(m3)
        uf = myModel.userFeatures().coalesce(100)
        # get predictions for all users
        f1 = uf.map(lambda (userID, features): (userID, squeeze(asarray(matrix(array(features)) * pf.value))))
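Each per-user row of scores produced by the map above still has to be reduced to a short recommendation list. A hedged local sketch of that step in plain NumPy (the feature values, rank, and k are made up; the deck does not show this part) could be:

```python
import numpy as np

rank, n_apps, k = 3, 6, 2
rng = np.random.default_rng(0)

# Stand-ins for one user's factor vector and the broadcast
# app-feature matrix (rank x n_apps), as in pf.value above
user_features = rng.normal(size=rank)
app_features = rng.normal(size=(rank, n_apps))

# BLAS-backed score of this user against every app at once
scores = user_features @ app_features

# argpartition finds the k largest scores without a full sort
top_k = np.argpartition(scores, -k)[-k:]
top_k = top_k[np.argsort(scores[top_k])[::-1]]  # order best-first
```

Doing the partial selection per user keeps only k app IDs per user instead of all 53,793 scores, which matters when the full score matrix is 26.7 trillion entries.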
  16. Evaluation: Predicted Score
  17. Predicted Score: Positive
  18. Predicted Score: Negative
  19. Evaluation of Recommendation
      ● Identify users with high (low) scores
      ● Design of experiment:
        ○ High score x Recommendation
        ○ High score x Placebo
        ○ Low score x Recommendation
        ○ Low score x Placebo
  20. Future Work
      ● Spark econometrics library (std. errors, robust std. errors, ...)
      ● Online experiments to measure the value of recommendations
      ● Experiments with various implicit ratings:
        ○ number of sessions
        ○ days used
        ○ log of days used
