PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Presented by PredictionIO at Big Data TechCon (Oct 17, 2013)

Published in: Technology, Education


Transcript

  • 1. Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies. Donald Szeto, Kenneth Chan, Simon Chan (support@prediction.io). Big Data TechCon - Oct 17, 2013
  • 2. Machine Learning is.... computers learning to predict from data
  • 3. putting Machine Learning into practice
  • 4. Existing data tools are like raw material
  • 5. PredictionIO 3 Product Principles
  • 6. Love Developers • By Developers, For Developers • REST APIs & SDKs • Streamlined Data Science Process with UIs
  • 7. Be Open • Open Source • GitHub - github.com/predictionio • Supported by Mozilla WebFWD
  • 8. For Production
  • 9. challenge #1 Scalability
  • 10. Big Data Bottlenecks Machine Learning Processing
  • 11. PredictionIO has a horizontally scalable architecture
  • 12. challenge #2 Productivity
  • 13. Conventional Routine ‣ Collecting Data ‣ Picking an Algorithm ‣ Writing Scalable Code ‣ Evaluating Results ‣ Tuning Algorithm Parameters ‣ Iterate . . .
  • 14. Algorithms ‣ Item Similarity ‣ Item Recommendation ‣ Collaborative Filtering (CF) ??? ‣ kNN Item-based CF ??? ‣ Threshold Item-based CF ??? ‣ Alternating Least Squares ??? ‣ SlopeOne ??? ‣ Singular Value Decomposition ???
  • 15. MapReduce - Native Java
      public class WordCount {

        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context) throws ..... {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);
          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          job.waitForCompletion(true);
        }
      }
  • 16. What else? ‣ Evaluating Results: But how? ‣ Tuning Algorithm Parameters ‣ Lambda ‣ Regularization ‣ # Iterations ‣ ...
  • 17. Better Routine with PredictionIO ‣ Collecting Data: Using PredictionIO SDKs and API ‣ Picking an Algorithm & Writing Scalable Code: Ready-to-use Algorithms ‣ Evaluating Results & Tuning Algorithm Parameters: Ready-to-use Metrics, Baseline Algorithms ‣ All of the Above Driven by a GUI (!)
  • 18. Efficiently tune algorithms as a developer. Develop intuition with how algorithms behave.
  • 19. (Very) Light Intro to kNN CF: Users and items. Predict user rating on unseen item. [Figure: ratings matrix of Items by Users 1 to 10, with the unknown rating marked "?"]
  • 20. k-Nearest Neighbor (kNN): Find k most similar items to the targeted items • Cosine / Adjusted Cosine • Pearson Correlation • Jaccard coefficient. Weight-sum the known ratings of the targeted users on these items. [Figure: ratings matrix of Items by Users 1 to 10, with the unknown rating marked "?"]
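
The similarity measures named on this slide can be sketched in a few lines. This is a minimal illustration and not PredictionIO's implementation; the function names and the choice of inputs (equal-length rating vectors for cosine, sets of user ids for Jaccard) are assumptions made for the example.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(ratings_a, ratings_b))
    norm_a = math.sqrt(sum(a * a for a in ratings_a))
    norm_b = math.sqrt(sum(b * b for b in ratings_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no ratings on one side, so no measurable similarity
    return dot / (norm_a * norm_b)


def jaccard_coefficient(users_a, users_b):
    """Jaccard coefficient between the sets of users who acted on two items."""
    set_a, set_b = set(users_a), set(users_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)
```

Jaccard only needs to know who interacted with each item, which is why it suits binary actions such as "follow"; cosine and Pearson need numeric ratings.
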
  • 21. k-Nearest Neighbor (kNN) Correlation as Item Similarity: Item 1 & 3: R1 = [3, 5, 1, 4], R3 = [3, 4, 2, 5], Corr(R1, R3) = 0.8 [Figure: ratings matrix of Items by Users 1 to 10]
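
The correlation on this slide can be reproduced with a plain Pearson correlation over the two rating vectors. The sketch below assumes R1 and R3 hold the ratings from the same four users, in the same order; the function name is illustrative.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return covariance / math.sqrt(spread_x * spread_y)


r1 = [3, 5, 1, 4]  # ratings on Item 1
r3 = [3, 4, 2, 5]  # ratings on Item 3 from the same users
print(round(pearson_correlation(r1, r3), 2))  # 0.83
```

The result, about 0.83, matches the slide's 0.8 when rounded to one decimal place.
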
  • 22. k-Nearest Neighbor (kNN) Sij: similarity score of item i and item j. Let S13 = 0.8, S35 = 0.9, predicted score of User 6 on Item 3 would be: (4 * 0.8 + 2 * 0.9) / (0.8 + 0.9) = 5 / 1.7 = 2.9 [Figure: ratings matrix of Items by Users 1 to 10, with the unknown rating marked "?"]
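
The weighted sum on this slide can be written as a short function. This is a sketch that follows the slide's arithmetic directly and assumes all similarity scores are positive; the function name and the (rating, similarity) pair layout are illustrative.

```python
def predict_rating(neighbors):
    """Similarity-weighted average of the user's known ratings.

    neighbors: list of (rating, similarity) pairs, one per similar item
    that the target user has already rated.
    """
    total_similarity = sum(similarity for _, similarity in neighbors)
    if total_similarity == 0:
        return None  # no usable neighbors, so no prediction
    weighted_sum = sum(rating * similarity for rating, similarity in neighbors)
    return weighted_sum / total_similarity


# User 6 rated Item 1 as 4 (S13 = 0.8) and Item 5 as 2 (S35 = 0.9).
print(round(predict_rating([(4, 0.8), (2, 0.9)]), 2))  # 2.94
```

The exact value is 5 / 1.7, roughly 2.94, which the slide shows as 2.9.
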
  • 23. PredictionIO In Action
  • 24. Something is missing here...
  • 25. “If you follow this startup, you may also want to follow these X, Y, Z startups.”
  • 26. How are we going to do this? Data +
  • 27. What data do we need? • Users - the followers (AngelList users) • Items - the startups • Actions - the “follow” actions
  • 28. What data do we need? For this demo... - Use AngelList API (https://angel.co/api) - Retrieve all startups (Items) belonging to the following accelerators: - Retrieve all users following these startups (Users and Actions). - 16382 users. 993 items. 204176 actions.
  • 29. What data do we need? Generate these files...
      users.csv - list of unique user ids: <user_id>
      startup_id_name_url_incubator.csv - list of startups with ids, names, AngelList URL and incubators: <startup_id>, <name>, <url>, <incubator>
      follows.csv - list of ids for startups and users who follow the startup: <startup_id>, <user_id>
  • 30. How to integrate with PredictionIO?
  • 31. How to integrate with PredictionIO?
  • 32. How to integrate with PredictionIO? Python SDK is used in this demo (other SDKs are similar)....
      import predictionio

      # PredictionIO Client Object
      client = predictionio.Client("APP_KEY")

      # Import users and items
      client.create_user("user_id")
      client.create_item("startup_id", ("item_type_1",))

      # Example
      client.create_user("5")
      client.create_item("17", ("startup", "y-combinator",))
  • 33. How to integrate with PredictionIO? Python SDK is used in this demo (other SDKs are similar)....
      # Import actions
      client.identify("user_id")
      client.record_action_on_item("action_name", "startup_id")

      # PredictionIO comes with the following built-in actions:
      # "like", "dislike", "rate", "view", "conversion"

      # Example
      client.identify("5")
      client.record_action_on_item("like", "17")  # use like to represent follow
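
Putting the last few slides together, a batch import of the three CSV files could look roughly like the sketch below. It only uses the client methods shown on the slides (`create_user`, `create_item`, `identify`, `record_action_on_item`). The function name, the column handling, and the choice to use the incubator column as a second item type are assumptions for illustration, not code from the demo repository.

```python
import csv


def import_demo_data(client, users_lines, startups_lines, follows_lines):
    """Send users, startups and follow actions to a PredictionIO-style client.

    Each *_lines argument is any iterable of CSV text lines, for example
    an open file object.
    """
    # users.csv: <user_id>
    for row in csv.reader(users_lines, skipinitialspace=True):
        if row:
            client.create_user(row[0].strip())

    # startup_id_name_url_incubator.csv: <startup_id>, <name>, <url>, <incubator>
    for row in csv.reader(startups_lines, skipinitialspace=True):
        if row:
            startup_id, incubator = row[0].strip(), row[3].strip()
            # Assumption: tag each item with "startup" plus its incubator.
            client.create_item(startup_id, ("startup", incubator))

    # follows.csv: <startup_id>, <user_id>
    for row in csv.reader(follows_lines, skipinitialspace=True):
        if row:
            startup_id, user_id = row[0].strip(), row[1].strip()
            client.identify(user_id)
            # The built-in "like" action stands in for "follow".
            client.record_action_on_item("like", startup_id)
```

With the SDK installed, `client` would be the `predictionio.Client("APP_KEY")` object from the slides and the three iterables would be the opened CSV files. Because the function only relies on those four methods, it can also be exercised against a stub object before pointing it at a real server.
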
  • 34. How to integrate with PredictionIO? Python SDK is used in this demo (other SDKs are similar)....
      # More advanced usage
      client = predictionio.Client(APP_KEY, THREADS, API_URL, qsize=REQUEST_QSIZE)

      # Asynchronous version
      client.acreate_user("user_id")
      client.acreate_item("item_id", ("item_type_1",))
      client.arecord_action_on_item("action_name", "item_id")
  • 35. How to integrate with PredictionIO?
  • 36. How to integrate with PredictionIO?
  • 37. How to integrate with PredictionIO?
  • 38. How to retrieve prediction results? Python SDK is used in this demo (other SDKs are similar)....
      # retrieve prediction results from engine
      rec = client.get_itemsim_topn("engine_name", "startup_id", num)

      # Example
      rec = client.get_itemsim_topn("my-engine", "17", 10)

      # Asynchronous version
      req = client.aget_itemsim_topn("engine_name", "startup_id", num)
      # can do something else in between.....
      rec = client.aresp(req)
  • 39. Demo Setup • Batch import data through Python SDK • Retrieve prediction results through REST API • Retrieve startups data through REST API • Store and retrieve startups data • https://github.com/PredictionIO/Demo-AngelList
  • 40. Analytics platform for mobile & web. • E-commerce across social, mobile, and web • Data analysis tool • Analytics Backend as a Service (web + mobile + internet of things)
  • 41. Production and distribution of visual content • Mobile Application Performance Monitoring • Mobile monetization • Payments for marketplaces and crowdfunding • Xbox Kinect for iPhone • Analytics Backend as a Service (web + mobile + internet of things)
  • 42. Live shows • Youtube network for Artists • Analytics Backend as a Service (web + mobile + internet of things)
  • 43. Production and distribution of visual content • Mobile Application Performance Monitoring • Analytics platform for mobile & web. • Mobile monetization • E-commerce across social, mobile, and web • Data analysis tool • Live shows • Youtube network for Artists • Payments for marketplaces and crowdfunding • Xbox Kinect for iPhone • Analytics Backend as a Service (web + mobile + internet of things)
  • 44. How to try different algorithms?
  • 45. How to try different algorithms?
  • 46. How to try different algorithms?
  • 47. How to change algorithm parameters?
  • 48. How to change algorithm parameters?
  • 49. How do we evaluate how good the prediction results are? We need metrics!
  • 50. MAP@k
  • 51. MAP@k (Mean Average Precision at k) For this user A... Retrieve Top 5 items (prediction): item4, item10, item6, item15, item7. Relevant items (actual behavior): item7, item10, item25. Total relevant items: 3
  • 52. MAP@k (Mean Average Precision at k) For this user A... Retrieve Top 5 items (prediction): item4, item10, item6, item15, item7. Relevant items (actual behavior): item7, item10, item25. Total relevant items: 3. How good is this prediction?
  • 53. MAP@k (Mean Average Precision at k) For this user A... Retrieve Top 5 items (prediction). Total relevant items: 3
      Position | Item   | Number of relevant items up to this position | P@k
      1        | item4  | 0                                            | 0
      2        | item10 | 1                                            | 1/2
      3        | item6  | 1                                            | 0
      4        | item15 | 1                                            | 0
      5        | item7  | 2                                            | 2/5
      AP@5 = (1/2 + 2/5) / min(5,3)
      MAP@5 = average of AP@5 of all users
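
The AP@k and MAP@k calculation on this slide can be sketched as follows. The function names and the dictionary layout (user id mapped to a ranked list or a set of items) are assumptions for illustration; the arithmetic follows the slide, including the min(k, number of relevant items) denominator.

```python
def average_precision_at_k(predicted, relevant, k):
    """AP@k for one user: sum precision at each hit, divide by min(k, #relevant)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position  # P@position counts only at hits
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(predictions, actuals, k):
    """MAP@k: the mean of AP@k over all users that have predictions."""
    if not predictions:
        return 0.0
    scores = [
        average_precision_at_k(ranked, actuals.get(user, ()), k)
        for user, ranked in predictions.items()
    ]
    return sum(scores) / len(scores)


predicted = ["item4", "item10", "item6", "item15", "item7"]
relevant = {"item7", "item10", "item25"}
print(round(average_precision_at_k(predicted, relevant, 5), 3))  # 0.3
```

For user A this gives (1/2 + 2/5) / 3 = 0.3, the value the slide's formula evaluates to.
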
  • 54. ISMAP@k - Extension to MAP@k for Item Similarity Engine
  • 55. MAP@k For each user.... Retrieve Top 5 items (prediction): item4, item10, item6, item15, item7. Relevant items (actual behavior): item7, item10, item25. AP@5 = (1/2 + 2/5) / min(5,3). MAP@5 = average of AP@5 of all users
  • 56. ISMAP@k For each item X... For each user who “follows” this item X.... Retrieve Top 5 similar items to X that the user may also follow (prediction): item4, item10, item6, item15, item7. Relevant items (actual behavior): item7, item10, item25. AP@5 = (1/2 + 2/5) / min(5,3). MAP@5 = average of AP@5 of all users. ISMAP@5 = average of MAP@5 of all items. We have metrics, but how to evaluate?
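
One way to read the ISMAP@k definition on this slide is sketched below. It is an interpretation and not PredictionIO's implementation: the function names, the data layout, and the decision to exclude item X itself from each follower's relevant set are all assumptions.

```python
def average_precision_at_k(predicted, relevant, k):
    """AP@k for one ranked list against one set of relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / min(k, len(relevant))


def ismap_at_k(similar_items, follows, k):
    """ISMAP@k: average over items X of the MAP@k over X's followers.

    similar_items: item id -> ranked list of items similar to it (prediction)
    follows: user id -> collection of item ids the user follows (actual behavior)
    """
    per_item_map = []
    for item, ranked in similar_items.items():
        followers = [user for user, items in follows.items() if item in items]
        if not followers:
            continue  # no user follows X, so X contributes nothing
        ap_scores = [
            # Assumption: X itself is not counted as a relevant item.
            average_precision_at_k(ranked, set(follows[user]) - {item}, k)
            for user in followers
        ]
        per_item_map.append(sum(ap_scores) / len(ap_scores))
    if not per_item_map:
        return 0.0
    return sum(per_item_map) / len(per_item_map)
```

The inner average is the slide's MAP@5 over the users who follow X, and the outer average is the slide's ISMAP@5 over all items.
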
  • 57. Offline Evaluation
  • 58. Offline Evaluation: What does it mean by “relevant”? What’s the prediction goal? Split Data into a Training set (seen) and a Test set (treated like unseen). Training uses the Training set; Retrieve prediction gives the Top 5 items (prediction): item4, item10, item6, item15, item7. Get relevant items from the Test set (Tested with unseen data) gives the Relevant items (actual behavior): item7, item10, item25.
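
The split described on this slide can be sketched as a random hold-out over the recorded actions. The slides do not say how PredictionIO performs the split, so the random strategy, the 20% test fraction, the fixed seed, and the function names below are all assumptions.

```python
import random


def split_actions(actions, test_fraction=0.2, seed=42):
    """Randomly split (user_id, item_id) actions into training and test lists."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(actions)
    rng.shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


def relevant_items_by_user(test_actions):
    """Group held-out actions by user: these are the 'relevant' items."""
    relevant = {}
    for user_id, item_id in test_actions:
        relevant.setdefault(user_id, set()).add(item_id)
    return relevant
```

The training list is what the engine is allowed to see; predictions are then scored against `relevant_items_by_user(test)` with a metric such as MAP@k.
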
  • 59. How to set the goal?
  • 60. How to set the goal?
  • 61. How do we evaluate how good the prediction results are?
  • 62. How do we evaluate how good the prediction results are?
  • 63. Sample evaluation scores: Default-Algo 0.0335, random 0.0039
  • 64. Predict User Preference ‣ Smarter Ride-Sharing Apps ‣ Personalized Meal Recommendation ‣ Intelligent Online Course Discovery ‣ App Discovery ‣ and many more...
  • 65. Getting Started! - @PredictionIO prediction.io github.com/predictionio