Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
REMINDER
Check in on the COLLABORATE
mobile app
Building Recommendation
Platforms with Hadoop
Prepared by:
Jayant Shekhar
...
Agenda
■ Why Big Data Recommendation Platform?
■ Common Recommendation Patterns & Algorithms
■ Lambda Architecture
■ Archi...
Recommendations is one of the commonly used use cases of Hadoop
Recommendations can be
Recommendations Broader Use Cases
•...
• Web
• Mobile
• Email
• Postal mail
• Newspaper/Magazine
ads
Recommendations are delivered through Data Sets Involved
• I...
Common Recommendation
Patterns & associated
Algorithms
Some ML Algorithms used for Recommendations
• Collaborative Filtering
• Clustering
• Classification &
Regression
• Pattern...
Product Recommendation
Use Cases
• Recommend Product
• Recommend Movies/Videos
Algorithms
• Collaborative Filtering
• Logi...
Related Searches or Query Recommendations
Design
• Use Query Log Data
• Cluster similar queries
• Use Parallel FP Growth t...
Social/People Recommendations
Use Case
• Recommend Missing Links in a
Social Network
• Bipartite Matching –
Recommend Men/...
Lambda Architecture
Lambda Architecture
Stream Processing
Realtime View
New Data Stream
All Data
Pre-compute
Views
Batch View Batch View
Query...
Lambda Architecture
Large Scale Offline Batch
+
Real-time Online Streaming
■ Batch Layer : offline, asynchronous
■ Serving...
Computation & Serving Layers
for Recommendations
Oryx Lambda Architecture
Closer View of Oryx Serving & Model Generation
HDFS
Serving
Layer
Serving
Layer
Serving
Layer
A
P
I
Generation 0 Generatio...
Feature Generation & Model Building
HDFS
Data Data
Data Data
Data Data
Feature Generation
Model Model
Model Model
Model Mo...
Requirements for ML on Hadoop
■ Model Building
▪ Large Scale Distributed
▪ Continuous
■ Model Serving
▪ Real-time query
▪ ...
Computation Layer Vs Serving Layer
■ Computation Layer
▪ Periodically builds generation from recent data and past model
▪ ...
Collaborative Filtering : ALS
■ Alternating Least Squares
■ Matrix Factorization
■ Faster than SVD
■ Real-time update
■ Pa...
Clustering : K-means++
■ Well-known and understood
■ Parallelizable
■ Clusters Updateable
■ Obtains an initial set of cent...
Classification/Regression : RDF
■ Random Decision Forests
■ Ensemble Method
■ Numerical, Categorical features and target
■...
Social Recommendation with
Giraph
Graph Use Cases
• Social Recommendations
• Recommend missing links in a social network
• Twitter Graph
• Who to follow
• S...
Giraph
■ Each vertex has an id, a value, a list of its adjacent
neighbour ids and the corresponding edge values
■ Edges ar...
Giraph BSP
Giraph BSP
■ Input is a directed graph
■ Each vertex is invoked in each superstep, can
recompute its value and send messag...
ML Algorithms with Graph Processing
■ Collaborative Filtering
■ Clustering
■ Gradient Descent : Linear Regression, Logisti...
Matrix factorization
M = U X
V
ALS : fix one side and solve for the other
Representing Matrix by Graphs
3 - 8
- 9 5
5 - -
3
1
2
1
2
3
Row Column
3
8
9
55
• every vertex holds a row vector
Recommendations with Solr
Lucene Inverted Index
Term Documents
framework 1[1x]
for 1[1x] , 5[1x]
job 1[1x]
data 2[1x] , 4[1x]
… ...
and 3[1x], 4[1x]...
Recommendation Approaches in Solr
■ Attribute-based
■ Textual Similarity-based
■ More-like-this
■ Collaborative Filtering
Attribute-based Recommendations
■ Example: Match User Attributes to Item Attribute Fields
/solr/select/?q=(grouptitle:”big...
Textual Similarity-based Recommendations
■ Solr’s MoreLikeThis Search Component.
■ Extracts important keywords from one or...
Content Recommendation
■ Even a single keyword can be enough to begin making meaningful
recommendations.
■ Filtering or bo...
Behavior Based Recommendation Approaches
Collaborative Filtering : Uses who likes these also liked…
■ Step 1: Find similar...
Cloudera Search Architecture
HDFS
Online Streaming Data
End User Client App
(e.g. Hue)
Flume
Raw, filtered, or
annotated d...
Storm & Recommendations
Real-time Architecture using Storm & Hadoop
Key/Value StoreStorm
Incoming Data
Hadoop
Query
Real-time and Storm
■ The query layer queries the real-time and batch and merges
the result
■ Some algorithms are hard to ...
Storm/Track Realtime Events
■ Real-time streaming analytics/stats on consumer viewing behavior
and digital content trends....
Training of Models
A/B Testing
Offline Training & Testing of Models
Use Cases
• Recommend Missing Links in a Social Network
• How are users connected
• C...
A/B Testing
Traffic
New Model
Old Model
X%
(100-X)%
A/B testing is used to test the performance of the Models online
A/B t...
Please complete the session
evaluation on the mobile app
We appreciate your feedback and insight
Trends, Aggregations & Counters
• Most Popular Searches/Downloads/News Articles/Movies/Products
• Load results into HBase
...
Building Recommendation Platforms with Hadoop
Upcoming SlideShare
Loading in …5
×

Building Recommendation Platforms with Hadoop

1,977 views

Published on

  • Hey guys! Who wants to chat with me? More photos with me here 👉 http://www.bit.ly/katekoxx
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Building Recommendation Platforms with Hadoop

  1. 1. REMINDER Check in on the COLLABORATE mobile app Building Recommendation Platforms with Hadoop Prepared by: Jayant Shekhar Sr. Solutions Architect Cloudera
  2. 2. Agenda ■ Why Big Data Recommendation Platform? ■ Common Recommendation Patterns & Algorithms ■ Lambda Architecture ■ Architecture & Design of Computation & Serving Layers ■ Social Recommendations with Giraph ■ Recommendations with Solr ■ Recommendation with Storm/HBase
  3. 3. Recommendations is one of the commonly used use cases of Hadoop Recommendations can be Recommendations Broader Use Cases • Product Recommendation • People/Social Recommendation • Merchant Recommendation • Content Recommendation • Query Recommendation • Sponsored Search Advertising • Realtime • News Recommendations • Merchant/Offer Recommendations on mobile • Offline • Similar Profiles/Resumes • In Between
  4. 4. • Web • Mobile • Email • Postal mail • Newspaper/Magazine ads Recommendations are delivered through Data Sets Involved • Items/Products/Content • Transaction Data • User Data • Logs & User Activity • Additional 3rd Party Data • Geo • Social • Reviews • … Different Time to Action Targeted • User would view the content now/Buy the Product Now • User would buy the product in a week • Next when he/she goes grocery shopping • User would buy the product in the next 3 months • TV/Dishwasher etc. • Vacation Also need to be able to determine/differentiate between the users in a household
  5. 5. Common Recommendation Patterns & associated Algorithms
  6. 6. Some ML Algorithms used for Recommendations • Collaborative Filtering • Clustering • Classification & Regression • Pattern Mining Collaborative Filtering Clustering • ALS • SVD • Slope One Recommender • K-means • Canopy • Fuzzy K-Means • Parallel FP-Growth • Logistic Regression • Naïve Bayes • Random Forest Classification & Regression Pattern Mining CLASS
  7. 7. Product Recommendation Use Cases • Recommend Product • Recommend Movies/Videos Algorithms • Collaborative Filtering • Logistic Regression Frequently bought/viewed together Use Cases • Find items that are frequently bought together • Related Searches/Query Suggestion • View Item Page Algorithms • Parallel FP- Growth://infolab.stanford.edu/~ echang/recsys08-69.pdf
  8. 8. Related Searches or Query Recommendations Design • Use Query Log Data • Cluster similar queries • Use Parallel FP Growth to find the related searches Query Distance • Based on keywords or phrases • Based on searches in the same session • Based on common clicked URLs • Based on the distance of the clicked documents Related Articles/News • Batch clustering with K-Means • NRT clustering using the centroids • Perform canopy on left over articles
  9. 9. Social/People Recommendations Use Case • Recommend Missing Links in a Social Network • Bipartite Matching – Recommend Men/Women Design • Take existing edges and friend of friends • Build Regression Models based on latest activity • Scale easily offline with Hadoop as number of friends of friends and activities could be very high. • Giraph
  10. 10. Lambda Architecture
  11. 11. Lambda Architecture Stream Processing Realtime View New Data Stream All Data Pre-compute Views Batch View Batch View Query Lambda architecture proposed by Nathan Marz, creator of Storm
  12. 12. Lambda Architecture Large Scale Offline Batch + Real-time Online Streaming ■ Batch Layer : offline, asynchronous ■ Serving Layer : real-time, incremental, approximate
  13. 13. Computation & Serving Layers for Recommendations
  14. 14. Oryx Lambda Architecture
  15. 15. Closer View of Oryx Serving & Model Generation HDFS Serving Layer Serving Layer Serving Layer A P I Generation 0 Generation 1 Generation 2 Computation Layer Generation directory contains: • Input data • Configuration • Model Generation 3
  16. 16. Feature Generation & Model Building HDFS Data Data Data Data Data Data Feature Generation Model Model Model Model Model Model Model Generation Hadoop enables easy iteration over the process of Model Generation and testing it out offline.
  17. 17. Requirements for ML on Hadoop ■ Model Building ▪ Large Scale Distributed ▪ Continuous ■ Model Serving ▪ Real-time query ▪ Real-time updates ■ Algorithms ▪ Parallelizable ▪ Updateable ■ Interoperable ▪ PMML model format ▪ Simple REST API ▪ Open Source
  18. 18. Computation Layer Vs Serving Layer ■ Computation Layer ▪ Periodically builds generation from recent data and past model ▪ Baby sits MR job ▪ Publishes Model ■ Serving Layer ▪ Consumes Model ▪ Serves queries from model in memory ▪ Updates the model from new input ▪ Also writes input to HDFS ▪ Replicas for scale
  19. 19. Collaborative Filtering : ALS ■ Alternating Least Squares ■ Matrix Factorization ■ Faster than SVD ■ Real-time update ■ Parallelizable
  20. 20. Clustering : K-means++ ■ Well-known and understood ■ Parallelizable ■ Clusters Updateable ■ Obtains an initial set of centers that is close to the optimum solution.
  21. 21. Classification/Regression : RDF ■ Random Decision Forests ■ Ensemble Method ■ Numerical, Categorical features and target ■ Very Parallel ■ Nodes Updateable
  22. 22. Social Recommendation with Giraph
  23. 23. Graph Use Cases • Social Recommendations • Recommend missing links in a social network • Twitter Graph • Who to follow • Similar To • Bipartite Matching • Matching job/employees, men/women • How are users connected • Clustering – find related people in groups
  24. 24. Giraph ■ Each vertex has an id, a value, a list of its adjacent neighbour ids and the corresponding edge values ■ Edges are always directed ▪ Out-edges attached to a node ▪ Nodes can’t see inbound edges ■ Nodes communicate via messages ■ No remote reads
  25. 25. Giraph BSP
  26. 26. Giraph BSP ■ Input is a directed graph ■ Each vertex is invoked in each superstep, can recompute its value and send messages to other vertices, which are delivered over superstep barriers ■ This is done till every Vertex votes to halt ■ Output is a directed graph
  27. 27. ML Algorithms with Graph Processing ■ Collaborative Filtering ■ Clustering ■ Gradient Descent : Linear Regression, Logistic Regression
  28. 28. Matrix factorization M = U X V ALS : fix one side and solve for the other
  29. 29. Representing Matrix by Graphs 3 - 8 - 9 5 5 - - 3 1 2 1 2 3 Row Column 3 8 9 55 • every vertex holds a row vector
  30. 30. Recommendations with Solr
  31. 31. Lucene Inverted Index Term Documents framework 1[1x] for 1[1x] , 5[1x] job 1[1x] data 2[1x] , 4[1x] … ... and 3[1x], 4[1x] wide 5[1x] variety 5[2x] … … Document Content Field 1 framework for job scheduling 2 data warehouse infrastructure and 3 fast and general compute engine 4 data serialization system and 5 wide variety of companies … … Input Documents Index
  32. 32. Recommendation Approaches in Solr ■ Attribute-based ■ Textual Similarity-based ■ More-like-this ■ Collaborative Filtering
  33. 33. Attribute-based Recommendations ■ Example: Match User Attributes to Item Attribute Fields /solr/select/?q=(grouptitle:”big data”^25 OR grouptitle:(java)^10) AND ((city:”Las Vegas” AND state:”NV”)^15 OR state:”NV”)”
  34. 34. Textual Similarity-based Recommendations ■ Solr’s MoreLikeThis Search Component. ■ Extracts important keywords from one or more documents and uses them in search. ■ This results in secondary search results which demonstrate textual similarity to the original document ■ http://wiki.apache.org/solr/MoreLikeThis
  35. 35. Content Recommendation ■ Even a single keyword can be enough to begin making meaningful recommendations. ■ Filtering or boosting results based upon geographical area or distance can help greatly for certain use cases: ▪ Jobs/Resumes, Events, Restaurants ■ /solr/select/?q=(Standard Recommendation Query) AND _val_:”(recip(geodist(location, 40.7142, 74.0064),1,1,0))”
  36. 36. Behavior Based Recommendation Approaches Collaborative Filtering : Uses who likes these also liked… ■ Step 1: Find similar users who like the same documents q=documentid: (“doc1” OR “doc4”) ■ Step 2: Search for docs “liked” by those similar users /solr/select/?q=userlikes: (“user5”^2 OR “user4”^2 OR “user1”^1)
  37. 37. Cloudera Search Architecture HDFS Online Streaming Data End User Client App (e.g. Hue) Flume Raw, filtered, or annotated data SolrCloud Cluster(s) NRT Data indexed w/ Morphlines Indexed data MapReduce Batch Indexing w/ Morphlines GoLive updates HBase Cluster NRT Replication Events indexed w/ Morphlines OLTP Data ClouderaManager Search queries
  38. 38. Storm & Recommendations
  39. 39. Real-time Architecture using Storm & Hadoop Key/Value StoreStorm Incoming Data Hadoop Query
  40. 40. Real-time and Storm ■ The query layer queries the real-time and batch and merges the result ■ Some algorithms are hard to implement in real time. For those cases we could estimate the results. ■ The model is generated offline on Hadoop and deployed into Storm. ■ Online learning algorithms can be used in Storm. They learn continuously through streaming training data. ■ Storm can also be used for scoring.
  41. 41. Storm/Track Realtime Events ■ Real-time streaming analytics/stats on consumer viewing behavior and digital content trends. ■ Track impressions, clicks, conversions, bid requests etc. in real time. Push per minute aggregations to HBase. ■ Most Popular Searches/Downloads/News Articles/Movies/Products
  42. 42. Training of Models A/B Testing
  43. 43. Offline Training & Testing of Models Use Cases • Recommend Missing Links in a Social Network • How are users connected • Clustering – find related people in groups • Iterative Graph Ranking Hadoop provides an excellent platform to train and test out the Models and various Algorithms Model Train Test Training Set Test Set Score
  44. 44. A/B Testing Traffic New Model Old Model X% (100-X)% A/B testing is used to test the performance of the Models online A/B testing involves: • Partitioning real traffic to two models and then measuring the performance to the desired result (maximize CTR, revenue, page views etc.). • The partitioning logic can get complicated. In such cases they can be pre-computed on Hadoop offline and pushed to an online store.
  45. 45. Please complete the session evaluation on the mobile app We appreciate your feedback and insight
  46. 46. Trends, Aggregations & Counters • Most Popular Searches/Downloads/News Articles/Movies/Products • Load results into HBase • Use HBase where we need NRT count of things (categories/products etc.) • Impala is very useful here for faster SLAs HBase Counters • Has concept of Incrementing column values • Avoids lock row/read value/increment it/write it back/unlock rows • Great for counting specific metrics over time • Example - count per URL/Product • Can disable write to WAL on puts

×