#AnalyticsStreet @joe_Caserta 
Building a 
Recommendation 
Engine on Spark 
Joe Caserta 
President, Caserta Concepts 
joe@casertaconcepts.com 
(914) 261-3648 
@joe_Caserta
About Caserta Concepts 
• Technology services company with expertise in data analysis: 
• Big Data Solutions 
• Data Warehousing 
• Business Intelligence 
• Data Science & Analytics 
• Data on the Cloud 
• Data Interaction & Visualization 
• Core focus in the following industries: 
• eCommerce / Retail / Marketing 
• Financial Services / Insurance 
• Healthcare / Ad Tech / Higher Ed 
• Established in 2001: 
• Increased growth year-over-year 
• Industry-recognized workforce 
• Strategy, Implementation 
• Writing, Education, Mentoring 
Why Big Data? 
Figure. Architecture diagram with two sides. 
Traditional: Enrollments, Claims, and Finance sources; ETL; Traditional EDW; Ad-Hoc Query; Traditional BI; Canned Reporting. 
Big Data Cluster (horizontally scalable environment, optimized for analytics): Hadoop Distributed File System (HDFS) on nodes N1–N5; Spark, MapReduce, Pig/Hive, others; NoSQL databases; ETL; Ad-Hoc/Canned Reporting; Big Data Analytics; Data Science.
What is Spark 
• Spark is a fast, general-purpose cluster computing framework. 
• Can run on top of Hadoop (YARN and HDFS) 
• Up to 100 times faster than MapReduce for in-memory workloads 
• In-memory cluster computing – well suited for machine learning 
• Provides high-level APIs in Java, Scala and Python. Tools include: 
• Spark SQL 
• MLlib 
• GraphX 
• Spark Streaming 
Data Science Training: https://exploredatascience.com/ 
Project Objective 
• Create a functional recommendation engine to surface relevant 
product recommendations to customers. 
• Improve Customer Experience 
• Increase Customer Retention 
• Increase Customer Purchase Activity 
• Establish Hadoop with Spark as a high performance, scalable solution 
for computing and storage 
• Accurately suggest relevant products to customers based on their peers' 
behavior 
• Integrate existing EDW data with Hadoop natively using an 
enterprise-class ETL tool 
• Implement an enterprise class business intelligence tool sourcing 
directly from Hadoop 
Hadoop Environment 
• Lab Setup 
• 10 node cluster - Cloudera 
• 1 TB under management with inexpensive commodity hardware 
• ETL – Talend 
• Load data from Enterprise Data Warehouse into Hadoop 
• Efficacy Reporting - Datameer 
• Recommendation Engine Built and Tested 
• Recommendations are as good as or better than anticipated 
• More relevant than possible without Big Data solution 
• Algorithms can easily be fine-tuned by adjusting: 
• The number of recommendations in the results 
• The weighting of the relevancy of the Product 
The Math Behind Relevance 
• Finding ‘Similar’ Objects 
• Cosine Similarity 
• Value of cos θ varies between: 
• -1 [θ = 180°, absolutely dissimilar – opposite-ended vectors/relationship] 
• 0 [θ = 90°, dissimilar – perpendicular vectors/relationship] 
• +1 [θ = 0°, absolutely similar – overlapping vectors/relationship] 
Figure. Vectors A & B 
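The three boundary values above can be checked with a small sketch. It assumes the standard definition cos θ = (A · B) / (‖A‖ ‖B‖); the function name and the 2-D sample vectors are made up for illustration and are not from the project.

```python
import math

def cosine_similarity(a, b):
    """Return cos(theta) between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Illustrative vectors (hypothetical), one per boundary case.
opposite = cosine_similarity([1, 0], [-1, 0])      # theta = 180 degrees
perpendicular = cosine_similarity([1, 0], [0, 1])  # theta = 90 degrees
overlapping = cosine_similarity([2, 2], [1, 1])    # theta = 0 degrees
```

Note that vector length does not matter: [2, 2] and [1, 1] point the same way, so they count as fully similar.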
Recommendations 
• Your customers expect them 
• Good recommendations make life easier 
• Help them find information, products, and services they might not have 
thought of 
• What makes a good recommendation? 
• Relevant but not obvious 
• Sense of “surprise” 
Figure. Example: a 23” LED TV is sold. Candidate recommendations shown: 23”, 24”, and 25” LED TVs versus Blu-Ray, Home Theater, and HDMI Cables. 
Where do we use recommendations? 
• Recommendations are used across a wide variety of industries: 
• Travel 
• Financial Services 
• Music/Online radio 
• TV and Video 
• Online Publications 
• Retail 
…and countless others 
Our Example: Movies 
Our Goal 
• Create a powerful, scalable recommendation engine with minimal 
development 
• Make recommendations to users as they are browsing movie titles - 
instantaneously 
• Recommendation must have context to the movie they are currently 
viewing. 
OOPS! – too much surprise! 
How do we do it? 
Hadoop – distributed file system and processing platform 
Spark – low-latency computing 
MLlib – Library of Machine Learning Algorithms 
We leverage two algorithms: 
• Content-Based Filtering – how similar is this particular movie to other 
movies, based on usage. 
• Collaborative Filtering – predict an individual's preference based on their 
peers' ratings. Spark MLlib implements a collaborative filtering algorithm 
called Alternating Least Squares (ALS). 
• Both algorithms require only a simple dataset of 3 fields: 
“User ID”, “Item ID”, “Rating” 
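To give a feel for the "alternating" part of ALS, here is a toy sketch over (User ID, Item ID, Rating) tuples. It is not MLlib's implementation: it is a single-machine illustration that learns one factor per user and one per item, and the function name, the sample ratings, and the `reg` value are all assumptions made for this example.

```python
def als_rank1(ratings, n_iters=20, reg=0.01):
    """Fit rating ~= user_factor[u] * item_factor[i] by alternating updates."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    user_factor = {u: 1.0 for u in users}
    item_factor = {i: 1.0 for i in items}
    for _ in range(n_iters):
        # Hold item factors fixed; solve a 1-D regularized least-squares fit per user.
        for u in users:
            num = sum(r * item_factor[i] for uu, i, r in ratings if uu == u)
            den = reg + sum(item_factor[i] ** 2 for uu, i, _ in ratings if uu == u)
            user_factor[u] = num / den
        # Hold user factors fixed; solve the same kind of fit per item.
        for i in items:
            num = sum(r * user_factor[u] for u, ii, r in ratings if ii == i)
            den = reg + sum(user_factor[u] ** 2 for u, ii, _ in ratings if ii == i)
            item_factor[i] = num / den
    return user_factor, item_factor

# Hypothetical ratings: user 2 rates everything twice as high as user 1.
ratings = [(1, 10, 1.0), (1, 20, 2.0), (2, 10, 2.0), (2, 20, 4.0), (2, 30, 5.0)]
user_factor, item_factor = als_rank1(ratings)

# User 1 never rated item 30; the model fills the gap from the peer's rating.
predicted = user_factor[1] * item_factor[30]
```

The prediction for the unseen (user 1, item 30) pair lands close to half of user 2's rating, which is the pattern the rest of the data suggests.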
Content-Based Filtering 
“People who liked this movie liked these as well” 
• The content-based filter builds a matrix of items to other items and 
calculates similarity (based on user ratings) 
• The most similar items are then output as a list: 
• Item ID, Similar Item ID, Similarity Score 
• Items with the highest score are most similar 
• In this example, users who liked “Twelve Monkeys” (7) also liked “Fargo” (100) 
7 100 0.690951001800917 
7 50 0.653299445638532 
7 117 0.643701303640083 
At the moment, content-based filtering is not available for 
Spark in MLlib. On our project, we used Mahout. 
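A minimal sketch of how an item-to-item list like the one above could be produced from (User ID, Item ID, Rating) tuples. It assumes cosine similarity over each item's user ratings; the similarity measure used in the Mahout job is not stated, and the function name and sample ratings are hypothetical.

```python
import math
from collections import defaultdict

def item_similarities(ratings):
    """Return (item_id, similar_item_id, score) rows, best matches first."""
    # One sparse vector per item: {user_id: rating}. Ratings are assumed > 0.
    vectors = defaultdict(dict)
    for user_id, item_id, rating in ratings:
        vectors[item_id][user_id] = rating

    rows = []
    for a in vectors:
        norm_a = math.sqrt(sum(r * r for r in vectors[a].values()))
        for b in vectors:
            if a == b:
                continue
            # Only users who rated both items contribute to the dot product.
            shared = vectors[a].keys() & vectors[b].keys()
            dot = sum(vectors[a][u] * vectors[b][u] for u in shared)
            norm_b = math.sqrt(sum(r * r for r in vectors[b].values()))
            rows.append((a, b, dot / (norm_a * norm_b)))
    # Group by item, then highest similarity first within each item.
    rows.sort(key=lambda row: (row[0], -row[2]))
    return rows

# Hypothetical ratings: users 1 and 2 like items 7 and 100; user 3 prefers 50.
ratings = [
    (1, 7, 5), (1, 100, 5), (1, 50, 1),
    (2, 7, 4), (2, 100, 5), (2, 50, 2),
    (3, 7, 1), (3, 50, 5),
]
rows = item_similarities(ratings)
similar_to_7 = [row for row in rows if row[0] == 7]
```

Each row has the same shape as the list above: Item ID, Similar Item ID, Similarity Score.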
Collaborative Filtering 
“People with similar taste to you liked these movies” 
• Collaborative filtering applies weights based on “peer” user preference. 
• Essentially it determines the best movie critics for you to follow 
• The items with the highest recommendation score are then output as tuples 
• User ID [Item ID1:Score,…., Item IDn:Score] 
• Items with the highest recommendation score are the most relevant to this user 
• For user “Johny Sisklebert” (572), the two most highly recommended movies are 
“Seven” and “Donnie Brasco” 
572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515] 
573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019] 
574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0] 
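The tuple output above has to be flattened into one row per user and item before it can be loaded into a lookup table. A minimal parsing sketch, assuming the exact text layout shown above (`User ID [Item ID:Score,...]`); the function name is hypothetical.

```python
def parse_recommendations(line):
    """Split 'user [item:score,...]' into (user_id, item_id, score) rows."""
    user_part, items_part = line.split(" ", 1)
    user_id = int(user_part)
    pairs = items_part.strip().strip("[]").split(",")
    rows = []
    for pair in pairs:
        item_id, score = pair.split(":")
        rows.append((user_id, int(item_id), float(score)))
    # Highest recommendation score first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows

# First four entries of the line for user 572 shown above.
line = "572 [11:5.0,293:4.70718,8:4.688335,273:4.687676]"
rows = parse_recommendations(line)
```

Sorting after parsing avoids relying on the input order of the item list.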
Recommendation Store 
• Serving recommendations needs to be instantaneous 
• The core of this solution is two reference tables: 
Rec_Item_Similarity: Item_ID, Similar_Item, Similarity_Score 
Rec_User_Item_Base: User_ID, Item_ID, Recommendation_Score 
• When called to make recommendations we query our store 
• Rec_Item_Similarity based on the Item_ID they are viewing 
• Rec_User_Item_Base based on their User_ID 
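The two lookups can be sketched with an in-memory SQLite database. SQLite is only a stand-in chosen for this example (the deck does not name the store it used); the table and column names come from the slide, while the index names, the `recommend` helper, and the subset of sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Rec_Item_Similarity (
        Item_ID INTEGER, Similar_Item INTEGER, Similarity_Score REAL
    );
    CREATE TABLE Rec_User_Item_Base (
        User_ID INTEGER, Item_ID INTEGER, Recommendation_Score REAL
    );
    -- Index the lookup keys so each request is a keyed read.
    CREATE INDEX idx_item ON Rec_Item_Similarity (Item_ID);
    CREATE INDEX idx_user ON Rec_User_Item_Base (User_ID);
""")
conn.executemany(
    "INSERT INTO Rec_Item_Similarity VALUES (?, ?, ?)",
    [(7, 100, 0.690951001800917), (7, 50, 0.653299445638532),
     (7, 117, 0.643701303640083)],
)
conn.executemany(
    "INSERT INTO Rec_User_Item_Base VALUES (?, ?, ?)",
    [(572, 11, 5.0), (572, 293, 4.70718), (572, 8, 4.688335)],
)

def recommend(user_id, item_id, limit=10):
    """Query both tables: by the item being viewed and by the user."""
    similar = conn.execute(
        "SELECT Similar_Item, Similarity_Score FROM Rec_Item_Similarity "
        "WHERE Item_ID = ? ORDER BY Similarity_Score DESC LIMIT ?",
        (item_id, limit)).fetchall()
    peer = conn.execute(
        "SELECT Item_ID, Recommendation_Score FROM Rec_User_Item_Base "
        "WHERE User_ID = ? ORDER BY Recommendation_Score DESC LIMIT ?",
        (user_id, limit)).fetchall()
    return similar, peer

similar, peer = recommend(user_id=572, item_id=7)
```

Because both tables are precomputed, serving a request involves no model scoring, only two keyed reads.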
Delivering Recommendations 
So if Johny is viewing “12 Monkeys” we query our recommendation store 
and present the results 
Item-Based: 
Item Similarity              Raw Score   Score 
Fargo                        0.691       1.000 
Star Wars                    0.653       0.946 
Rock, The                    0.644       0.932 
Pulp Fiction                 0.628       0.909 
Return of the Jedi           0.627       0.908 
Independence Day             0.618       0.894 
Willy Wonka                  0.603       0.872 
Mission: Impossible          0.597       0.864 
Silence of the Lambs, The    0.596       0.863 
Star Trek: First Contact     0.594       0.859 
Raiders of the Lost Ark      0.584       0.845 
Terminator, The              0.574       0.831 
Blade Runner                 0.571       0.826 
Usual Suspects, The          0.569       0.823 
Seven (Se7en)                0.569       0.823 

Peers like these movies: 
Item-Base (Peer)             Raw Score   Score 
Seven                        5.000       1.000 
Donnie Brasco                4.707       0.941 
Babe                         4.688       0.938 
Heat                         4.688       0.938 
To Kill a Mockingbird        4.686       0.937 
Jaws                         4.683       0.937 
Monty Python, Holy Grail     4.670       0.934 
Blade Runner                 4.670       0.934 
Get Shorty                   4.655       0.931 

Best Recommendations: 
Top 10 Recommendations       Score 
Seven (Se7en)                1.823 
Blade Runner                 1.760 
Fargo                        1.000 
Star Wars                    0.946 
Donnie Brasco                0.941 
Babe                         0.938 
Heat                         0.938 
To Kill a Mockingbird        0.937 
Jaws                         0.937 
Monty Python, Holy Grail     0.934
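The numbers above are consistent with dividing each raw score by the highest raw score in its list and then summing the two normalized scores per movie. That reading is inferred from the figures, not stated in the deck, so the sketch below is an assumption; the function names are hypothetical and "Seven" is used as one key for both "Seven" and "Seven (Se7en)".

```python
def normalize(raw_scores):
    """Scale scores so the best item in the list gets 1.0."""
    top = max(raw_scores.values())
    return {item: score / top for item, score in raw_scores.items()}

def combine(*score_lists, limit=10):
    """Sum normalized scores per item and return the best `limit` items."""
    totals = {}
    for scores in score_lists:
        for item, score in normalize(scores).items():
            totals[item] = totals.get(item, 0.0) + score
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]

# A subset of the raw scores from the two lists above.
item_similarity = {"Fargo": 0.691, "Star Wars": 0.653,
                   "Blade Runner": 0.571, "Seven": 0.569}
peer_based = {"Seven": 5.000, "Donnie Brasco": 4.707,
              "Babe": 4.688, "Blade Runner": 4.670}

top = combine(item_similarity, peer_based, limit=3)
```

Movies that appear in both lists ("Seven", "Blade Runner") collect two scores and rise to the top, ahead of the best single-list result ("Fargo").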
From Good to Great Recommendations 
• Note that the first 5 recommendations look pretty good 
…but the 6th result would have been “Babe” the children's movie 
• Tuning the algorithms might help: parameter changes, similarity measures. 
• How else can we make it better? 
1. Delivery filters 
2. Introduce additional algorithms such as K-Means 
OOPS!
Additional Algorithm – K-Means 
“These movies are similar based on their attributes” 
• Treats items as coordinates 
• Places a number of random 
“centroids” and assigns the nearest 
items 
• Moves the centroids around based on 
average location 
• Process repeats until the assignments 
stop changing 
We would use the major attributes of the Movie to create coordinate points. 
• Categories 
• Actors 
• Director 
• Synopsis Text 
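The loop described above (assign items to the nearest centroid, move centroids to the average location, repeat until assignments stop changing) can be sketched in plain Python. This is an illustration only: it starts from k randomly chosen data points rather than arbitrary random positions, and the function names, the seed, and the 2-D sample points are assumptions. Real movie attributes would first need to be encoded as numeric coordinates.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (assignments, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical coordinates: two items near the origin, two far away.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignments, centroids = kmeans(points, k=2)
```

The cluster numbers themselves are arbitrary; what matters is which items end up sharing a cluster.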
Delivery Scoring and Filters 
Apply assumptions to control the results of collaborative filtering 
• One or more categories must match 
• Only children's movies will be recommended for children's movies. 
Movie                  Action  Adventure  Children's  Comedy  Crime  Drama  Film-Noir  Horror  Romance  Sci-Fi  Thriller 
Twelve Monkeys         0       0          0           0       0      1      0          0       0        1       0 
Babe                   0       0          1           1       0      1      0          0       0        0       0 
Seven (Se7en)          0       0          0           0       1      1      0          0       0        0       1 
Star Wars              1       1          0           0       0      0      0          0       1        1       0 
Blade Runner           0       0          0           0       0      0      1          0       0        1       0 
Fargo                  0       0          0           0       1      1      0          0       0        0       1 
Willy Wonka            0       1          1           1       0      0      0          0       0        0       0 
Monty Python           0       0          0           1       0      0      0          0       0        0       0 
Jaws                   1       0          0           0       0      0      0          1       0        0       0 
Heat                   1       0          0           0       1      0      0          0       0        0       1 
Donnie Brasco          0       0          0           0       1      1      0          0       0        0       0 
To Kill a Mockingbird  0       0          0           0       0      1      0          0       0        0       0 
Similar logic could be applied to promote more favorable options 
• New Releases 
• Retail Case: Items that are on sale or overstocked 
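The two delivery rules can be sketched as a filter over candidate titles. The genre sets below are read from the category matrix above (a subset of the movies); the function names are hypothetical, and the rules are implemented literally as worded, which is one possible reading of them.

```python
GENRES = {
    "Twelve Monkeys": {"Drama", "Sci-Fi"},
    "Babe": {"Children's", "Comedy", "Drama"},
    "Seven (Se7en)": {"Crime", "Drama", "Thriller"},
    "Star Wars": {"Action", "Adventure", "Romance", "Sci-Fi"},
    "Willy Wonka": {"Adventure", "Children's", "Comedy"},
    "Monty Python": {"Comedy"},
    "Jaws": {"Action", "Horror"},
}

def passes_filters(viewing, candidate, genres=GENRES):
    viewed, cand = genres[viewing], genres[candidate]
    # Rule 1: one or more categories must match.
    if not viewed & cand:
        return False
    # Rule 2: a children's movie only gets children's recommendations.
    if "Children's" in viewed and "Children's" not in cand:
        return False
    return True

def filter_recommendations(viewing, candidates, genres=GENRES):
    """Keep candidates, in their original order, that pass both rules."""
    return [c for c in candidates
            if c != viewing and passes_filters(viewing, c, genres)]

for_twelve_monkeys = filter_recommendations(
    "Twelve Monkeys", ["Seven (Se7en)", "Jaws", "Star Wars"])
for_willy_wonka = filter_recommendations(
    "Willy Wonka", ["Babe", "Monty Python", "Star Wars"])
```

The filter keeps the ranking produced upstream and only removes candidates, so it can be applied at delivery time without recomputing scores.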
Integrating K-Means into the process 
Movies recommended by more than 1 algorithm are the most highly rated 
Figure. Three overlapping result sets (Collaborative Filter, Content Filter, K-Means: Similar); the overlap is labeled Best Recommendations.
Sophisticated Recommendation Model 
• Peer Based: What are people with similar characteristics buying? 
• Item Clustering: What did people who ordered these items also order? 
• Corporate Deals/Offers: What items are we promoting at time of sale? 
• Customer Behavior: What items have you bought in the past? 
• Market/Store Recommendation: What items are being promoted by the Store or Market? 
The solution allows balancing of algorithms to attain the most effective recommendation.
Summary 
• Hadoop and Spark can provide a relatively low-cost and extremely 
scalable platform for recommendations 
• Spark with MLlib offers a great library of established Machine 
Learning algorithms, reducing development effort 
• A good recommendation system combines Collaborative and Content 
filtering algorithms and custom business rules 
• While Spark matures, Mahout or roll-your-own algorithms may still be 
needed. 
Thank You 
Joe Caserta 
President, Caserta Concepts 
joe@casertaconcepts.com 
(914) 261-3648 
@joe_Caserta 

 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
Caserta
 

How to Build a Recommendation Engine on Spark

  • 1. #AnalyticsStreet @joe_Caserta Building a Recommendation Engine on Spark Joe Caserta President, Caserta Concepts joe@casertaconcepts.com (914) 261-3648 @joe_Caserta
  • 2. About Caserta Concepts
      • Technology services company with expertise in data analysis:
          • Big Data Solutions
          • Data Warehousing
          • Business Intelligence
          • Data Science & Analytics
          • Data on the Cloud
          • Data Interaction & Visualization
      • Core focus in the following industries:
          • eCommerce / Retail / Marketing
          • Financial Services / Insurance
          • Healthcare / Ad Tech / Higher Ed
      • Established in 2001:
          • Increased growth year-over-year
          • Industry recognized work force
          • Strategy, Implementation
          • Writing, Education, Mentoring
  • 3. Why Big Data?
      • [Architecture diagram. Labels: Enrollments, Claims, Finance; ETL; Traditional EDW; Traditional BI (Ad-Hoc Query, Canned Reporting); Big Data Cluster: Horizontally Scalable Environment - Optimized for Analytics; Spark, MapReduce, Pig/Hive, Others…; Hadoop Distributed File System (HDFS) on nodes N1 to N5; NoSQL Databases; Big Data Analytics; Data Science; Ad-Hoc/Canned Reporting]
  • 4. What is Spark
      • Spark is a fast, general-purpose cluster computing framework.
      • Sits on top of Hadoop
      • Up to 100 times faster than MapReduce
      • In-memory cluster computing, well suited for machine learning
      • Provides high-level APIs in Java, Scala and Python. Tools include:
          • Spark SQL
          • MLlib
          • GraphX
          • Spark Streaming
      • Data Science Training: https://exploredatascience.com/
  • 5. Project Objective
      • Create a functional recommendation engine to surface relevant product recommendations to customers.
          • Improve Customer Experience
          • Increase Customer Retention
          • Increase Customer Purchase Activity
      • Establish Hadoop with Spark as a high performance, scalable solution for computing and storage
      • Accurately suggest relevant products to customers based on their peers' behavior
      • Integrate existing EDW data with Hadoop natively using an enterprise class ETL tool
      • Implement an enterprise class business intelligence tool sourcing directly from Hadoop
  • 6. Hadoop Environment
      • Lab Setup
          • 10 node cluster - Cloudera
          • 1 TB under management with inexpensive commodity hardware
          • ETL – Talend
              • Load data from Enterprise Data Warehouse into Hadoop
          • Efficacy Reporting - Datameer
      • Recommendation Engine Built and Tested
          • Recommendations are as good as or better than anticipated
          • More relevant than possible without a Big Data solution
          • Algorithms can easily be fine-tuned by adjusting:
              • The number of recommendations in the results
              • The weighting of the relevancy of the Product
  • 7. The Math Behind Relevance
      • Finding ‘Similar’ Objects: Cosine Similarity
          • cos θ = (A · B) / (‖A‖ ‖B‖)
          • Figure. Vectors A & B
      • Value of cos θ varies between:
          • -1 [θ = 180°, Absolutely dissimilar: opposite-ended vectors/relationship]
          • 0 [θ = 90°, Dissimilar: perpendicular vectors/relationship]
          • +1 [θ = 0°, Absolutely similar: overlapping vectors/relationship]
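The three reference values on this slide can be checked with a few lines of plain Python. This is a minimal sketch rather than the code used on the project; the function name `cosine_similarity` and the sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Return cos(theta) between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical vectors chosen to hit the three cases listed on the slide.
overlapping = cosine_similarity([1, 2], [2, 4])      # theta = 0 degrees
perpendicular = cosine_similarity([1, 0], [0, 3])    # theta = 90 degrees
opposite = cosine_similarity([1, 2], [-1, -2])       # theta = 180 degrees
```

Because only the angle matters, a user who rates everything twice as high as another user still yields a similarity of +1.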
  • 8. Recommendations
      • Your customers expect them
      • Good recommendations make life easier
      • Help them find information, products, and services they might not have thought of
      • What makes a good recommendation?
          • Relevant but not obvious
          • Sense of “surprise”
      • [Illustration labels: 23” LED TV, 24” LED TV, 25” LED TV; SOLD!! 23” LED TV; Blu-Ray, Home Theater, HDMI Cables]
  • 9. Where do we use recommendations?
      • Applications can be found in a wide variety of industries:
          • Travel
          • Financial Services
          • Music/Online radio
          • TV and Video
          • Online Publications
          • Retail
          • ...and countless others
      • Our Example: Movies
  • 10. Our Goal
      • Create a powerful, scalable recommendation engine with minimal development
      • Make recommendations to users instantaneously as they are browsing movie titles
      • Recommendations must fit the context of the movie the user is currently viewing.
      • [Image callout: OOPS! Too much surprise!]
  • 11. How do we do it?
      • Hadoop – distributed file system and processing platform
      • Spark – low-latency computing
      • MLlib – Library of Machine Learning Algorithms
      • We leverage two algorithms:
          • Content-Based Filtering – how similar is this particular movie to other movies, based on usage.
          • Collaborative Filtering – predict an individual's preference based on their peers' ratings. Spark MLlib implements a collaborative filtering algorithm called Alternating Least Squares (ALS)
      • Both algorithms only require a simple dataset of 3 fields: “User ID”, “Item ID”, “Rating”
  • 12. Content-Based Filtering
      • “People who liked this movie liked these as well”
      • The Content-Based Filter builds a matrix of items to other items and calculates similarity (based on user rating)
      • The most similar items are then output as a list:
          • Item ID, Similar Item ID, Similarity Score
      • Items with the highest score are most similar
      • In this example, users who liked “Twelve Monkeys” (7) also liked “Fargo” (100):
          7 100 0.690951001800917
          7 50 0.653299445638532
          7 117 0.643701303640083
      • At the moment, content-based filtering is not available for Spark in MLlib. On our project, we used Mahout.
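The item-to-item similarity list described on this slide can be sketched in plain Python from the three-field dataset (User ID, Item ID, Rating). This is only an illustration under assumptions: it uses cosine similarity over each item's rating vector, the function name `item_similarities` and the sample ratings are made up, and it is not the Mahout job used on the project.

```python
import math
from collections import defaultdict


def item_similarities(ratings):
    """ratings: iterable of (user_id, item_id, rating) tuples.

    Returns (item_id, similar_item_id, score) rows, grouped by item_id
    with the most similar item first.
    """
    # One sparse rating vector per item: item_id -> {user_id: rating}
    vectors = defaultdict(dict)
    for user_id, item_id, rating in ratings:
        vectors[item_id][user_id] = rating

    norms = {
        item: math.sqrt(sum(r * r for r in vec.values()))
        for item, vec in vectors.items()
    }

    rows = []
    for a, vec_a in vectors.items():
        for b, vec_b in vectors.items():
            if a == b:
                continue
            # Dot product over users who rated both items.
            dot = sum(r * vec_b[u] for u, r in vec_a.items() if u in vec_b)
            if dot == 0:
                continue  # no overlap between the two items
            rows.append((a, b, dot / (norms[a] * norms[b])))

    rows.sort(key=lambda row: (row[0], -row[2]))
    return rows


# Hypothetical ratings; item IDs 7, 50 and 100 echo the slide, the values do not.
sample_ratings = [
    (1, 7, 5), (1, 100, 5), (1, 50, 1),
    (2, 7, 4), (2, 100, 5), (2, 50, 2),
    (3, 7, 1), (3, 50, 5),
]
similarity_rows = item_similarities(sample_ratings)
```

The nested loop compares every pair of items, which is fine for a toy dataset; at the scale described in the deck this pairwise step is the part that a distributed job would take over.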
  • 13. Collaborative Filtering
      • “People with similar taste to you liked these movies”
      • Collaborative filtering applies weights based on “peer” user preference.
      • Essentially it determines the best movie critics for you to follow
      • The items with the highest recommendation score are then output as tuples:
          • User ID [Item ID1:Score,…., Item IDn:Score]
      • Items with the highest recommendation score are the most relevant to this user
      • For user “Johny Sisklebert” (572), the two most highly recommended movies are “Seven” and “Donnie Brasco”:
          572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515]
          573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019]
          574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
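To give a feel for what "alternating" means in Alternating Least Squares, here is a deliberately tiny sketch. It is not MLlib's ALS: it assumes a single latent factor per user and per item (rank 1) so each least-squares step reduces to one division, and the names `fit_rank1_als` and `predict` and the sample ratings are hypothetical.

```python
def fit_rank1_als(ratings, n_iters=200, reg=1e-6):
    """ratings: list of (user_id, item_id, rating) tuples.

    Returns (user_factor, item_factor) dicts such that
    rating is approximated by user_factor[u] * item_factor[i].
    """
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}

    for _ in range(n_iters):
        # Hold item factors fixed; solve regularized least squares per user.
        for u in users:
            num = sum(r * item_f[i] for uu, i, r in ratings if uu == u)
            den = sum(item_f[i] ** 2 for uu, i, _ in ratings if uu == u) + reg
            user_f[u] = num / den
        # Hold user factors fixed; solve regularized least squares per item.
        for i in items:
            num = sum(r * user_f[u] for u, ii, r in ratings if ii == i)
            den = sum(user_f[u] ** 2 for u, ii, _ in ratings if ii == i) + reg
            item_f[i] = num / den
    return user_f, item_f


def predict(user_f, item_f, user_id, item_id):
    """Predicted rating for a (user, item) pair seen during fitting."""
    return user_f[user_id] * item_f[item_id]


# Hypothetical data: user 2 rates everything twice as high as user 1,
# and user 2's rating for item 30 is withheld.
ratings = [(1, 10, 1.0), (1, 20, 2.0), (1, 30, 2.5), (2, 10, 2.0), (2, 20, 4.0)]
user_f, item_f = fit_rank1_als(ratings)
predicted = predict(user_f, item_f, user_id=2, item_id=30)
```

In this toy case the withheld rating is recovered from the peer's behaviour. A production model would use more latent factors and a distributed solver, which is what a library implementation provides.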
  • 14. Recommendation Store
      • Serving recommendations needs to be instantaneous
      • The core of this solution is two reference tables:
          • Rec_Item_Similarity: Item_ID, Similar_Item, Similarity_Score
          • Rec_User_Item_Base: User_ID, Item_ID, Recommendation_Score
      • When called to make recommendations we query our store:
          • Rec_Item_Similarity based on the Item_ID they are viewing
          • Rec_User_Item_Base based on their User_ID
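The two lookups can be sketched with Python's built-in `sqlite3` module. The slides do not say which database backs the recommendation store, so SQLite, the index names and the helper functions `similar_items` and `user_recommendations` are assumptions for illustration; the table and column names come from the slide, and the sample rows are truncated from the outputs shown on slides 12 and 13.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Rec_Item_Similarity (
        Item_ID INTEGER, Similar_Item INTEGER, Similarity_Score REAL);
    CREATE TABLE Rec_User_Item_Base (
        User_ID INTEGER, Item_ID INTEGER, Recommendation_Score REAL);
    -- Indexes on the lookup keys keep reads fast (index names are hypothetical).
    CREATE INDEX idx_item ON Rec_Item_Similarity (Item_ID);
    CREATE INDEX idx_user ON Rec_User_Item_Base (User_ID);
""")
conn.executemany(
    "INSERT INTO Rec_Item_Similarity VALUES (?, ?, ?)",
    [(7, 100, 0.690951), (7, 50, 0.653299), (7, 117, 0.643701)],
)
conn.executemany(
    "INSERT INTO Rec_User_Item_Base VALUES (?, ?, ?)",
    [(572, 11, 5.0), (572, 293, 4.70718), (572, 8, 4.688335)],
)


def similar_items(conn, item_id, limit=10):
    """Items most similar to the one currently being viewed."""
    return conn.execute(
        "SELECT Similar_Item, Similarity_Score FROM Rec_Item_Similarity "
        "WHERE Item_ID = ? ORDER BY Similarity_Score DESC LIMIT ?",
        (item_id, limit),
    ).fetchall()


def user_recommendations(conn, user_id, limit=10):
    """Items with the highest peer-based score for this user."""
    return conn.execute(
        "SELECT Item_ID, Recommendation_Score FROM Rec_User_Item_Base "
        "WHERE User_ID = ? ORDER BY Recommendation_Score DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()


item_based = similar_items(conn, 7)
peer_based = user_recommendations(conn, 572)
```

Both tables are precomputed by the batch jobs, so serving a request is only two keyed reads.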
  • 15. Delivering Recommendations
      • So if Johny is viewing “12 Monkeys” we query our recommendation store and present the results
      • Item-Based:
          Item Similarity              Raw Score   Score
          Fargo                        0.691       1.000
          Star Wars                    0.653       0.946
          Rock, The                    0.644       0.932
          Pulp Fiction                 0.628       0.909
          Return of the Jedi           0.627       0.908
          Independence Day             0.618       0.894
          Willy Wonka                  0.603       0.872
          Mission: Impossible          0.597       0.864
          Silence of the Lambs, The    0.596       0.863
          Star Trek: First Contact     0.594       0.859
          Raiders of the Lost Ark      0.584       0.845
          Terminator, The              0.574       0.831
          Blade Runner                 0.571       0.826
          Usual Suspects, The          0.569       0.823
          Seven (Se7en)                0.569       0.823
      • Peers like these Movies:
          Item-Base (Peer)             Raw Score   Score
          Seven                        5.000       1.000
          Donnie Brasco                4.707       0.941
          Babe                         4.688       0.938
          Heat                         4.688       0.938
          To Kill a Mockingbird        4.686       0.937
          Jaws                         4.683       0.937
          Monty Python, Holy Grail     4.670       0.934
          Blade Runner                 4.670       0.934
          Get Shorty                   4.655       0.931
      • Best Recommendations:
          Top 10 Recommendations       Score
          Seven (Se7en)                1.823
          Blade Runner                 1.760
          Fargo                        1.000
          Star Wars                    0.946
          Donnie Brasco                0.941
          Babe                         0.938
          Heat                         0.938
          To Kill a Mockingbird        0.937
          Jaws                         0.937
          Monty Python, Holy Grail     0.934
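The numbers on this slide are consistent with a simple scheme: divide each raw score by the highest raw score in its list, then add the two normalized scores for titles that appear in both lists (0.823 + 1.000 = 1.823 for Seven). The sketch below assumes that scheme; the slides do not state it explicitly, and the function names `normalize` and `combine` are hypothetical. Only a subset of the slide's rows is used, with "Seven (Se7en)" and "Seven" treated as the same title.

```python
def normalize(raw_scores):
    """Scale positive raw scores so the best item in the list scores 1.0."""
    top = max(raw_scores.values())
    return {item: score / top for item, score in raw_scores.items()}


def combine(item_similarity_raw, peer_raw, top_n=10):
    """Sum normalized scores across both lists and return the top_n items."""
    combined = {}
    for scores in (normalize(item_similarity_raw), normalize(peer_raw)):
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + score
    ranked = sorted(combined.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Raw scores copied from the slide (subset of rows).
item_similarity_raw = {
    "Fargo": 0.691, "Star Wars": 0.653, "Blade Runner": 0.571, "Seven": 0.569,
}
peer_raw = {"Seven": 5.000, "Donnie Brasco": 4.707, "Blade Runner": 4.670}

best = combine(item_similarity_raw, peer_raw)
```

Summing rewards titles that both algorithms agree on, which is why Seven and Blade Runner rise above Fargo even though Fargo has the highest item similarity.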
  • 16. From Good to Great Recommendations
      • Note that the first 5 recommendations look pretty good, but the 6th result would have been “Babe”, the children's movie (OOPS!)
      • Tuning the algorithms might help: parameter changes, similarity measures.
      • How else can we make it better?
          1. Delivery filters
          2. Introduce additional algorithms such as K-Means
  • 17. Additional Algorithm – K-Means
      • “These movies are similar based on their attributes”
      • Treats items as coordinates
      • Places a number of random “centroids” and assigns the nearest items
      • Moves the centroids around based on average location
      • Process repeats until the assignments stop changing
      • We would use the major attributes of the Movie to create coordinate points:
          • Categories
          • Actors
          • Director
          • Synopsis Text
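The four steps on this slide map directly onto a short loop. The following is a minimal plain-Python sketch, not the implementation used on the project: it assumes items have already been turned into numeric coordinate tuples, it picks the initial centroids from randomly chosen items (one common choice), and the function name `kmeans` and the sample points are illustrative.

```python
import random


def kmeans(points, k, seed=0, max_iters=100):
    """Cluster coordinate tuples into k groups.

    Returns (assignments, centroids), where assignments[n] is the
    cluster index of points[n].
    """
    rng = random.Random(seed)
    # Place k "random" centroids by picking k distinct items as starting points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None

    for _ in range(max_iters):
        # Assign every item to its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

For movies, each coordinate would come from an encoded attribute such as a category flag, so two titles land in the same cluster when their attribute vectors are close.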
  • 18. Delivery Scoring and Filters
      • Apply assumptions to control the results of collaborative filtering
          • One or more categories must match
          • Only children's movies will be recommended for children's movies.
      • Category flags per movie:
          Movie                  Action  Adventure  Children's  Comedy  Crime  Drama  Film-Noir  Horror  Romance  Sci-Fi  Thriller
          Twelve Monkeys         0       0          0           0       0      1      0          0       0        1       0
          Babe                   0       0          1           1       0      1      0          0       0        0       0
          Seven (Se7en)          0       0          0           0       1      1      0          0       0        0       1
          Star Wars              1       1          0           0       0      0      0          0       1        1       0
          Blade Runner           0       0          0           0       0      0      1          0       0        1       0
          Fargo                  0       0          0           0       1      1      0          0       0        0       1
          Willy Wonka            0       1          1           1       0      0      0          0       0        0       0
          Monty Python           0       0          0           1       0      0      0          0       0        0       0
          Jaws                   1       0          0           0       0      0      0          1       0        0       0
          Heat                   1       0          0           0       1      0      0          0       0        0       1
          Donnie Brasco          0       0          0           0       1      1      0          0       0        0       0
          To Kill a Mockingbird  0       0          0           0       0      1      0          0       0        0       0
      • Similar logic could be applied to promote more favorable options
          • New Releases
          • Retail Case: Items that are on sale or overstocked
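The two filter rules can be expressed as a small predicate over the category flags. This is a sketch under assumptions: the genre sets are transcribed from a subset of the table on this slide, while the names `GENRES`, `passes_delivery_filter` and `filter_recommendations` are hypothetical.

```python
# Category sets for a few movies, transcribed from the slide's table.
GENRES = {
    "Twelve Monkeys": {"Drama", "Sci-Fi"},
    "Babe": {"Children's", "Comedy", "Drama"},
    "Willy Wonka": {"Adventure", "Children's", "Comedy"},
    "Jaws": {"Action", "Horror"},
    "Fargo": {"Crime", "Drama", "Thriller"},
    "Blade Runner": {"Film-Noir", "Sci-Fi"},
}


def passes_delivery_filter(viewed, candidate, genres=GENRES):
    """Apply the slide's two rules to one candidate recommendation."""
    viewed_genres = genres[viewed]
    candidate_genres = genres[candidate]
    # Rule 1: one or more categories must match.
    if not viewed_genres & candidate_genres:
        return False
    # Rule 2: only children's movies are recommended for children's movies.
    if "Children's" in viewed_genres and "Children's" not in candidate_genres:
        return False
    return True


def filter_recommendations(viewed, candidates, genres=GENRES):
    """Keep candidates that pass the filter, preserving their order."""
    return [
        c for c in candidates
        if c != viewed and passes_delivery_filter(viewed, c, genres)
    ]


for_twelve_monkeys = filter_recommendations(
    "Twelve Monkeys", ["Babe", "Jaws", "Fargo", "Blade Runner", "Willy Wonka"]
)
for_babe = filter_recommendations(
    "Babe", ["Twelve Monkeys", "Willy Wonka", "Fargo", "Jaws"]
)
```

With these flags "Babe" still passes for "Twelve Monkeys", because both carry the Drama category. Excluding it would need a stricter rule than the two listed, for example requiring more than one shared category, which would be an additional assumption.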
  • 19. Integrating K-Means into the process
      • Movies recommended by more than 1 algorithm are the most highly rated
      • [Diagram labels: Collaborative Filter; K-Means: Similar; Content Filter; Best Recommendations]
  • 20. Sophisticated Recommendation Model
      • [Diagram labels: Peer Based; Item Clustering; Corporate Deals/Offers; Customer Behavior; Market/Store; Recommendation]
      • Questions shown on the diagram:
          • What are people with similar characteristics buying?
          • What items are we promoting at time of sale?
          • What items are being promoted by the Store or Market?
          • What items have you bought in the past?
          • What did people who ordered these items also order?
      • The solution allows balancing of algorithms to attain the most effective recommendation
  • 21. Summary
      • Hadoop and Spark can provide a relatively low cost and extremely scalable platform for recommendations
      • Spark, with MLlib, offers a great library of established Machine Learning algorithms, reducing development efforts
      • A good recommendation system combines Collaborative and Content filtering algorithms and custom business rules
      • While Spark is still maturing, Mahout or roll-your-own algorithms may be needed.
  • 22. Thank You Joe Caserta President, Caserta Concepts joe@casertaconcepts.com (914) 261-3648 @joe_Caserta

Editor's Notes

  1. Robotman was actually the first cyborg superhero. Robert Crane was fatally shot and had his brain placed in a super strong robot body. The cybernetic Robotman lived on, using a rubber mask and flesh-like body suit to disguise himself as Paul Dennis. The new hero used his cyborg might to smash crime during DC’s Golden Age. First Appearance: Star Spangled Comics #7 (1942)
  2. Cloudera, Talend, Datameer
  3. Need to talk through the vectors A & B and what we want to express