Focused Expertise                     Industries Served

 • Data Warehouse Design              •   Healthcare / Insurance
 • Business Intelligence              •   Financial Services
 • Big Data Analytics                 •   Retail / eCommerce
 • Search / Relevance                 •   Digital Media / Marketing
 • Infographics                       •   K-12 / Higher Education

 445 Park Ave New York, NY | 1-855-755-2246 | info@casertaconcepts.com

Big Data
Analytics
Recommendations
• Your customers expect them
   • Good recommendations make life easier
   • Help them find information, products, and services they might not
     have thought of


• What makes a good recommendation?
  • Relevant but not obvious
  • Sense of “surprise”



[Figure: a customer has just bought a 23” LED TV. One row of suggestions shows 23”, 24”, and 25” LED TVs; the other shows Blu-Ray, Home Theater, and HDMI Cables.]
Where can recommendation
engines be found?
• Recommendation engines are used in a wide variety of industries
 and applications:
  • Travel
  • Service Industry
  • Music/Online radio
  • TV and Video
  • Online Publications
  • Retail
   ...and countless others


   Our Use Case: Movie Ratings!
Our Goal
• Create a powerful, scalable recommendation engine with minimal
 development

• Make recommendations to users instantaneously, as they are browsing
 movie titles

• Recommendations must be relevant to the movie the user is currently
 viewing.
                       OOPS! – too much surprise!
How do we hope to accomplish this?
Hadoop – distributed file system and processing platform
Mahout – collection of machine learning libraries

We will leverage 2 algorithms:
• Item Similarity – how similar this particular movie is to other
  movies, based on usage
• Item-Based Recommender – predicts an individual's
  preference based on their peers' ratings

• Both algorithms only require a simple dataset of 3 fields:
 “User ID” , “Item ID”, “Rating”
Item Similarity – Context, Content Filtering
“People who liked this movie liked these as well”

• Item Similarity builds a matrix of items to other items and calculates
 similarity (based on user rating)

• The most similar items are then output as a list:
  • Item ID, Similar Item ID, Similarity Score
  • Items with the highest score are most similar
  • In this example, users who liked “Twelve Monkeys” (7) also liked “Fargo” (100)

               7         100        0.690951001800917
               7         50         0.653299445638532
               7         117        0.643701303640083
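The Item Similarity step can be sketched in a few lines of Python. This is not Mahout's implementation: the ratings are made up, and cosine similarity is an assumed choice of measure (the slides do not say which one was used). It only illustrates how (User ID, Item ID, Rating) rows turn into (Item ID, Similar Item ID, Similarity Score) rows.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (user_id, item_id, rating) rows; not taken from the real dataset.
ratings = [
    (1, 7, 5.0), (1, 100, 4.0), (1, 50, 1.0),
    (2, 7, 4.0), (2, 100, 5.0), (2, 50, 2.0),
    (3, 7, 1.0), (3, 100, 2.0), (3, 50, 5.0),
]

def item_vectors(rows):
    """Group ratings by item: item_id -> {user_id: rating}."""
    vectors = defaultdict(dict)
    for user, item, rating in rows:
        vectors[item][user] = rating
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def similar_items(rows, item_id):
    """Return (item_id, similar_item_id, score) rows, most similar first."""
    vectors = item_vectors(rows)
    scored = [
        (item_id, other, cosine(vectors[item_id], vec))
        for other, vec in vectors.items()
        if other != item_id
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)

for row in similar_items(ratings, 7):
    print(*row)
```

With these toy ratings, item 100 comes out as the most similar to item 7 because the same users rated both highly.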
Item-Based – Peer, Collaborative Filtering
“People with similar taste to you liked these movies”
• Item-Based takes the Item Similarity matrix and weights it based on
 “peer” user preference.

• Essentially it determines the best movie critics for you to follow


• The items with the highest recommendation score are then output as tuples
  • User ID [Item ID1:Score, …, Item IDn:Score]
  • Items with the highest recommendation score are the most relevant to this user
  • For user “Johny Sisklebert” (572), the two most highly recommended movies are
     “Seven” and “Donnie Brasco”
572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515]
573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019]
574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
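As a rough illustration of how a recommendation score can be derived from the similarity matrix, the sketch below predicts a user's rating for an unseen item as a similarity-weighted average of the ratings that user has already given. This is the textbook item-based formula, assumed here for illustration; the similarity values and ratings are hypothetical, and Mahout's job may differ in detail.

```python
def recommend(user_ratings, similarity, top_n=10):
    """Rank unseen items by similarity-weighted average of the user's ratings.

    user_ratings: {item_id: rating} for one user
    similarity:   {(rated_item_id, candidate_item_id): score}
    """
    candidates = {
        cand for (rated, cand) in similarity
        if rated in user_ratings and cand not in user_ratings
    }
    scores = {}
    for cand in candidates:
        weighted_sum = 0.0
        weight_total = 0.0
        for item, rating in user_ratings.items():
            sim = similarity.get((item, cand), 0.0)
            weighted_sum += sim * rating
            weight_total += abs(sim)
        if weight_total > 0:
            scores[cand] = weighted_sum / weight_total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical inputs: one user's ratings and a small similarity table.
user_ratings = {7: 5.0, 100: 3.0}
similarity = {
    (7, 50): 0.8, (100, 50): 0.2,
    (7, 117): 0.1, (100, 117): 0.9,
}

recs = recommend(user_ratings, similarity)
# Same layout as the output above: User ID [Item ID:Score, ...]
print(572, "[" + ",".join(f"{item}:{score:.5f}" for item, score in recs) + "]")
```

Item 50 ranks first here because it is most similar to the movie this user rated highest.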
Recommendation Store
• Serving recommendations needs to be instantaneous
    We need a database!

• The core of this solution is two reference tables:


    Rec_Item_Similarity          Rec_User_Item_Base
    Item_ID                      User_ID
    Similar_Item                 Item_ID
    Similarity_Score             Recommendation_Score


• When called to make recommendations, we query our store:
  • Rec_Item_Similarity, by the Item_ID the user is viewing
  • Rec_User_Item_Base, by the user's User_ID
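A minimal sketch of the two reference tables and the two lookups follows. The slides do not name a database product, so an in-memory SQLite database is assumed purely for illustration; the column types, index names, and helper function names are also assumptions. The sample rows are the ones shown in the earlier output listings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: any low-latency store would do
conn.executescript("""
CREATE TABLE Rec_Item_Similarity (
    Item_ID INTEGER, Similar_Item INTEGER, Similarity_Score REAL);
CREATE TABLE Rec_User_Item_Base (
    User_ID INTEGER, Item_ID INTEGER, Recommendation_Score REAL);
-- Indexes on the lookup keys keep reads fast at serving time.
CREATE INDEX idx_item_similarity ON Rec_Item_Similarity (Item_ID);
CREATE INDEX idx_user_item_base ON Rec_User_Item_Base (User_ID);
""")

conn.executemany(
    "INSERT INTO Rec_Item_Similarity VALUES (?, ?, ?)",
    [(7, 100, 0.690951001800917),
     (7, 50, 0.653299445638532),
     (7, 117, 0.643701303640083)],
)
conn.executemany(
    "INSERT INTO Rec_User_Item_Base VALUES (?, ?, ?)",
    [(572, 11, 5.0), (572, 293, 4.70718), (572, 8, 4.688335)],
)

def similar_to(item_id, limit=10):
    """Items similar to the one being viewed, best first."""
    return conn.execute(
        "SELECT Similar_Item, Similarity_Score FROM Rec_Item_Similarity "
        "WHERE Item_ID = ? ORDER BY Similarity_Score DESC LIMIT ?",
        (item_id, limit),
    ).fetchall()

def for_user(user_id, limit=10):
    """Peer-based recommendations for this user, best first."""
    return conn.execute(
        "SELECT Item_ID, Recommendation_Score FROM Rec_User_Item_Base "
        "WHERE User_ID = ? ORDER BY Recommendation_Score DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()

print(similar_to(7))
print(for_user(572))
```

Because both tables are precomputed by the batch jobs, serving a page only costs two indexed reads.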
Delivering Recommendations
    So if Johny is viewing “12 Monkeys” we query our
    recommendation store and present the results

Item Similarity                Raw Score    Score
Fargo                              0.691    1.000
Star Wars                          0.653    0.946
Rock, The                          0.644    0.932
Pulp Fiction                       0.628    0.909
Return of the Jedi                 0.627    0.908
Independence Day                   0.618    0.894
Willy Wonka                        0.603    0.872
Mission: Impossible                0.597    0.864
Silence of the Lambs, The          0.596    0.863
Star Trek: First Contact           0.594    0.859
Raiders of the Lost Ark            0.584    0.845
Terminator, The                    0.574    0.831
Blade Runner                       0.571    0.826
Usual Suspects, The                0.569    0.823
Seven (Se7en)                      0.569    0.823

Item-Base (Peer)               Raw Score    Score
(Item-Based: peers like these movies)
Seven                              5.000    1.000
Donnie Brasco                      4.707    0.941
Babe                               4.688    0.938
Heat                               4.688    0.938
To Kill a Mockingbird              4.686    0.937
Jaws                               4.683    0.937
Monty Python, Holy Grail           4.670    0.934
Blade Runner                       4.670    0.934
Get Shorty                         4.655    0.931

Top 10 Recommendations (Best Recommendations)
Seven (Se7en)                      1.823
Blade Runner                       1.760
Fargo                              1.000
Star Wars                          0.946
Donnie Brasco                      0.941
Babe                               0.938
Heat                               0.938
To Kill a Mockingbird              0.937
Jaws                               0.937
Monty Python, Holy Grail           0.934
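The numbers above are consistent with a simple blending rule: divide each raw score by the top raw score in its list, then add the normalized scores of any movie that appears in both lists. The sketch below assumes that rule (it is inferred from the figures, not stated on the slide) and uses a subset of the rounded raw scores, so results can differ from the slide in the third decimal place.

```python
def normalize(raw_scores):
    """Scale a list's raw scores so its best item scores 1.0."""
    top = max(raw_scores.values())
    return {title: score / top for title, score in raw_scores.items()}

def combine(*score_lists):
    """Sum normalized scores per title across lists; best total first."""
    totals = {}
    for scores in score_lists:
        for title, score in normalize(scores).items():
            totals[title] = totals.get(title, 0.0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Subset of the raw scores shown in the tables above.
item_similarity = {
    "Fargo": 0.691, "Star Wars": 0.653,
    "Blade Runner": 0.571, "Seven (Se7en)": 0.569,
}
item_based = {
    "Seven (Se7en)": 5.000, "Donnie Brasco": 4.707,
    "Babe": 4.688, "Blade Runner": 4.670,
}

ranked = combine(item_similarity, item_based)
for title, score in ranked:
    print(f"{title:<20} {score:.3f}")
```

Movies found by both algorithms, such as "Seven (Se7en)" and "Blade Runner", rise to the top because they collect a score from each list.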
From Good to Great Recommendations
• Note that the first 5 recommendations look pretty good
    …but the 6th result would have been “Babe”, the children's movie
                                                   OOPS!




• Tuning the algorithms might help: parameter changes, similarity
 measures.

• How else can we make it better?
1. Delivery filters
2. Introduce additional algorithms such as K-Means or Fuzzy K-Means
Delivery Scoring and Filters
   Apply assumptions to control the results of collaborative filtering
   • One or more categories must match
   • Only children's movies will be recommended for children's movies.


                        Action   Adventure Children's Comedy   Crime   Drama   Film-Noir   Horror   Romance   Sci-Fi   Thriller
Twelve Monkeys            0         0         0        0        0       1         0          0        0        1          0
Babe                      0         0         1        1        0       1         0          0        0        0          0
Seven (Se7en)             0         0         0        0        1       1         0          0        0        0          1
Star Wars                 1         1         0        0        0       0         0          0        1        1          0
Blade Runner              0         0         0        0        0       0         1          0        0        1          0
Fargo                     0         0         0        0        1       1         0          0        0        0          1
Willy Wonka               0         1         1        1        0       0         0          0        0        0          0
Monty Python              0         0         0        1        0       0         0          0        0        0          0
Jaws                      1         0         0        0        0       0         0          1        0        0          0
Heat                      1         0         0        0        1       0         0          0        0        0          1
Donnie Brasco             0         0         0        0        1       1         0          0        0        0          0
To Kill a Mockingbird     0         0         0        0        0       1         0          0        0        0          0


     Similar logic could be applied to promote more favorable options
     • New Releases
     • Retail Case: Items that are on-sale, overstock
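The two filter rules can be sketched against the category matrix above. One assumption is made loudly here: the children's rule is read as symmetric, so a children's movie is also withheld when the viewer is on a non-children's title (that reading is what removes "Babe" from the "Twelve Monkeys" results). The function and variable names are illustrative.

```python
# Categories per movie, transcribed from the matrix above.
GENRES = {
    "Twelve Monkeys": {"Drama", "Sci-Fi"},
    "Babe": {"Children's", "Comedy", "Drama"},
    "Seven (Se7en)": {"Crime", "Drama", "Thriller"},
    "Star Wars": {"Action", "Adventure", "Romance", "Sci-Fi"},
    "Blade Runner": {"Film-Noir", "Sci-Fi"},
    "Fargo": {"Crime", "Drama", "Thriller"},
    "Monty Python": {"Comedy"},
    "Jaws": {"Action", "Horror"},
    "Heat": {"Action", "Crime", "Thriller"},
    "Donnie Brasco": {"Crime", "Drama"},
    "To Kill a Mockingbird": {"Drama"},
}

def passes_filter(viewing, candidate):
    """Apply both delivery rules to one candidate recommendation."""
    current, other = GENRES[viewing], GENRES[candidate]
    shares_category = bool(current & other)  # one or more categories match
    # Assumed symmetric reading of the children's rule.
    same_audience = ("Children's" in current) == ("Children's" in other)
    return shares_category and same_audience

# Unfiltered top 10 for "Twelve Monkeys", in ranked order.
top_10 = [
    "Seven (Se7en)", "Blade Runner", "Fargo", "Star Wars", "Donnie Brasco",
    "Babe", "Heat", "To Kill a Mockingbird", "Jaws", "Monty Python",
]

filtered = [title for title in top_10 if passes_filter("Twelve Monkeys", title)]
print(filtered)
```

Under these rules "Babe" is dropped by the children's rule, while "Heat", "Jaws", and "Monty Python" are dropped because they share no category with "Twelve Monkeys".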
Additional Algorithm – K-Means
  “These movies are similar based on their attributes”

 • Treats items as coordinates
 • Places a number of random
   “centroids” and assigns the
   nearest items
 • Moves the centroids around based
   on average location
 • Process repeats until the
   assignments stop changing


We would use the major attributes of the movie to create coordinate points:
• Categories
• Actors
• Director
• Synopsis Text
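The loop described above (place centroids, assign the nearest items, move the centroids to the average location, repeat until assignments stop changing) can be sketched in plain Python. This is an illustration rather than Mahout's K-Means job: only the category flags are used as coordinates, the five movies and their vectors come from the category matrix shown earlier, and the seed and k value are arbitrary.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Return a cluster index for each point using basic k-means."""
    rng = random.Random(seed)
    # Start from k randomly chosen points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment

# Column order: Action, Adventure, Children's, Comedy, Crime, Drama,
# Film-Noir, Horror, Romance, Sci-Fi, Thriller
titles = ["Twelve Monkeys", "Babe", "Seven (Se7en)", "Fargo", "Willy Wonka"]
points = [
    (0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0),
    (0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
    (0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
    (0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
]

labels = kmeans(points, k=2)
for title, label in zip(titles, labels):
    print(label, title)
```

Movies with identical category flags, such as "Seven (Se7en)" and "Fargo", always land in the same cluster; adding actors, director, and synopsis terms as extra coordinates would separate titles more finely.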
Integrating K-Means into the process
Movies recommended by more than 1 algorithm are the most highly rated




[Figure: the three result sets, “Item-Based”, “Item Similarity”, and “K-Means: Similar”, combine into “Best Recommendations”.]
Summary
• Mahout and Hadoop can provide a relatively low cost and
 extremely scalable platform for recommendations

• Mahout offers a great collection of established machine learning
 algorithms, reducing development effort

• A good recommendation system combines Collaborative and
 Content filtering algorithms


             elliott@casertaconcepts.com

Webinar Presentation: Building a Big Data Recommendation Engine

  • 1. Focused Expertise Industries Served • Data Warehouse Design • Healthcare / Insurance • Business Intelligence • Financial Services • Big Data Analytics • Retail / eCommerce • Search / Relevance • Digital Media / Marketing • Infographics • K-12 / Higher Education 445 Park Ave New York, NY | 1-855-755-2246 | info@casertaconcepts.com Big Data Analytics
  • 2. Recommendations • Your customers expect them • Good recommendations make life easier • Help them find information, products, and services they might not have thought of • What makes a good recommendation? • Relevant but not obvious • Sense of “surprise” SOLD!! 23” LED TV 24” LED TV 25” LED TV 23” LED TV`` Blu-Ray Home Theater HDMI Cables
  • 3. Where can recommendations engines be found? • Applications can be found in a wide variety of industries and applications: • Travel • Service Industry • Music/Online radio • TV and Video • Online Publications • Retail ..and countless others Our Use Case: Movie Ratings!
  • 4. Our Goal • Create a powerful, scalable recommendation engine with minimal development • Make recommendations to users as they are browsing movie titles - instantaneously • Recommendation must have context to the movie they are currently viewing. OOPS! – too much surprise!
  • 5. How do we hope to accomplish this?
      Hadoop: distributed file system and processing platform
      Mahout: collection of machine learning libraries
      We will leverage 2 algorithms:
      • Item Similarity: how similar is this particular movie to other movies, based on usage
      • Item-Based Recommender: predict an individual's preference based on their peers' ratings
      • Both algorithms require only a simple dataset of 3 fields: “User ID”, “Item ID”, “Rating”
  • 6. Item Similarity: Context, Content Filtering
      “People who liked this movie liked these as well”
      • Item Similarity builds a matrix of items to other items and calculates similarity (based on user ratings)
      • The most similar items are then output as a list:
        • Item ID, Similar Item ID, Similarity Score
      • Items with the highest score are the most similar
      • In this example, users who liked “Twelve Monkeys” (7) also liked “Fargo” (100)
      7   100   0.690951001800917
      7   50    0.653299445638532
      7   117   0.643701303640083
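The item-to-item matrix described above can be sketched in a few lines of plain Python. This is only an illustration, not the Mahout job itself: cosine similarity is assumed (the slides do not say which similarity measure was used), and the function name `item_similarity` and the toy ratings are made up for the example.

```python
from collections import defaultdict

def item_similarity(ratings, top_n=3):
    """Compute the top-N most similar items for every item.

    ratings: iterable of (user_id, item_id, rating) triples.
    Returns {item_id: [(similar_item_id, score), ...]}, best score first.
    Similarity measure: cosine over each item's rating vector (an assumption).
    """
    by_item = defaultdict(dict)  # item_id -> {user_id: rating}
    for user, item, rating in ratings:
        by_item[item][user] = rating

    # Vector length of each item's ratings, used to normalise the dot product.
    norms = {i: sum(r * r for r in users.values()) ** 0.5
             for i, users in by_item.items()}

    result = {}
    for a, ratings_a in by_item.items():
        scores = []
        for b, ratings_b in by_item.items():
            if a == b:
                continue
            shared = ratings_a.keys() & ratings_b.keys()
            if not shared or norms[a] == 0 or norms[b] == 0:
                continue  # no common raters, nothing to compare
            dot = sum(ratings_a[u] * ratings_b[u] for u in shared)
            scores.append((b, dot / (norms[a] * norms[b])))
        scores.sort(key=lambda pair: pair[1], reverse=True)
        result[a] = scores[:top_n]
    return result

# Toy (user, item, rating) triples; the values are invented for illustration.
sample_ratings = [
    (1, 7, 5), (1, 100, 5), (1, 50, 2),
    (2, 7, 4), (2, 100, 4),
    (3, 7, 1), (3, 50, 5),
]

for item, similar in item_similarity(sample_ratings).items():
    for other, score in similar:
        print(item, other, score)  # Item ID, Similar Item ID, Similarity Score
```

Comparing every item with every other item is quadratic in the number of items, which is fine for a toy dataset and is the kind of work the deck hands to Hadoop at scale.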
  • 7. Item-Base: Peer, Collaborative Filtering
      “People with similar taste to you liked these movies”
      • Item-Base takes the Item Similarity matrix and weights it based on “peer” user preference
      • Essentially, it determines the best movie critics for you to follow
      • The items with the highest recommendation score are then output as tuples:
        • User ID [Item ID1:Score,…., Item IDn:Score]
      • Items with the highest recommendation score are the most relevant to this user
      • For user “Johny Sisklebert” (572), the two most highly recommended movies are “Seven” and “Donnie Brasco”
      572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515]
      573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019]
      574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
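A rough sketch of how an item-based recommender can turn the similarity matrix into per-user scores: each unseen item is scored by a similarity-weighted average of the ratings the user has already given. This is the textbook formulation and is assumed here; the slides do not spell out the exact weighting. The function name `recommend` and all numbers in the example are hypothetical.

```python
def recommend(user_ratings, similarity, top_n=10):
    """Predict scores for items the user has not rated yet.

    user_ratings: {item_id: rating} for a single user.
    similarity:   {candidate_item_id: {rated_item_id: similarity_score}}.
    Returns [(item_id, predicted_score), ...], best score first.
    """
    predictions = []
    for candidate, neighbours in similarity.items():
        if candidate in user_ratings:
            continue  # never recommend something the user already rated
        weighted_sum = 0.0
        weight_total = 0.0
        for rated_item, rating in user_ratings.items():
            sim = neighbours.get(rated_item)
            if sim is None:
                continue  # no similarity known between these two items
            weighted_sum += sim * rating
            weight_total += abs(sim)
        if weight_total > 0:
            predictions.append((candidate, weighted_sum / weight_total))
    predictions.sort(key=lambda pair: pair[1], reverse=True)
    return predictions[:top_n]

# Made-up numbers: the user rated items 7 and 100; items 11, 293 and 8 are candidates.
user_572 = {7: 5.0, 100: 4.0}
similarity = {
    11: {7: 0.8, 100: 0.6},
    293: {7: 0.5, 100: 0.9},
    8: {7: 0.1, 100: 0.7},
}

scores = recommend(user_572, similarity)
# Same shape as the slide's output: User ID [Item ID1:Score,...,Item IDn:Score]
print(572, "[" + ",".join(f"{item}:{score:.5f}" for item, score in scores) + "]")
```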
  • 8. Recommendation Store
      • Serving recommendations needs to be instantaneous, so we need a database!
      • The core of this solution is two reference tables:
          Rec_Item_Similarity      Rec_User_Item_Base
          Item_ID                  User_ID
          Similar_Item             Item_ID
          Similarity_Score         Recommendation_Score
      • When called to make recommendations, we query our store:
        • Rec_Item_Similarity, based on the Item_ID they are viewing
        • Rec_User_Item_Base, based on their User_ID
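The two reference tables and their lookups could look roughly like this. The table and column names come from the slide; everything else is an assumption made for the sketch: SQLite stands in for whatever database was actually used, the column types and primary keys are guesses, and the helper names (`build_store`, `similar_items`, `user_recommendations`) are hypothetical.

```python
import sqlite3

def build_store(item_rows, user_rows):
    """Create an in-memory store and load the two reference tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Rec_Item_Similarity (
            Item_ID          INTEGER NOT NULL,
            Similar_Item     INTEGER NOT NULL,
            Similarity_Score REAL    NOT NULL,
            PRIMARY KEY (Item_ID, Similar_Item)
        );
        CREATE TABLE Rec_User_Item_Base (
            User_ID              INTEGER NOT NULL,
            Item_ID              INTEGER NOT NULL,
            Recommendation_Score REAL    NOT NULL,
            PRIMARY KEY (User_ID, Item_ID)
        );
    """)
    conn.executemany("INSERT INTO Rec_Item_Similarity VALUES (?, ?, ?)", item_rows)
    conn.executemany("INSERT INTO Rec_User_Item_Base VALUES (?, ?, ?)", user_rows)
    conn.commit()
    return conn

def similar_items(conn, item_id):
    """Items similar to the one being viewed, best first."""
    return conn.execute(
        "SELECT Similar_Item, Similarity_Score FROM Rec_Item_Similarity "
        "WHERE Item_ID = ? ORDER BY Similarity_Score DESC", (item_id,)
    ).fetchall()

def user_recommendations(conn, user_id):
    """Peer-based recommendations for this user, best first."""
    return conn.execute(
        "SELECT Item_ID, Recommendation_Score FROM Rec_User_Item_Base "
        "WHERE User_ID = ? ORDER BY Recommendation_Score DESC", (user_id,)
    ).fetchall()

# Rows taken from the sample output on the two algorithm slides.
conn = build_store(
    item_rows=[(7, 100, 0.690951001800917), (7, 50, 0.653299445638532),
               (7, 117, 0.643701303640083)],
    user_rows=[(572, 11, 5.0), (572, 293, 4.70718), (572, 8, 4.688335)],
)
print(similar_items(conn, 7))
print(user_recommendations(conn, 572))
```

Both lookups hit the leading column of a primary key, so each request is a simple indexed read; the heavy computation stays in the batch jobs that fill the tables.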
  • 9. Delivering Recommendations
      So if Johny is viewing “12 Monkeys”, we query our recommendation store and present the results

      Item Similarity               Raw Score   Score
      Fargo                         0.691       1.000
      Star Wars                     0.653       0.946
      Rock, The                     0.644       0.932
      Pulp Fiction                  0.628       0.909
      Return of the Jedi            0.627       0.908
      Independence Day              0.618       0.894
      Willy Wonka                   0.603       0.872
      Mission: Impossible           0.597       0.864
      Silence of the Lambs, The     0.596       0.863
      Star Trek: First Contact      0.594       0.859
      Raiders of the Lost Ark       0.584       0.845
      Terminator, The               0.574       0.831
      Blade Runner                  0.571       0.826
      Usual Suspects, The           0.569       0.823
      Seven (Se7en)                 0.569       0.823

      Item-Base (Peer)              Raw Score   Score
      Seven                         5.000       1.000
      Donnie Brasco                 4.707       0.941
      Babe                          4.688       0.938
      Heat                          4.688       0.938
      To Kill a Mockingbird         4.686       0.937
      Jaws                          4.683       0.937
      Monty Python, Holy Grail      4.670       0.934
      Blade Runner                  4.670       0.934
      Get Shorty                    4.655       0.931

      Top 10 Recommendations        Score
      Seven (Se7en)                 1.823
      Blade Runner                  1.760
      Fargo                         1.000
      Star Wars                     0.946
      Donnie Brasco                 0.941
      Babe                          0.938
      Heat                          0.938
      To Kill a Mockingbird         0.937
      Jaws                          0.937
      Monty Python, Holy Grail      0.934

      (Diagram labels: “Item-Based: Peers like these Movies”, “Best Recommendations”)
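The numbers on this slide are consistent with a simple merge: divide each list's raw scores by that list's top raw score, then add the normalised scores of any movie that appears in both lists (Seven: 0.823 + 1.000 = 1.823; Blade Runner: 0.826 + 0.934 = 1.760). The sketch below assumes that reading, which is inferred from the figures rather than stated in the deck. The function names are hypothetical, and only a handful of the slide's rows are used.

```python
def normalize(raw_scores):
    """Scale raw scores so the best item in the list gets 1.0."""
    if not raw_scores:
        return {}
    top = max(raw_scores.values())
    return {item: score / top for item, score in raw_scores.items()}

def combine(*score_lists, top_n=10):
    """Sum the normalised scores per item across all lists, best first."""
    total = {}
    for scores in score_lists:
        for item, score in normalize(scores).items():
            total[item] = total.get(item, 0.0) + score
    ranked = sorted(total.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# A subset of the raw scores shown on the slide (rounded to three decimals there).
item_similarity_raw = {
    "Fargo": 0.691,
    "Star Wars": 0.653,
    "Blade Runner": 0.571,
    "Seven (Se7en)": 0.569,
}
item_base_raw = {
    "Seven (Se7en)": 5.000,
    "Donnie Brasco": 4.707,
    "Blade Runner": 4.670,
}

for title, score in combine(item_similarity_raw, item_base_raw):
    print(f"{title:<20} {score:.3f}")
```

Because every list is scaled to a maximum of 1.0, a movie found by two algorithms can always outrank the best movie found by only one, which is how "Seven" and "Blade Runner" rise to the top.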
  • 10. From Good to Great Recommendations
      • Note that the first 5 recommendations look pretty good...
        but the 6th result would have been “Babe”, the children's movie. OOPS!
      • Tuning the algorithms might help: parameter changes, similarity measures
      • How else can we make it better?
        1. Delivery filters
        2. Introduce additional algorithms such as K-Means or Fuzzy K-Means
  • 11. Delivery Scoring and Filters
      Apply assumptions to control the results of collaborative filtering
      • One or more categories must match
      • Only children's movies will be recommended for children's movies

      Movie                  Action  Adventure  Children's  Comedy  Crime  Drama  Film-Noir  Horror  Romance  Sci-Fi  Thriller
      Twelve Monkeys         0       0          0           0       0      1      0          0       0        1       0
      Babe                   0       0          1           1       0      1      0          0       0        0       0
      Seven (Se7en)          0       0          0           0       1      1      0          0       0        0       1
      Star Wars              1       1          0           0       0      0      0          0       1        1       0
      Blade Runner           0       0          0           0       0      0      1          0       0        1       0
      Fargo                  0       0          0           0       1      1      0          0       0        0       1
      Willy Wonka            0       1          1           1       0      0      0          0       0        0       0
      Monty Python           0       0          0           1       0      0      0          0       0        0       0
      Jaws                   1       0          0           0       0      0      0          1       0        0       0
      Heat                   1       0          0           0       1      0      0          0       0        0       1
      Donnie Brasco          0       0          0           0       1      1      0          0       0        0       0
      To Kill a Mockingbird  0       0          0           0       0      1      0          0       0        0       0

      Similar logic could be applied to promote more favorable options
      • New Releases
      • Retail Case: items that are on sale or overstocked
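The two filter rules can be expressed as a small predicate over the category table. The genre sets below are read off the slide's 0/1 matrix; the function names are hypothetical. One interpretation is assumed for the second rule: the Children's flag must agree between the movie being viewed and the candidate, so that "Babe" is not offered next to "Twelve Monkeys" even though both are tagged Drama.

```python
# Genre sets for a few movies, taken from the category matrix on the slide.
GENRES = {
    "Twelve Monkeys": {"Drama", "Sci-Fi"},
    "Babe": {"Children's", "Comedy", "Drama"},
    "Seven (Se7en)": {"Crime", "Drama", "Thriller"},
    "Star Wars": {"Action", "Adventure", "Romance", "Sci-Fi"},
    "Willy Wonka": {"Adventure", "Children's", "Comedy"},
    "Jaws": {"Action", "Horror"},
}

def passes_filters(viewing, candidate, genres=GENRES):
    """Return True if `candidate` may be shown while `viewing` is on screen."""
    viewed, offered = genres[viewing], genres[candidate]
    # Rule 1: one or more categories must match.
    if not viewed & offered:
        return False
    # Rule 2 (assumed reading): children's titles only pair with children's titles.
    return ("Children's" in viewed) == ("Children's" in offered)

def filter_recommendations(viewing, candidates, genres=GENRES):
    """Keep the original ranking, dropping candidates that fail the filters."""
    return [c for c in candidates if passes_filters(viewing, c, genres)]

print(filter_recommendations(
    "Twelve Monkeys", ["Seven (Se7en)", "Babe", "Star Wars", "Jaws"]))
```

The filter runs at delivery time on an already ranked list, so it can be changed without recomputing the stored recommendations.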
  • 12. Additional Algorithm: K-Means
      “These movies are similar based on their attributes”
      • Treats items as coordinates
      • Places a number of random “centroids” and assigns the nearest items
      • Moves the centroids around based on average location
      • Process repeats until the assignments stop changing
      We would use the major attributes of the movie to create coordinate points:
      • Categories
      • Actors
      • Director
      • Synopsis Text
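The four steps above map directly onto a short loop. This is a minimal plain-Python sketch, not Mahout's implementation; it assumes squared Euclidean distance and picks the starting centroids by sampling existing points (a common variant of "random centroids"). The function names and the toy points are made up for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (assignment, centroids), where assignment[i] is the cluster
    index of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Toy 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

For movies, each coordinate would be an encoded attribute such as a category flag; text attributes like the synopsis would need to be turned into numbers first, which this sketch does not attempt.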
  • 13. Integrating K-Means into the process
      Movies recommended by more than 1 algorithm are the most highly rated
      (Diagram labels: “K-Means: Similar”, “Item-Based”, “Item Similarity”, “Best Recommendations”)
  • 14. Summary
      • Mahout and Hadoop can provide a relatively low-cost and extremely scalable platform for recommendations
      • Mahout offers a great collection of established machine learning libraries, reducing development effort
      • A good recommendation system combines Collaborative and Content filtering algorithms
      elliott@casertaconcepts.com