SlideShare a Scribd company logo
Big, Practical Recommendations
with Alternating Least Squares

Sean Owen • Apache Mahout / Myrrix.com
WHERE’S BIG LEARNING?
 Next: Application Layer
    Analytics
    Machine Learning
                                     Applications
 Like Apache Mahout
    Common Big Data app today       Processing
    Clustering, recommenders,
     classifiers on Hadoop            Database
    Free, open source; not mature

 Where’s commercialized               Storage
  Big Learning?
A RECOMMENDER SHOULD …
 Answer in Real-time                Accept Diverse Input
    Ingest new data, now               Not just people and products
    Modify recommendations based       Not just explicit ratings
      on newest data                    Clicks, views, buys
    No “cold start” for new data
                                        Side information
 Scale Horizontally                 Be “Pretty Accurate”
    For queries per second
    For size of data set
NEED: 2-TIER ARCHITECTURE
 Real-time Serving Layer
    Quick results based on
      precomputed model
    Incremental update
    Partitionable for scale

 Batch Computation Layer
    Builds model
    Scales out (on Hadoop?)
    Asynchronous, occasional,
      long-lived runs
A PRACTICAL ALGORITHM

MATRIX FACTORIZATION                BENEFITS
 Factor user-item matrix to         Models intuition
  user-feature + feature-item        Factorization is batch
  matrix                              parallelizable
 Well understood in ML, as:         Reconstruction (recs) in
    Principal Component Analysis     low-dimension is fast
    Latent Semantic Indexing
                                     Allows projection of new data
 Several algorithms, like:             Cold start solution
    Singular Value Decomposition       Approximate update solution
    Alternating Least Squares
A PRACTICAL IMPLEMENTATION
ALTERNATING LEAST
SQUARES                              BENEFITS
 Simple factorization P ≈ X YT       Parallelizable by row --
 Approximate: X, Y are                very Hadoop-friendly
  “skinny” (low-rank)                 Iterative: OK answer fast,
 Faster than the SVD                  refine as long as desired
    Trivially parallel, iterative    Yields to “binary” input model
 Dumber than the SVD                    Ratings as regularization
                                           instead
    No singular values,
                                         Sparseness / 0s no longer a
      orthonormal basis
                                           problem
ALS ALGORITHM 1
 Input: (user, item, strength)      1   4   3
  tuples
                                             3
     Anything you can quantify is
       input                             4       3   2
     Strength is positive           5       2       3
 Many tuples per user-item                      5
 R is sparse user-item              2   4               R
  interaction matrix
 rij = total strength of
  interaction between user i
  and item j
ALS ALGORITHM 2
 Follow “Collaborative                    1   1   1   0   0
  Filtering for Implicit
                                           0   0   1   0   0
  Feedback Datasets”
  www2.research.att.com/~yifanhu/PUB/cf.   0   1   0   1   1
  pdf
                                           1   0   1   0   1
 Construct “binary” matrix P
                                           0   0   0   1   0
    1 where R > 0
                                           1   1   0   0   0   P
    0 where R = 0

 Factor P, not R
    R returns in regularization

 Still sparse; implicit 0s fine
ALS ALGORITHM 3
 P is m x n
 Choose k << m, n
 Factor P as Q = X YT, Q ≈ P
    X is m x k ; YT is k x n               YT
 Find best approximation Q
    Minimize L2 norm of diff: || P-Q   X
      ||2
    Minimal squared error:
      “Least Squares”
 Recommendations are
  largest values in Q
ALS ALGORITHM 4
 Optimizing X, Y
  simultaneously is non-
  convex, hard
 If X or Y are fixed, system of
                                       YT
  linear equations:
  convex, easy
 Initialize Y with random         X
  values
 Solve for X
 Fix X, solve for Y
 Repeat (“Alternating”)
ALS ALGORITHM 5
 Define regularization weights cui = 1 + α rui
 Minimize:

  Σ cui(pui – xuTyi)2 + λ(Σ||xu||2 + Σ||yi||2)

 Simple least-squares regression objective, plus
    Weighted least-squared error terms by strength,
      a penalty for not reconstructing 1 at “strong” association is higher
    Standard L2 regularization term
ALS ALGORITHM 6
 With fixed Y, compute optimal X
 Each row xu is independent
 Define Cu as diagonal matrix of cu (user strength weights)
 xu = (YTCuY + λI)-1 YTCupu
 Compare to simple least-squares regression solution (YTY)-1 YTpu
    Adds Tikhonov / ridge regression regularization term λI
    Attaches cu weights to YT

 See paper for how YTCuY is computed efficiently;
  skipping the engineering!
EXAMPLE FACTORIZATION
 k = 3, λ = 2, α = 40, 10 iterations

                                        0.96   0.99   0.99    0.38    0.93
         1    1   1   0   0
                                        0.44   0.39   0.98    -0.11   0.39
         0    0   1   0   0

                               ≈
                                        0.70   0.99   0.42    0.98    0.98
         0    1   0   1   1
         1    0   1   0   1             1.00   1.04   0.99    0.44    0.98   Q = X•YT
                                        0.11   0.51   -0.13   1.00    0.57
         0    0   0   1   0
                                        0.97   1.00   0.68    0.47    0.91
         1    1   0   0   0
FOLD-IN
 Need immediate, if               Note (YTY)(YTY)-1 = I
  approximate, updates for         Gives YT’s right inverse:
  new data                          YT (Y(YTY)-1) = I
 New user u needs new row         Xu = Qu Y(YTY)-1
  Qu = Xu YT
                                   Xu ≈ Pu Y(YTY)-1
 We have Pu ≈ Qu
                                   Recommend as usual:
 Compute Xu via right inverse:     Qu = XuYT
  X YT(YT)-1 = Q(YT)-1 so:
                                   For existing user, instead
  X = Q(YT)-1
                                    add to existing row Xu
 What is   (YT)-1?
THIS IS MYRRIX
 Soft-launched
 Serving Layer available
  as open source download
 Computation Layer available
  as beta
 Ready on Amazon EC2 / EMR
                                srowen@myrrix.com
 Full launch Q4 2012
 myrrix.com
APPENDIX
EXAMPLES

STACKOVERFLOW TAGS               WIKIPEDIA LINKS
 Recommend tags to               Recommend new linked
  questions                        articles from existing links
 Tag questions automatically,    Propose missing, related
  improve tag coverage             links
 3.5M questions x 30K tags       2.5M articles x 1.8M articles
 4.3 hours x 5 machines on       28 hours x 2 PCs on
  Amazon EMR                       Apache Hadoop 1.0.3
 $3.03 ≈ $0.08 per 100,000
  recs

More Related Content

What's hot

Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at Spotify
Erik Bernhardsson
 
Homepage Personalization at Spotify
Homepage Personalization at SpotifyHomepage Personalization at Spotify
Homepage Personalization at Spotify
Oguz Semerci
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with Spark
Chris Johnson
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
Erik Bernhardsson
 
Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)
Mounia Lalmas-Roelleke
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
Ching-Wei Chen
 
Recommender Systems from A to Z – Model Evaluation
Recommender Systems from A to Z – Model EvaluationRecommender Systems from A to Z – Model Evaluation
Recommender Systems from A to Z – Model Evaluation
Crossing Minds
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experience
Mounia Lalmas-Roelleke
 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender Systems
Benjamin Le
 
Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014
Erik Bernhardsson
 
Recommender Systems from A to Z – Model Training
Recommender Systems from A to Z – Model TrainingRecommender Systems from A to Z – Model Training
Recommender Systems from A to Z – Model Training
Crossing Minds
 
Personalizing Session-based Recommendations with Hierarchical Recurrent Neura...
Personalizing Session-based Recommendations with Hierarchical Recurrent Neura...Personalizing Session-based Recommendations with Hierarchical Recurrent Neura...
Personalizing Session-based Recommendations with Hierarchical Recurrent Neura...
Massimo Quadrana
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
Vidhya Murali
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Sujit Pal
 
Personalization at Netflix - Making Stories Travel
Personalization at Netflix -  Making Stories Travel Personalization at Netflix -  Making Stories Travel
Personalization at Netflix - Making Stories Travel
Sudeep Das, Ph.D.
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019
Faisal Siddiqi
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
Xiang Fu
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
Chris Johnson
 
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems -  ACM RecSys 2013 tutorialLearning to Rank for Recommender Systems -  ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Alexandros Karatzoglou
 

What's hot (20)

Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at Spotify
 
Homepage Personalization at Spotify
Homepage Personalization at SpotifyHomepage Personalization at Spotify
Homepage Personalization at Spotify
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with Spark
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
 
Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
 
Recommender Systems from A to Z – Model Evaluation
Recommender Systems from A to Z – Model EvaluationRecommender Systems from A to Z – Model Evaluation
Recommender Systems from A to Z – Model Evaluation
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experience
 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender Systems
 
Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014
 
Recommender Systems from A to Z – Model Training
Recommender Systems from A to Z – Model TrainingRecommender Systems from A to Z – Model Training
Recommender Systems from A to Z – Model Training
 
Personalizing Session-based Recommendations with Hierarchical Recurrent Neura...
Personalizing Session-based Recommendations with Hierarchical Recurrent Neura...Personalizing Session-based Recommendations with Hierarchical Recurrent Neura...
Personalizing Session-based Recommendations with Hierarchical Recurrent Neura...
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search Guild
 
Personalization at Netflix - Making Stories Travel
Personalization at Netflix -  Making Stories Travel Personalization at Netflix -  Making Stories Travel
Personalization at Netflix - Making Stories Travel
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
 
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems -  ACM RecSys 2013 tutorialLearning to Rank for Recommender Systems -  ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
 

Similar to Big Practical Recommendations with Alternating Least Squares

Techniques in Deep Learning
Techniques in Deep LearningTechniques in Deep Learning
Techniques in Deep Learning
Sourya Dey
 
MLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackMLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic track
arogozhnikov
 
Matrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender SystemsMatrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender Systems
Dmitriy Selivanov
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2
arogozhnikov
 
MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1
arogozhnikov
 
Machine Learning - Regression model
Machine Learning - Regression modelMachine Learning - Regression model
Machine Learning - Regression model
RADO7900
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习Adaboost
Shocky1
 
Relaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete MarketsRelaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete Marketsguasoni
 
H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14
Sri Ambati
 
2014.10.dartmouth
2014.10.dartmouth2014.10.dartmouth
2014.10.dartmouth
Qiqi Wang
 
Machine Learning 1
Machine Learning 1Machine Learning 1
Machine Learning 1
cairo university
 
Optimization Techniques.pdf
Optimization Techniques.pdfOptimization Techniques.pdf
Optimization Techniques.pdf
anandsimple
 
Dominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective OptimizationDominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective OptimizationIlya Loshchilov
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in Mahout
Data Science London
 
opt_slides_ump.pdf
opt_slides_ump.pdfopt_slides_ump.pdf
opt_slides_ump.pdf
VikashKumarRoy3
 
Introduction of Quantum Annealing and D-Wave Machines
Introduction of Quantum Annealing and D-Wave MachinesIntroduction of Quantum Annealing and D-Wave Machines
Introduction of Quantum Annealing and D-Wave Machines
Arithmer Inc.
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
Yogendra Singh
 
Partial Derivatives.pdf
Partial Derivatives.pdfPartial Derivatives.pdf
Partial Derivatives.pdf
HrushikeshDandu
 
Lec1 01
Lec1 01Lec1 01
MLHEP 2015: Introductory Lecture #4
MLHEP 2015: Introductory Lecture #4MLHEP 2015: Introductory Lecture #4
MLHEP 2015: Introductory Lecture #4
arogozhnikov
 

Similar to Big Practical Recommendations with Alternating Least Squares (20)

Techniques in Deep Learning
Techniques in Deep LearningTechniques in Deep Learning
Techniques in Deep Learning
 
MLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackMLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic track
 
Matrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender SystemsMatrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender Systems
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2
 
MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1
 
Machine Learning - Regression model
Machine Learning - Regression modelMachine Learning - Regression model
Machine Learning - Regression model
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习Adaboost
 
Relaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete MarketsRelaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete Markets
 
H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14
 
2014.10.dartmouth
2014.10.dartmouth2014.10.dartmouth
2014.10.dartmouth
 
Machine Learning 1
Machine Learning 1Machine Learning 1
Machine Learning 1
 
Optimization Techniques.pdf
Optimization Techniques.pdfOptimization Techniques.pdf
Optimization Techniques.pdf
 
Dominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective OptimizationDominance-Based Pareto-Surrogate for Multi-Objective Optimization
Dominance-Based Pareto-Surrogate for Multi-Objective Optimization
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in Mahout
 
opt_slides_ump.pdf
opt_slides_ump.pdfopt_slides_ump.pdf
opt_slides_ump.pdf
 
Introduction of Quantum Annealing and D-Wave Machines
Introduction of Quantum Annealing and D-Wave MachinesIntroduction of Quantum Annealing and D-Wave Machines
Introduction of Quantum Annealing and D-Wave Machines
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
 
Partial Derivatives.pdf
Partial Derivatives.pdfPartial Derivatives.pdf
Partial Derivatives.pdf
 
Lec1 01
Lec1 01Lec1 01
Lec1 01
 
MLHEP 2015: Introductory Lecture #4
MLHEP 2015: Introductory Lecture #4MLHEP 2015: Introductory Lecture #4
MLHEP 2015: Introductory Lecture #4
 

More from Data Science London

Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Data Science London
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Data Science London
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
Data Science London
 
Nowcasting Business Performance
Nowcasting Business PerformanceNowcasting Business Performance
Nowcasting Business Performance
Data Science London
 
Numpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunchingNumpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunching
Data Science London
 
Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)Data Science London
 
Bringing back the excitement to data analysis
Bringing back the excitement to data analysisBringing back the excitement to data analysis
Bringing back the excitement to data analysis
Data Science London
 
ACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, TodayACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, Today
Data Science London
 
Beyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems DesignBeyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems Design
Data Science London
 
Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?
Data Science London
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
Data Science London
 
Data Science for Live Music
Data Science for Live MusicData Science for Live Music
Data Science for Live Music
Data Science London
 
Research at last.fm
Research at last.fmResearch at last.fm
Research at last.fm
Data Science London
 
Music and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music IndustryMusic and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music Industry
Data Science London
 
Scientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutScientific Article Recommendations with Mahout
Scientific Article Recommendations with Mahout
Data Science London
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapR
Data Science London
 
Going Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook UsersGoing Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook Users
Data Science London
 
Practical Magic with Incanter
Practical Magic with IncanterPractical Magic with Incanter
Practical Magic with Incanter
Data Science London
 
Investigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists ToolboxInvestigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists Toolbox
Data Science London
 

More from Data Science London (20)

Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
 
Nowcasting Business Performance
Nowcasting Business PerformanceNowcasting Business Performance
Nowcasting Business Performance
 
Numpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunchingNumpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunching
 
Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)
 
Bringing back the excitement to data analysis
Bringing back the excitement to data analysisBringing back the excitement to data analysis
Bringing back the excitement to data analysis
 
Survival Analysis of Web Users
Survival Analysis of Web UsersSurvival Analysis of Web Users
Survival Analysis of Web Users
 
ACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, TodayACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, Today
 
Beyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems DesignBeyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems Design
 
Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
 
Data Science for Live Music
Data Science for Live MusicData Science for Live Music
Data Science for Live Music
 
Research at last.fm
Research at last.fmResearch at last.fm
Research at last.fm
 
Music and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music IndustryMusic and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music Industry
 
Scientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutScientific Article Recommendations with Mahout
Scientific Article Recommendations with Mahout
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapR
 
Going Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook UsersGoing Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook Users
 
Practical Magic with Incanter
Practical Magic with IncanterPractical Magic with Incanter
Practical Magic with Incanter
 
Investigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists ToolboxInvestigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists Toolbox
 

Recently uploaded

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 

Recently uploaded (20)

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 

Big Practical Recommendations with Alternating Least Squares

  • 1. Big, Practical Recommendations with Alternating Least Squares Sean Owen • Apache Mahout / Myrrix.com
  • 2. WHERE’S BIG LEARNING?  Next: Application Layer  Analytics  Machine Learning Applications  Like Apache Mahout  Common Big Data app today Processing  Clustering, recommenders, classifiers on Hadoop Database  Free, open source; not mature  Where’s commercialized Storage Big Learning?
  • 3. A RECOMMENDER SHOULD …  Answer in Real-time  Accept Diverse Input  Ingest new data, now  Not just people and products  Modify recommendations based  Not just explicit ratings on newest data  Clicks, views, buys  No “cold start” for new data  Side information  Scale Horizontally  Be “Pretty Accurate”  For queries per second  For size of data set
  • 4. NEED: 2-TIER ARCHITECTURE  Real-time Serving Layer  Quick results based on precomputed model  Incremental update  Partitionable for scale  Batch Computation Layer  Builds model  Scales out (on Hadoop?)  Asynchronous, occasional, long-lived runs
  • 5. A PRACTICAL ALGORITHM MATRIX FACTORIZATION BENEFITS  Factor user-item matrix to  Models intuition user-feature + feature-item  Factorization is batch matrix parallelizable  Well understood in ML, as:  Reconstruction (recs) in  Principal Component Analysis low-dimension is fast  Latent Semantic Indexing  Allows projection of new data  Several algorithms, like:  Cold start solution  Singular Value Decomposition  Approximate update solution  Alternating Least Squares
  • 6. A PRACTICAL IMPLEMENTATION ALTERNATING LEAST SQUARES BENEFITS  Simple factorization P ≈ X YT  Parallelizable by row --  Approximate: X, Y are very Hadoop-friendly “skinny” (low-rank)  Iterative: OK answer fast,  Faster than the SVD refine as long as desired  Trivially parallel, iterative  Yields to “binary” input model  Dumber than the SVD  Ratings as regularization instead  No singular values,  Sparseness / 0s no longer a orthonormal basis problem
  • 7. ALS ALGORITHM 1  Input: (user, item, strength) 1 4 3 tuples 3  Anything you can quantify is input 4 3 2  Strength is positive 5 2 3  Many tuples per user-item 5  R is sparse user-item 2 4 R interaction matrix  rij = total strength of interaction between user i and item j
  • 8. ALS ALGORITHM 2  Follow “Collaborative 1 1 1 0 0 Filtering for Implicit 0 0 1 0 0 Feedback Datasets” www2.research.att.com/~yifanhu/PUB/cf. 0 1 0 1 1 pdf 1 0 1 0 1  Construct “binary” matrix P 0 0 0 1 0  1 where R > 0 1 1 0 0 0 P  0 where R = 0  Factor P, not R  R returns in regularization  Still sparse; implicit 0s fine
  • 9. ALS ALGORITHM 3  P is m x n  Choose k << m, n  Factor P as Q = X YT, Q ≈ P  X is m x k ; YT is k x n YT  Find best approximation Q  Minimize L2 norm of diff: || P-Q X ||2  Minimal squared error: “Least Squares”  Recommendations are largest values in Q
  • 10. ALS ALGORITHM 4  Optimizing X, Y simultaneously is non- convex, hard  If X or Y are fixed, system of YT linear equations: convex, easy  Initialize Y with random X values  Solve for X  Fix X, solve for Y  Repeat (“Alternating”)
  • 11. ALS ALGORITHM 5  Define regularization weights cui = 1 + α rui  Minimize: Σ cui(pui – xuTyi)2 + λ(Σ||xu||2 + Σ||yi||2)  Simple least-squares regression objective, plus  Weighted least-squared error terms by strength, a penalty for not reconstructing 1 at “strong” association is higher  Standard L2 regularization term
  • 12. ALS ALGORITHM 6  With fixed Y, compute optimal X  Each row xu is independent  Define Cu as diagonal matrix of cu (user strength weights)  xu = (YTCuY + λI)-1 YTCupu  Compare to simple least-squares regression solution (YTY)-1 YTpu  Adds Tikhonov / ridge regression regularization term λI  Attaches cu weights to YT  See paper for how YTCuY is computed efficiently; skipping the engineering!
  • 13. EXAMPLE FACTORIZATION  k = 3, λ = 2, α = 40, 10 iterations 0.96 0.99 0.99 0.38 0.93 1 1 1 0 0 0.44 0.39 0.98 -0.11 0.39 0 0 1 0 0 ≈ 0.70 0.99 0.42 0.98 0.98 0 1 0 1 1 1 0 1 0 1 1.00 1.04 0.99 0.44 0.98 Q = X•YT 0.11 0.51 -0.13 1.00 0.57 0 0 0 1 0 0.97 1.00 0.68 0.47 0.91 1 1 0 0 0
  • 14. FOLD-IN  Need immediate, if  Note (YTY)(YTY)-1 = I approximate, updates for  Gives YT’s right inverse: new data YT (Y(YTY)-1) = I  New user u needs new row  Xu = Qu Y(YTY)-1 Qu = Xu YT  Xu ≈ Pu Y(YTY)-1  We have Pu ≈ Qu  Recommend as usual:  Compute Xu via right inverse: Qu = XuYT X YT(YT)-1 = Q(YT)-1 so:  For existing user, instead X = Q(YT)-1 add to existing row Xu  What is (YT)-1?
  • 15. THIS IS MYRRIX  Soft-launched  Serving Layer available as open source download  Computation Layer available as beta  Ready on Amazon EC2 / EMR srowen@myrrix.com  Full launch Q4 2012  myrrix.com
  • 17. EXAMPLES STACKOVERFLOW TAGS WIKIPEDIA LINKS  Recommend tags to  Recommend new linked questions articles from existing links  Tag questions automatically,  Propose missing, related improve tag coverage links  3.5M questions x 30K tags  2.5M articles x 1.8M articles  4.3 hours x 5 machines on  28 hours x 2 PCs on Amazon EMR Apache Hadoop 1.0.3  $3.03 ≈ $0.08 per 100,000 recs