Alex Lin
Senior Architect
Intelligent Mining
  Predictive modeling methodology
  k-Nearest Neighbor (kNN) algorithm
  Singular value decomposition (SVD)
   method for dimensionality reduction
  Using a synthetic data set to test and
   improve your model
  Experiment and results

The Business Problem
  Design a product recommender solution that will
   increase revenue.

How Do We Increase Revenue?

  Revenue -> Increase Avg. Order Value -> Increase Unit Price
                                       -> Units / Order

  Is this recommendation effective?
    Unit Price
    Units / Order

What am I
going to do?

Predictive Model Framework

  Data        ->  Features       ->  ML Algorithm      ->  Prediction Output
  What data?      What feature?      Which algorithm?      Cross-sell & Up-sell

What Data to Use?
  Explicit   data
       Ratings
       Comments
  Implicit   data
       Order history / Return history
       Cart events
       Page views
       Click-thru
       Search log
  In today’s talk we only use Order history and Cart events

Predictive Model

  Data           ->  Features       ->  ML Algorithm      ->  Prediction Output
  Order History      What feature?      Which algorithm?      Cross-sell & Up-sell
  Cart Events                                                 Recommendation

What Features to Use?
  We know that a given product tends to get
   purchased by customers with similar tastes or
  Use user engagement data to describe a product.

                1   2   3     4   5   6     7   8   9   10    …   n

           17   1       .25           .25       1       .25

                         user engagement vector

Data Representation / Features
  When we merge every item’s user engagement
   vector, we get an m x n item-user matrix
              1   2     3     4     5   6   7     8     9   10    …   n

          1   1         .25             1                   .25

          2                                 .25

          3   1               .25                 1
          4       .25               1             .25   1

                        1                   1
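
A minimal sketch of how the item-user matrix could be filled from raw engagement events. The `Event` struct, its field names, and the function name are hypothetical, and the weights (1 for an order, 0.25 for a cart event) are an assumption read off the example values on the slides. Ids are assumed to be 0-based.

```c
#include <stddef.h>

/* One engagement event. Field names are hypothetical; weight is the
 * engagement value (assumed: 1 for an order, 0.25 for a cart event). */
typedef struct {
    int user_id;
    int item_id;
    double weight;
} Event;

/* Fill a dense, row-major m x n item-user matrix from a list of events.
 * Row = item, column = user. Out-of-range ids are skipped. When a user
 * has several events for the same item, the largest weight is kept. */
void build_item_user_matrix(const Event *events, size_t count,
                            double *matrix, int m, int n)
{
    for (size_t c = 0; c < (size_t)m * (size_t)n; c++) {
        matrix[c] = 0.0;                 /* no engagement recorded */
    }
    for (size_t e = 0; e < count; e++) {
        int i = events[e].item_id;
        int u = events[e].user_id;
        if (i < 0 || i >= m || u < 0 || u >= n) {
            continue;                    /* skip out-of-range ids */
        }
        double *cell = &matrix[(size_t)i * n + u];
        if (events[e].weight > *cell) {
            *cell = events[e].weight;    /* keep the strongest signal */
        }
    }
}
```

A dense array is only practical for small examples; at the 100K x 2M scale discussed later, a sparse representation would be needed.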


Data Normalization
  Ensure the magnitudes of the entries in the
   dataset matrix are appropriate

  [Figure: item-user matrix (items as rows, users 1..n as columns) with normalized entry values]

  Remove column average, so frequent buyers
   don’t dominate the model

Data Normalization
  Different engagement data points (Order / Cart /
   Page View) should have different weights
  Common normalization strategies:
      Remove column average
      Remove row average
      Remove global mean
      Z-score
      Fill-in the null values
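
As an illustration of the first strategy, here is a minimal sketch of column-average removal on a small dense matrix. The function name is hypothetical, and averaging over all rows (zeros included) is an assumption; averaging over known entries only is an equally plausible reading of the slide.

```c
#include <stddef.h>

/* Subtract each column's average from every entry in that column.
 * matrix is dense, row-major, m rows (items) by n columns (users).
 * Assumption: the average is taken over all m rows, zeros included. */
void remove_column_average(double *matrix, int m, int n)
{
    for (int u = 0; u < n; u++) {
        double sum = 0.0;
        for (int i = 0; i < m; i++) {
            sum += matrix[(size_t)i * n + u];
        }
        double avg = (m > 0) ? sum / m : 0.0;
        for (int i = 0; i < m; i++) {
            matrix[(size_t)i * n + u] -= avg;
        }
    }
}
```

Since each column is one user, a user who engages with many items has a high column average, and subtracting it shrinks that user's influence on every item vector.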

Predictive Model

  Data           ->  Features                ->  ML Algorithm      ->  Prediction Output
  Order History      User engagement vector      Which algorithm?      Cross-sell & Up-sell
  Cart Events        Data Normalization                                Recommendation

Which Algorithm?
  How do we find the items that have similar user
   engagement data?
               1   2   3     4   5      6   7   8     9   10    …   n

          1    1       .25              1                 1
          2                                 1

          17   1             1          1       .25       .25

          18       1             .25    1       1     1




  We can find the items that have similar user
   engagement vectors with the kNN algorithm
k-Nearest Neighbor (kNN)
  Find the k items that have the most similar user
   engagement vectors
              1     2   3    4   5     6   7    8   9    10   …   n

          1   .5        1              1                 1

          2         1                      .5            1

          3   1              1                  1   1

          4         1            .5        1        1

                        .5                 1

          m                  1                      .5

  Nearest Neighbors of Item 4 = [2,3,1]
Similarity Measure for kNN
                   1    2     3        4        5          6            7         8       9   10   …    n

               2        1                                           .5                        1

               4        1                       .5                      1                 1

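    Jaccard coefficient:
      \[ \mathrm{sim}(a,b) = \frac{(1+1)}{(1+1+1) + (1+1+1+1) - (1+1)} \]

    Cosine similarity:
      \[ \mathrm{sim}(a,b) = \cos(a,b) = \frac{a \cdot b}{\|a\|_2 \, \|b\|_2}
         = \frac{(1 \cdot 1 + 0.5 \cdot 1)}{\sqrt{1^2 + 0.5^2 + 1^2} \, \sqrt{1^2 + 0.5^2 + 1^2 + 1^2}} \]

    Pearson correlation:
      \[ \mathrm{corr}(a,b) = \frac{\sum_i (r_{ai} - \bar{r}_a)(r_{bi} - \bar{r}_b)}{\sqrt{\sum_i (r_{ai} - \bar{r}_a)^2} \, \sqrt{\sum_i (r_{bi} - \bar{r}_b)^2}}
         = \frac{m \sum a_i b_i - \sum a_i \sum b_i}{\sqrt{m \sum a_i^2 - (\sum a_i)^2} \, \sqrt{m \sum b_i^2 - (\sum b_i)^2}} \]

      \[ \mathrm{corr}(a,b) = \frac{\mathrm{match\_cols} \cdot \mathrm{Dotprod}(a,b) - \mathrm{sum}(a) \cdot \mathrm{sum}(b)}{\sqrt{\mathrm{match\_cols} \cdot \mathrm{sum}(a^2) - (\mathrm{sum}(a))^2} \, \sqrt{\mathrm{match\_cols} \cdot \mathrm{sum}(b^2) - (\mathrm{sum}(b))^2}} \]
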
k-Nearest Neighbor (kNN)
  [Figure: items plotted as points in space; similarity measure: cosine similarity]

  kNN, k = 5
  Nearest Neighbors(8) = [9,6,3,1,2]
Predictive Model
  Ver. 1: kNN

  Data           ->  Features                ->  ML Algorithm               ->  Prediction Output
  Order History      User engagement vector      k-Nearest Neighbor (kNN)       Cross-sell & Up-sell
  Cart Events        Data Normalization                                         Recommendation

Cosine Similarity – Code fragment
 #include <math.h>
 #include <stdio.h>

 long i_cnt = 100000;   // number of items: 100K
 long u_cnt = 2000000;  // number of users: 2M
 double **data;         // 100K by 2M dataset matrix (in reality, it needs malloc allocation)
 double *norm;          // one norm per item, i_cnt entries (also malloc-allocated)
 long i, j, f;
 double dot_product;

 // assume data matrix is loaded
 // calculate vector norm for each user engagement vector
 for (i = 0; i < i_cnt; i++) {
     norm[i] = 0;
     for (f = 0; f < u_cnt; f++) {
         norm[i] += data[i][f] * data[i][f];
     }
     norm[i] = sqrt(norm[i]);
 }

 // cosine similarity calculation (assumes no all-zero vectors, so norm > 0)
 for (i = 0; i < i_cnt; i++) {          // loop thru 100K
     for (j = 0; j < i_cnt; j++) {      // loop thru 100K
         dot_product = 0;
         for (f = 0; f < u_cnt; f++) {  // loop thru entire user space 2M
             dot_product += data[i][f] * data[j][f];
         }
         printf("%ld %ld %lf\n", i, j, dot_product / (norm[i] * norm[j]));
     }
 }
 // find the Top K nearest neighbors here

 1. 100K rows x 100K rows x 2M features --> scalability problem
    kd-tree, locality sensitive hashing,
    MapReduce/Hadoop, multicore/threading, stream processors
 2. data[i] is high-dimensional and sparse, similarity measures
    are not reliable --> accuracy problem

 This leads us to SVD dimensionality reduction!

Singular Value Decomposition
  \[ A = U \times S \times V^T \]

  A:    m x n matrix (columns = users)
  U:    m x r matrix
  S:    r x r matrix
  V^T:  r x n matrix (columns = users)

  rank = k:
  \[ A_k = U_k \times S_k \times V_k^T \]

  Low rank approx. item profile is       \( U_k \sqrt{S_k} \)
  Low rank approx. user profile is       \( \sqrt{S_k} \, V_k^T \)
  Low rank approx. item-user matrix is   \( U_k \sqrt{S_k} \cdot \sqrt{S_k} \, V_k^T \)


Reduced SVD
  \[ A_k = U_k \times S_k \times V_k^T \]

  A_k:    100K x 2M matrix
  U_k:    100K x 3 matrix
  S_k:    3 x 3 matrix
  V_k^T:  3 x 2M matrix (columns = users)

  rank = 3, singular values on the diagonal of S_k:

         7   0   0
  S_k =  0   3   0
         0   0   1

  Low rank approx. item profile is   \( U_k \sqrt{S_k} \)
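
To make the rank-k product concrete, here is a minimal sketch that rebuilds A_k = U_k x S_k x V_k^T from factors that are assumed to be already computed by an SVD library. The function name and the dense row-major layout are hypothetical choices for illustration.

```c
#include <stddef.h>

/* Rank-k approximation A_k = U_k * S_k * V_k^T.
 * U:   m x r, row-major (columns are left singular vectors)
 * s:   r singular values, largest first (the diagonal of S)
 * Vt:  r x n, row-major (rows are right singular vectors)
 * out: m x n, row-major; only the first k factors are used (k <= r). */
void low_rank_approx(const double *U, const double *s, const double *Vt,
                     int m, int n, int r, int k, double *out)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int f = 0; f < k; f++) {
                sum += U[(size_t)i * r + f] * s[f] * Vt[(size_t)f * n + j];
            }
            out[(size_t)i * n + j] = sum;
        }
    }
}
```

With the singular values 7, 3, 1 from the slide and k = 2, the factor with singular value 1 is dropped, which is the part the approximation treats as least significant.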


SVD Factor Interpretation
  S (3 x 3 matrix), singular values on the diagonal:

    7   0   0
    0   3   0
    0   0   1

  [Figure: singular values plot (rank = 512). Left: more significant, latent factors. Right: less significant, noises + others.]

SVD Dimensionality Reduction

  [Figure: \( U_k \sqrt{S_k} \); axis labels: latent factors, # of users]

  Need to find the optimal low rank!

Missing values

    Difference between “0” and “unknown”
    Missing values do NOT appear randomly.
    Value = (Preference Factors) + (Availability) – (Purchased
     elsewhere) – (Navigation inefficiency) – etc.
    Approx. Value = (Preference Factors) +/- (Noise)
    Modeling missing values correctly will help us make good
     recommendations, especially when working with an extremely
     sparse data set

Singular Value Decomposition
  Use SVD to reduce dimensionality, so neighborhood
   formation happens in reduced user space
  SVD helps the model find a low rank approximation of the
   dataset matrix while retaining the critical latent factors
   and ignoring noise.
  Optimal low rank needs to be tuned
  SVD is computationally expensive

    SVD Libraries:
         Matlab: [U, S, V] = svds(A, 256);
         SVDPACKC
         SVDLIBC
         GHAPACK

Predictive Model
  Ver. 2: SVD+kNN

  Data           ->  Features                ->  ML Algorithm                ->  Prediction Output
  Order History      User engagement vector      k-Nearest Neighbors (kNN)       Cross-sell & Up-sell
  Cart Events        Data Normalization          in reduced user space           Recommendation


Synthetic Data Set
  Why do we use a synthetic data set?

  So we can test our new model in a controlled environment

Synthetic Data Set
  Synthetic e-commerce data set with 16 latent factors
      Dimension: 1,000 (items) by 20,000 (users)
      16 user preference factors
      16 item property factors (non-negative)
      Txn Set: n = 55,360 sparsity = 99.72 %
      Txn+Cart Set: n = 192,985 sparsity = 99.03%
      Download:

       user_id   item_id   type
       10        42        0.25
       10        997       0.25
       10        950       0.25
       11        836       0.25
       11        225       1
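
The sparsity figures above can be checked with a short calculation, assuming n counts the non-empty (item, user) entries of the 1,000 x 20,000 matrix:

```latex
\[ \text{sparsity} = 1 - \frac{n}{\text{items} \times \text{users}} \]
\[ \text{Txn set: } 1 - \frac{55{,}360}{1{,}000 \times 20{,}000} = 1 - 0.002768 \approx 99.72\% \]
\[ \text{Txn+Cart set: } 1 - \frac{192{,}985}{1{,}000 \times 20{,}000} = 1 - 0.00964925 \approx 99.035\% \]
```

The second value matches the 99.03% on the slide when truncated to two decimals.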
Synthetic Data Set
  Item property factors  x  User preference factors  =  Purchase likelihood score
  (1K x 16 matrix)          (16 x 20K matrix)            (1K x 20K matrix)

                     X11   X12   X13   X14   X15   X16
                     X21   X22   X23   X24   X25   X26
                     X31   X32   X33   X34   X35   X36
                     X41   X42   X43   X44   X45   X46
                     X51   X52   X53   X54   X55   X56

  (a, b, c) = row 3 of the item property factors
  (x, y, z) = column 2 of the user preference factors

  X32 = (a, b, c) . (x, y, z) = a * x + b * y + c * z

  X32 = Likelihood of Item 3 being purchased by User 2
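
The score matrix is a plain matrix product of the two factor matrices. A minimal sketch, with a hypothetical function name and dense row-major arrays (the real data set would be 1K x 16 times 16 x 20K):

```c
#include <stddef.h>

/* X = P * Q, where P is the m x k item property matrix and Q is the
 * k x n user preference matrix (both dense, row-major).
 * X[i * n + u] is the purchase likelihood score of item i for user u,
 * i.e. the dot product of item i's factors and user u's factors. */
void purchase_likelihood(const double *P, const double *Q,
                         int m, int k, int n, double *X)
{
    for (int i = 0; i < m; i++) {
        for (int u = 0; u < n; u++) {
            double score = 0.0;
            for (int f = 0; f < k; f++) {
                score += P[(size_t)i * k + f] * Q[(size_t)f * n + u];
            }
            X[(size_t)i * n + u] = score;
        }
    }
}
```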
Synthetic Data Set
X11                        X31                                   X51
                                 Based on the distribution,
                                 pre-determine # of items
X21                        X41   purchased by a user             X41
                                 (# of item=2)
X31     Sort by Purchase   X21   From the top, select and skip
        likelihood Score         certain items to create data
X41                        X51   sparsity.                       X21

X51                        X11                                   X11

      User 1 purchased Item 4 and Item 1

Experiment Setup
  Each model (Random / kNN / SVD+kNN) will
   generate top 20 recommendations for each item.
  Compare model output to the actual top 20
   provided by synthetic data set
  Evaluation Metrics:
      Precision %: Overlap of the top 20 between model
       output and actual (higher is better)

       \[ \mathrm{Precision} = \frac{|\{\mathrm{Found\_Top20\_items}\} \cap \{\mathrm{Actual\_Top20\_items}\}|}{|\{\mathrm{Found\_Top20\_items}\}|} \]

      Quality metric: Average of the actual ranking in the
       model output (lower is better)

       [Figure: two example rows of rankings: 1, 2, 30, 47, 50, 21 and 1, 2, 368, 62, 900, 510]
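
Both metrics are simple to compute once the model output and the actual top list are available as arrays of item ids or ranks. A minimal sketch with hypothetical function names:

```c
/* Precision: share of the model's top-N items that also appear in the
 * actual top-N list. Item ids within each list are assumed unique. */
double precision_at_n(const int *found, int n_found,
                      const int *actual, int n_actual)
{
    if (n_found <= 0) {
        return 0.0;
    }
    int hits = 0;
    for (int i = 0; i < n_found; i++) {
        for (int j = 0; j < n_actual; j++) {
            if (found[i] == actual[j]) {
                hits++;
                break;
            }
        }
    }
    return (double)hits / n_found;
}

/* Quality: average of the actual rank of each item in the model output
 * (lower is better). actual_rank[i] is the true rank of the i-th
 * recommended item. */
double quality(const int *actual_rank, int n)
{
    if (n <= 0) {
        return 0.0;
    }
    long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += actual_rank[i];
    }
    return (double)sum / n;
}
```

If the two rows of numbers on the slide are read as actual ranks of recommended items, the first row averages 151 / 6, about 25.2, and the second 1843 / 6, about 307.2, so the first output would be the better one.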
Experimental Result
  kNN vs. Random (Control)
  [Charts: Precision % (higher is better); Quality (lower is better)]

Experimental Result
  Precision % of SVD+kNN
  [Chart: Recall % (higher is better) vs. SVD Rank]

Experimental Result
  Quality of SVD+kNN
  [Chart: Quality (lower is better) vs. SVD Rank]

Experimental Result
  The effect of using Cart data
  [Chart: Precision % (higher is better) vs. SVD Rank]

Experimental Result
  The effect of using Cart data
  [Chart: Quality (lower is better) vs. SVD Rank]

  Predictive modeling methodology
  k-Nearest Neighbor (kNN) algorithm
  Singular value decomposition (SVD)
   method for dimensionality reduction
  Using a synthetic data set to test and
   improve your model
  Experiment and results


Thank you
  Any questions or comments?


Vijayananda Mohire
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
Egyptian Engineers Association

Similar to Building a Predictive Model (20)

2013 01-23 when analytics projects go wrong
2013 01-23 when analytics projects go wrong2013 01-23 when analytics projects go wrong
2013 01-23 when analytics projects go wrong
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop
Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...
Agile Workshop: Agile Metrics
Agile Workshop: Agile MetricsAgile Workshop: Agile Metrics
Agile Workshop: Agile Metrics
Scikit Learn: Data Normalization Techniques That Work
Scikit Learn: Data Normalization Techniques That WorkScikit Learn: Data Normalization Techniques That Work
Scikit Learn: Data Normalization Techniques That Work
Know How to Create and Visualize a Decision Tree with Python.pdf
Know How to Create and Visualize a Decision Tree with Python.pdfKnow How to Create and Visualize a Decision Tree with Python.pdf
Know How to Create and Visualize a Decision Tree with Python.pdf
E miner
E minerE miner
E miner
Citizen Data Science Training using KNIME
Citizen Data Science Training using KNIMECitizen Data Science Training using KNIME
Citizen Data Science Training using KNIME
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Fast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA HardwareFast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA Hardware
IBM Cognos 10 Framework Manager Metadata Modeling: Tips and Tricks
IBM Cognos 10 Framework Manager Metadata Modeling: Tips and TricksIBM Cognos 10 Framework Manager Metadata Modeling: Tips and Tricks
IBM Cognos 10 Framework Manager Metadata Modeling: Tips and Tricks
Z suzanne van_den_bosch
Z suzanne van_den_boschZ suzanne van_den_bosch
Z suzanne van_den_bosch
Deep-Dive: Predicting Customer Behavior with Apigee Insights
Deep-Dive: Predicting Customer Behavior with Apigee InsightsDeep-Dive: Predicting Customer Behavior with Apigee Insights
Deep-Dive: Predicting Customer Behavior with Apigee Insights
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
Amazon SageMaker 內建機器學習演算法 (Level 400)
Amazon SageMaker 內建機器學習演算法 (Level 400)Amazon SageMaker 內建機器學習演算法 (Level 400)
Amazon SageMaker 內建機器學習演算法 (Level 400)
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015
Lean startup
Lean startupLean startup
Lean startup
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AI
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...

Recently uploaded

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... Founder Sachin Dev Duggal's Strategic Approach to Create an Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
Abida Shariff
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada

Recently uploaded (20)

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... Founder Sachin Dev Duggal's Strategic Approach to Create an Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...

Building a Predictive Model

  • 2. Outline
      Predictive modeling methodology
      k-Nearest Neighbor (kNN) algorithm
      Singular value decomposition (SVD) method for dimensionality reduction
      Using a synthetic data set to test and improve your model
      Experiment and results
  • 3. The Business Problem
      Design a product recommender solution that will increase revenue.
  • 4. How Do We Increase Revenue?
      Increase Revenue
          Increase Conversion
          Increase Avg. Order Value
              Increase Unit Price
              Increase Units / Order
  • 5. Example
      Is this recommendation effective?
          Increase Unit Price
          Increase Units / Order
  • 6. What am I going to do?
  • 7. Predictive Model Framework
      Data:               What data?
      Features:           What feature?
      ML Algorithm:       Which algorithm?
      Prediction Output:  Cross-sell & Up-sell Recommendation
  • 8. What Data to Use?
      Explicit data
          Ratings
          Comments
      Implicit data
          Order history / Return history
          Cart events
          Page views
          Click-thru
          Search log
      In today’s talk we only use Order history and Cart events
  • 9. Predictive Model
      Data:               Order History, Cart Events
      Features:           What feature?
      ML Algorithm:       Which algorithm?
      Prediction Output:  Cross-sell & Up-sell Recommendation
  • 10. What Features to Use?
      We know that a given product tends to get purchased by customers with similar tastes or needs.
      Use user engagement data to describe a product.
      User engagement vector for item 17, over users 1 .. n (non-empty entries only): 1  .25  .25  1  .25
  • 11. Data Representation / Features
      When we merge every item’s user engagement vector, we get an m x n item-user matrix
      [Figure: sparse item-user matrix; rows are items 1 .. m, columns are users 1 .. n, entries are 1 or .25]
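As a rough illustration of the merge described above, the sketch below fills a small dense item-user matrix from a list of engagement events. The `Event` struct, the 0-based ids, and the rule of keeping the largest weight when a pair repeats are assumptions made for this example, not details from the talk; a real data set of this size would need a sparse representation.

```c
#include <stddef.h>

/* One engagement event (hypothetical layout for this sketch). */
typedef struct {
    int user_id;   /* 0-based column index (assumed) */
    int item_id;   /* 0-based row index (assumed) */
    double weight; /* engagement weight, e.g. 1 or .25 as in the slides */
} Event;

/* Fill a dense row-major m x n item-user matrix from a list of events.
 * Cells with no event stay 0. If an (item, user) pair appears more than
 * once, the largest weight is kept (an assumption, not from the talk). */
void build_item_user_matrix(const Event *events, size_t n_events,
                            double *matrix, int m, int n)
{
    for (size_t c = 0; c < (size_t)m * (size_t)n; c++)
        matrix[c] = 0.0;

    for (size_t e = 0; e < n_events; e++) {
        int i = events[e].item_id;
        int u = events[e].user_id;
        if (i < 0 || i >= m || u < 0 || u >= n)
            continue; /* ignore ids outside the matrix */
        double *cell = &matrix[(size_t)i * (size_t)n + (size_t)u];
        if (events[e].weight > *cell)
            *cell = events[e].weight;
    }
}
```

Row `i` of the filled matrix is then the user engagement vector of item `i`.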
  • 12. Data Normalization
      Ensure the magnitudes of the entries in the dataset matrix are appropriate
      [Figure: normalized item-user matrix; rows are items 1 .. m, columns are users 1 .. n]
      Remove column average, so frequent buyers don't dominate the model
  • 13. Data Normalization
      Different engagement data points (Order / Cart / Page View) should have different weights
      Common normalization strategies:
          Remove column average
          Remove row average
          Remove global mean
          Z-score
          Fill-in the null values
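A minimal sketch of the "remove column average" strategy, assuming a dense row-major matrix in which 0 means "no engagement recorded". Treating 0 as missing and averaging only over the recorded entries of each user column are choices made for this example; the talk does not specify them.

```c
#include <stddef.h>

/* Subtract each user's (column's) average from that user's recorded
 * entries. Entries equal to 0 are treated as missing and left untouched
 * (an assumption of this sketch). */
void remove_column_average(double *matrix, int m, int n)
{
    for (int u = 0; u < n; u++) {
        double sum = 0.0;
        int count = 0;
        for (int i = 0; i < m; i++) {
            double v = matrix[(size_t)i * (size_t)n + (size_t)u];
            if (v != 0.0) {
                sum += v;
                count++;
            }
        }
        if (count == 0)
            continue; /* user with no recorded engagement */
        double mean = sum / count;
        for (int i = 0; i < m; i++) {
            double *cell = &matrix[(size_t)i * (size_t)n + (size_t)u];
            if (*cell != 0.0)
                *cell -= mean;
        }
    }
}
```

One caveat of this simple layout: an entry equal to its column mean becomes 0 and can no longer be told apart from a missing value, so a separate presence mask may be preferable in practice.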
  • 14. Predictive Model
      Data:               Order History, Cart Events
      Features:           User engagement vector, Data Normalization
      ML Algorithm:       Which algorithm?
      Prediction Output:  Cross-sell & Up-sell Recommendation
  • 15. Which Algorithm?
      How do we find the items that have similar user engagement data?
      [Figure: item-user matrix; rows are items (1, 2, 17, 18, ..., m), columns are users 1 .. n]
      We can find the items that have similar user engagement vectors with the kNN algorithm
  • 16. k-Nearest Neighbor (kNN)
      Find the k items that have the most similar user engagement vectors
      [Figure: item-user matrix; rows are items 1 .. m, columns are users 1 .. n]
      Nearest Neighbors of Item 4 = [2,3,1]
  • 17. Similarity Measure for kNN
      Example rows (users 1 .. n): item 2 has entries 1, .5, 1; item 4 has entries 1, .5, 1, 1
      Jaccard coefficient:
      \[ sim(a,b) = \frac{(1+1)}{(1+1+1) + (1+1+1+1) - (1+1)} \]
      Cosine similarity:
      \[ sim(a,b) = \cos(a,b) = \frac{a \cdot b}{\|a\|_2 \, \|b\|_2} = \frac{(1 \cdot 1 + 0.5 \cdot 1)}{\sqrt{1^2 + 0.5^2 + 1^2} \, \sqrt{1^2 + 0.5^2 + 1^2 + 1^2}} \]
      Pearson correlation:
      \[ corr(a,b) = \frac{\sum_i (r_{ai} - \bar{r}_a)(r_{bi} - \bar{r}_b)}{\sqrt{\sum_i (r_{ai} - \bar{r}_a)^2} \, \sqrt{\sum_i (r_{bi} - \bar{r}_b)^2}} = \frac{m \sum a_i b_i - \sum a_i \sum b_i}{\sqrt{m \sum a_i^2 - (\sum a_i)^2} \, \sqrt{m \sum b_i^2 - (\sum b_i)^2}} \]
      \[ = \frac{match\_cols \cdot Dotprod(a,b) - sum(a) \cdot sum(b)}{\sqrt{match\_cols \cdot sum(a^2) - (sum(a))^2} \, \sqrt{match\_cols \cdot sum(b^2) - (sum(b))^2}} \]
  • 18. k-Nearest Neighbor (kNN)
      Similarity measure: cosine similarity
      [Figure: items 1 to 9 plotted in the item feature space]
      kNN with k=5: Nearest Neighbors(8) = [9,6,3,1,2]
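Once the similarity of one item to every other item is known, picking the nearest neighbors is a top-k selection. The sketch below assumes a precomputed array `sims` with one similarity score per item; the function name and the tie-breaking rule (lowest index wins) are choices made for this example.

```c
#include <stddef.h>

/* Write the indices of the k most similar items into out, most similar
 * first, skipping the item itself. Returns how many were written, which
 * is less than k when there are fewer than k other items.
 * Simple O(n * k) repeated selection; fine for a sketch. */
size_t top_k_neighbors(const double *sims, size_t n_items, size_t self,
                       size_t k, size_t *out)
{
    size_t found = 0;
    while (found < k) {
        size_t best = n_items; /* sentinel: nothing picked in this pass */
        for (size_t j = 0; j < n_items; j++) {
            if (j == self)
                continue;
            int taken = 0;
            for (size_t t = 0; t < found; t++) {
                if (out[t] == j) {
                    taken = 1;
                    break;
                }
            }
            if (taken)
                continue;
            if (best == n_items || sims[j] > sims[best])
                best = j;
        }
        if (best == n_items)
            break; /* no candidates left */
        out[found++] = best;
    }
    return found;
}
```

For large catalogs a bounded heap would avoid the repeated scans, but the selection logic stays the same.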
  • 19. Predictive Model
      Ver. 1: kNN
      Data:               Order History, Cart Events
      Features:           User engagement vector, Data Normalization
      ML Algorithm:       k-Nearest Neighbor (kNN)
      Prediction Output:  Cross-sell & Up-sell Recommendation
  • 20. Cosine Similarity: Code fragment

      #include <math.h>
      #include <stdio.h>

      long i_cnt = 100000;    // number of items: 100K
      long u_cnt = 2000000;   // number of users: 2M
      double data[i_cnt][u_cnt];  // 100K by 2M dataset matrix (in reality, it needs to be malloc allocation)
      double norm[i_cnt];
      double dot_product;
      long i, j, f;
      // assume data matrix is loaded
      // ......
      // calculate vector norm for each user engagement vector
      for (i = 0; i < i_cnt; i++) {
          norm[i] = 0;
          for (f = 0; f < u_cnt; f++) {
              norm[i] += data[i][f] * data[i][f];
          }
          norm[i] = sqrt(norm[i]);
      }
      // cosine similarity calculation
      for (i = 0; i < i_cnt; i++) {          // loop thru 100K
          for (j = 0; j < i_cnt; j++) {      // loop thru 100K
              dot_product = 0;
              for (f = 0; f < u_cnt; f++) {  // loop thru entire user space 2M
                  dot_product += data[i][f] * data[j][f];
              }
              printf("%ld %ld %lf\n", i, j, dot_product / (norm[i] * norm[j]));
          }
          // find the Top K nearest neighbors here
          // .......
      }

      1. 100K rows x 100K rows x 2M features --> scalability problem
         kd-tree, Locality sensitive hashing, MapReduce/Hadoop, Multicore/Threading, Stream Processors
      2. data[i] is high-dimensional and sparse, similarity measures are not reliable --> accuracy problem
      This leads us to the SVD dimensionality reduction!
  • 21. Singular Value Decomposition (SVD)
      \[ A = U \times S \times V^T \]
      A:    m x n matrix (items x users)
      U:    m x r matrix
      S:    r x r matrix
      V^T:  r x n matrix
      Choose rank = k, with k < r:
      \[ A_k = U_k \times S_k \times V_k^T \]
      Low rank approx. item profile is U_k * S_k
      Low rank approx. user profile is S_k * V_k^T
      Low rank approx. item-user matrix is A_k = U_k * S_k * V_k^T
  • 22. Reduced SVD
      \[ A_k = U_k \times S_k \times V_k^T \]
      A_k:    100K x 2M matrix (items x users)
      U_k:    100K x 3 matrix
      S_k:    3 x 3 matrix, rank = 3, descending singular values:
                  7 0 0
                  0 3 0
                  0 0 1
      V_k^T:  3 x 2M matrix
      Low rank approx. item profile is U_k * S_k
  • 23. SVD Factor Interpretation
      S: 3 x 3 matrix, descending singular values:
          7 0 0
          0 3 0
          0 0 1
      [Figure: singular values plot (rank=512). The leading singular values are the more significant latent factors; the tail is less significant (noises + others).]
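One common way to read a descending singular-value plot is to ask how many leading factors account for a given share of the total squared singular values. The sketch below implements that heuristic; it is not the procedure from the talk, and the function name and threshold parameter are assumptions of the example.

```c
#include <stddef.h>

/* Return the smallest k such that the first k singular values carry at
 * least `fraction` of the total sum of squared singular values.
 * s must be sorted in descending order; returns 0 if all values are 0. */
size_t rank_for_energy(const double *s, size_t r, double fraction)
{
    double total = 0.0;
    for (size_t i = 0; i < r; i++)
        total += s[i] * s[i];
    if (total == 0.0)
        return 0;

    double running = 0.0;
    for (size_t k = 0; k < r; k++) {
        running += s[k] * s[k];
        if (running >= fraction * total)
            return k + 1;
    }
    return r;
}
```

With the slide's singular values 7, 3, 1 the squared values are 49, 9, 1, so the first factor alone carries 49 of 59 and the first two carry 58 of 59.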
  • 24. SVD Dimensionality Reduction
      [Figure: each item row shrinks from (# of users) columns to (rank) columns of latent factors in the item profile U_k * S_k]
      Need to find the optimal low rank!
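Given the factors from an SVD library, the reduced item profile U_k * S_k is just the first k columns of U scaled by the matching singular values. The sketch below assumes U is stored dense and row-major with r columns and that the singular values are passed as a plain array; those layout choices and the function name are made for this example.

```c
#include <stddef.h>

/* Build the m x k item profile U_k * S_k from U (m x r, row-major) and
 * the singular values s (length r, descending). Requires k <= r.
 * Each item row shrinks from r columns to k latent-factor columns. */
void item_profile(const double *U, const double *s, size_t m, size_t r,
                  size_t k, double *profile)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < k; j++)
            profile[i * k + j] = U[i * r + j] * s[j];
    }
}
```

Neighbor search then runs on the k-column rows of `profile` instead of the full user-space vectors.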
  • 25. Missing values
      Difference between “0” and “unknown”
      Missing values do NOT appear randomly.
      Value = (Preference Factors) + (Availability) - (Purchased elsewhere) - (Navigation inefficiency) - etc.
      Approx. Value = (Preference Factors) +/- (Noise)
      Modeling missing values correctly will help us make good recommendations, especially when working with an extremely sparse data set
  • 26. Singular Value Decomposition (SVD)
      Use SVD to reduce dimensionality, so neighborhood formation happens in the reduced user space
      SVD helps the model find a low-rank approximation of the dataset matrix, retaining the critical latent factors and ignoring noise.
      The optimal low rank needs to be tuned
      SVD is computationally expensive
      SVD libraries:
          Matlab: [U, S, V] = svds(A,256);
          SVDPACKC
          SVDLIBC
          GHAPACK
  • 27. Predictive Model
      Ver. 2: SVD+kNN
      Data:               Order History, Cart Events
      Features:           User engagement vector, Data Normalization, SVD
      ML Algorithm:       k-Nearest Neighbors (kNN) in reduced space
      Prediction Output:  Cross-sell & Up-sell Recommendation
  • 28. Synthetic Data Set
      Why do we use a synthetic data set?
      So we can test our new model in a controlled environment
  • 29. Synthetic Data Set
      16 latent factors synthetic e-commerce data set
      Dimension: 1,000 (items) by 20,000 (users)
      16 user preference factors
      16 item property factors (non-negative)
      Txn Set: n = 55,360, sparsity = 99.72%
      Txn+Cart Set: n = 192,985, sparsity = 99.03%
      Download format (sample rows):
          user_id  item_id  type
          10       42       0.25
          10       997      0.25
          10       950      0.25
          11       836      0.25
          11       225      1
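The sparsity figures follow from the matrix dimensions: sparsity is the share of the 1,000 x 20,000 cells that hold no entry. A small sketch of that arithmetic, with a function name chosen for this example:

```c
/* Percentage of empty cells in an n_items x n_users matrix that holds
 * n_nonzero recorded entries. */
double sparsity_percent(long n_nonzero, long n_items, long n_users)
{
    double cells = (double)n_items * (double)n_users;
    return 100.0 * (1.0 - (double)n_nonzero / cells);
}
```

For the Txn Set, 55,360 entries out of 20,000,000 cells gives 1 - 0.002768 = 0.997232, matching the 99.72% on the slide.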
  • 30. Synthetic Data Set
      Purchase likelihood score (1K x 20K matrix) = Item property factors (1K x 16 matrix) x User preference factors (16 x 20K matrix)
      Purchase likelihood scores (rows are items, columns are users):
          X11 X12 X13 X14 X15 X16
          X21 X22 X23 X24 X25 X26
          X31 X32 X33 X34 X35 X36
          X41 X42 X43 X44 X45 X46
          X51 X52 X53 X54 X55 X56
      Item 3 property factors: (a, b, c); User 2 preference factors: (x, y, z)
      X32 = (a, b, c) . (x, y, z) = a * x + b * y + c * z
      X32 = Likelihood of Item 3 being purchased by User 2
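The score matrix on this slide is a plain matrix product of the two factor matrices. A minimal sketch, assuming dense row-major storage and a function name invented for the example:

```c
#include <stddef.h>

/* scores (m x n) = item_factors (m x f) * user_factors (f x n).
 * scores[i * n + u] is the dot product of item i's property factors
 * with user u's preference factors, i.e. the slide's X_iu. */
void purchase_likelihood(const double *item_factors,
                         const double *user_factors,
                         size_t m, size_t f, size_t n, double *scores)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t u = 0; u < n; u++) {
            double dot = 0.0;
            for (size_t t = 0; t < f; t++)
                dot += item_factors[i * f + t] * user_factors[t * n + u];
            scores[i * n + u] = dot;
        }
    }
}
```

In the deck's data set m = 1,000, f = 16 and n = 20,000; the slide's three-factor example (a, b, c) . (x, y, z) is the same computation with f = 3.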
  • 31. Synthetic Data Set
      For each user (example: User 1, with scores X11, X21, X31, X41, X51):
          1. Based on the distribution, pre-determine the # of items purchased by the user (# of items = 2)
          2. Sort the items by purchase likelihood score
          3. From the top, select and skip certain items to create data sparsity
      Result: User 1 purchased Item 4 and Item 1
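The sort-then-select step can be sketched as below. The slide does not say how items are skipped, so this example uses a fixed stride (take one, skip `stride - 1`) purely as a stand-in; the function name, the scratch buffer, and the stride rule are all assumptions.

```c
#include <stddef.h>

/* For one user: rank items by purchase likelihood (highest first), then
 * walk the ranking from the top, taking every stride-th item until n_buy
 * items are chosen. `order` is a caller-provided scratch array of
 * n_items entries; `picked` receives the chosen item indices.
 * Returns the number of items picked. */
size_t pick_purchases(const double *scores, size_t n_items,
                      size_t n_buy, size_t stride,
                      size_t *order, size_t *picked)
{
    if (stride == 0)
        stride = 1;

    /* insertion sort of item indices by descending score */
    for (size_t i = 0; i < n_items; i++)
        order[i] = i;
    for (size_t i = 1; i < n_items; i++) {
        size_t cur = order[i];
        size_t j = i;
        while (j > 0 && scores[order[j - 1]] < scores[cur]) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = cur;
    }

    size_t count = 0;
    for (size_t pos = 0; pos < n_items && count < n_buy; pos += stride)
        picked[count++] = order[pos];
    return count;
}
```

A generator closer to the talk would presumably draw the skips at random; the fixed stride only keeps the example deterministic.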
  • 32. Experiment Setup
      Each model (Random / kNN / SVD+kNN) will generate top 20 recommendations for each item.
      Compare model output to the actual top 20 provided by the synthetic data set
      Evaluation metrics:
          Precision %: overlap of the top 20 between model output and actual (higher is better)
          \[ Precision = \frac{|\{Found\_Top20\_items\} \cap \{Actual\_Top20\_items\}|}{|\{Found\_Top20\_items\}|} \]
          Quality metric: average of the actual ranking in the model output (lower is better)
          [Figure: two example lists of actual rankings: 1, 2, 30, 47, 50, 21 and 1, 2, 368, 62, 900, 510]
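Both metrics are short to compute. The sketch below assumes item ids are plain integers and that the actual rank of each recommended item is already known; the function names are invented for the example.

```c
#include <stddef.h>

/* Precision: share of the recommended items that also appear in the
 * actual top list. Returns a fraction in [0, 1]; multiply by 100 for %. */
double precision_at_n(const int *found, size_t n_found,
                      const int *actual, size_t n_actual)
{
    if (n_found == 0)
        return 0.0;
    size_t hits = 0;
    for (size_t i = 0; i < n_found; i++) {
        for (size_t j = 0; j < n_actual; j++) {
            if (found[i] == actual[j]) {
                hits++;
                break;
            }
        }
    }
    return (double)hits / (double)n_found;
}

/* Quality: average of the actual rankings of the items the model
 * returned (lower is better). */
double quality_metric(const int *actual_ranks, size_t n)
{
    if (n == 0)
        return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += actual_ranks[i];
    return sum / (double)n;
}
```

In the experiment both lists have 20 entries per item, and the per-item values would be averaged over all items to score a model.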
  • 33. Experimental Result
      kNN vs. Random (Control)
      [Charts: Precision % (higher is better); Quality (lower is better)]
  • 34. Experimental Result
      Precision % of SVD+kNN
      [Chart: Recall % (higher is better) vs. SVD rank, with the improvement marked]
  • 35. Experimental Result
      Quality of SVD+kNN
      [Chart: Quality (lower is better) vs. SVD rank, with the improvement marked]
  • 36. Experimental Result
      The effect of using Cart data
      [Chart: Precision % (higher is better) vs. SVD rank]
  • 37. Experimental Result
      The effect of using Cart data
      [Chart: Quality (lower is better) vs. SVD rank]
  • 38. Outline
      Predictive modeling methodology
      k-Nearest Neighbor (kNN) algorithm
      Singular value decomposition (SVD) method for dimensionality reduction
      Using a synthetic data set to test and improve your model
      Experiment and results
  • 40. Thank you
      Any questions or comments?