RECOMMENDATION
ENGINE
DEMYSTIFIED
NEIGHBORHOOD METHODS
COLLABORATIVE FILTERING
Alex Lin
Senior Architect
Intelligent Mining
Outline
 Introduction

 User-oriented  Collaborative Filtering
 Item-oriented Collaborative Filtering

 Challenges

 Best Practices
Recommendation Engine
What is a Recommendation Engine (RE)?
  RE takes “observation” data and uses machine
   learning / statistical algorithms to predict
   outcomes or levels of interest.
  “Recommender systems form a specific type of
   information filtering (IF) technique that attempts
   to present information items (movies, music,
   books, news, images, web pages, etc.) that are
   likely of interest to the user.” – Wikipedia
Recommendation Engine
                                            Demographic
                                               based
                                                                         Clustering
       Social                   Knowledge                                  based
       Graph                      based
       based

                                                          Content
                Collaborative                              based
                  Filtering                                         Category
                                               Mixture
                                               Models                based




  This
      presentation will focus on
  Neighborhood-based Collaborative Filtering
      User-oriented method
      Item-oriented method
Neighborhood-based
Collaborative Filtering
                                   Data Normalization

     Data
                   Input Data
 (Observations)   Representation
                                      User-user similarity
                                              vs.
                                      Item-item similarity
                   Neighborhood
                    Formation

                                           Make
                                      Recommendation
                  Recommendation
                    Generation
                                      Web Page, Email,
                                         Direct Mail,
                                     Marketing Campaign,
                    Delivering               etc.
                    Mechanism
Outline
 Introduction

 User-oriented  Collaborative Filtering
 Item-oriented Collaborative Filtering

 Challenges

 Best Practices
User-oriented CF
  Input       Data Representation: Users / Items matrix.
                                    users
                1   2   3   4   5    6   7   8   9   10   …   n

           1    1       1            1               1

           2                             1   1   1
   items




           3    1   1       1                1   1

           4        1           1            1   1

                        1                1
           …




           m



   Cell       value “1” means user purchased the item.
   Data       Normalization is not shown on this slide.
User-oriented CF
  Neighborhood             Formation:
         Find the k most like-minded users in the system.
                              users
                1   2   3   4   5   6   7   8   9   10   …   n

            1   1   1   1           1               1

            2                           1   1   1
  items




            3   1   1       1               1   1            1

            4       1           1           1   1

                        1               1
           …




            m               1                   1
User-oriented CF
  Neighborhood             Formation:
         Find the k most like-minded users in the system.
                              users
                1   2   3   4   5   6   7   8   9   10   …   n

            1   1   1   1           1               1

            2                           1   1   1
  items




            3   1   1       1               1   1            1

            4       1           1           1   1

                        1               1
           …




            m               1                   1


   Identify        U9 and U2 are similar to U8
User-oriented CF
  Recommendation            Generation:

                                 users
             1   2   3   4   5    6   7   8   9   10   …   n

         1   1   1   1            1               1

         2                            1   1   1
 items




         3   1   1       1                1   1            1

         4       1           1            1   1

                     1                1
         …




         m               1                    1


  Identify I1       and I9 are not yet purchased by U8
User-oriented CF
  Recommendation             Generation:

                                  users
             1    2   3   4   5    6   7   8     9   10   …   n

         1   1    1   1            1       0.7       1

         2                             1   1     1
 items




         3   1    1       1                1     1            1

         4        1           1            1     1

                      1                1
         …




         m                1                0.9   1


  Predict       by taking weighted sum
User-oriented CF
Practical Implementation
    Compute and store all user-user similarities.

                                                                   u•v
         Cosine similarity:   sim(u,v) = cos(u,v) =
                                                                 u ∗ v
                                                                    2       2

    Find N items that will be most likely purchased by user u.
         Find k most similar users to u, save to Usim
         Get all items purchased by Usim, save to Icandidate
                   €
         Remove unavailable items in Icandidate
         Get all items purchased by u, save to Ipurchased
         Take Icandidate – Ipurchased = Irecmd
         Re-order items in Irecmd based on sum of user-user similarity

                               ∑ v ∈k−similarUser(u)
                                                       userSim(u,v) * rvi
                   pred(u,i) =
                                ∑   v ∈k−similarUser( u)
                                                           userSim(u,v)



          €
Outline
 Introduction

 User-oriented  Collaborative Filtering
 Item-oriented Collaborative Filtering

 Challenges

 Best Practices
Item-oriented CF
  Input       Data Representation: Users / Items matrix.
                                    users
                1   2   3   4   5    6   7   8   9   10   …   n

           1    1       1            1               1

           2                             1   1   1
   items




           3    1   1       1                1   1

           4        1           1            1   1

                        1                1
           …




           m



   Cell       value “1” means user purchased the item.
   Data       Normalization is not shown on this slide.
Item-oriented CF
  Neighborhood             Formation:
         Find the k items that have the most similar user
          vectors
                             users
                1   2   3   4   5   6   7   8   9   10   …   n

            1   1       1           1               1

            2       1                   1   1   1
  items




            3   1           1               1   1

            4       1           1       1   1   1

                        1               1
            …




            m               1                   1
Item-oriented CF – cont.
  Recommendation           Generation
                                1
                   W2-1

             2     W2-3         3
                                    Predict by taking weighted sum
                                4
                                    TopN Recmd. for U8 : {1,9,3}
  U8         4
                   W4-1         1


                                8
             8

       purchased   W8-9         9
         items
                   item-item
                   similarity
Item-oriented CF
Practical Implementation
    Compute and store all item-item similarities.

                                                               a•b
         Cosine similarity: sim(a,b) = cos(a,b) =
                                                              a ∗ b
                                                                2     2


    Find N items that will be most likely purchased by user u.
         Get all items purchased by u, save to Ipurchased
                     €
         For each item in Ipurchased, find k most similar items save them
          to Icandidate
         Remove unavailable items in Icandidate
         Get all items purchased by u, save to Ipurchased
         Take Icandidate – Ipurchased = Irecmd
         Re-order items in Irecmd based on
                                   ∑  j ∈ purchasedItems(u)
                                                              itemSim(i, j) * ruj
                       pred(u,i) =
                                    ∑    j ∈ purchasedItems(u)
                                                                 itemSim(i, j)
Outline
 Introduction

 User-oriented  Collaborative Filtering
 Item-oriented Collaborative Filtering

 Challenges

 Best Practices
The Challenges you will face
  Data Sparsity Issue
  Cold Start Problem
  Curse of Dimensionality

  Scalability
Data Sparsity Issue
  Missing       values in the Users / Items matrix.
                                  users
                                                                        Not
             1   2   3    4   5    6   7     8     9     10       …   n
                                                                      Reliable     Mostly
                                                                                  Unknown
         1

         2                           ∑                       userSim(u,v) * rvi
 items




                                       v ∈k−similarUser(u)
         3   1
                         pred(u,i) =         1
                                      ∑    v ∈k−similarUser( u)
                                                                  userSim(u,v)
         4                                   1

                          1
         …




         m   €

  NetflixPrize data set: 98.82% of cells are blank
  Typical e-commerce txn data set can be 10-100
   time more sparse than Netflix Prize data set !!
Cold Start Problem
    It occurs when new item or new user is added to the
     data matrix.
                                     users
                 1   2   3   4   5    6   7   8   9   10   …   n

             1   1       1            1               1

             2                            1   1   1
     items




             3   1   1       1                1   1            1

             4       1           1            1   1

                         1                1
             …




             m


    RE does not have enough knowledge about this new user or
     this new item yet.
    Content-based REs can be incorporated to alleviate cold start
     problem.
Curse of Dimensionality
  Adding  more features (items or users) can
   increase the noise, and hence the error.
  There aren’t enough observations to get good
   estimates. users
            items
Scalability
  User  neighborhood formation: O(n2) for n users
  Item neighborhood formation: O(m2) for m items

  When m (# of items) << n (# of users), item-based
   CF will be more efficient than user-based CF
  Ability to update neighborhood incrementally
Outline
 Introduction

 User-oriented  Collaborative Filtering
 Item-oriented Collaborative Filtering

 Challenges

 Best Practices
Best Practices
  Understand  the data thoroughly
  Define business objectives and conversion metrics
   judiciously
  Understand context and user intent

  Apply adaptive reinforcement learning

  Optimize RE using cost-based methods

  Be aware of data-shift issue

  Optimize marketing messages delivered with
   recommendation results
Understand the data thoroughly
  What data are available?
  E-commerce data set typically contains:
      Clickstream
      Shopping cart / Saved Items / Wish list / Shared Item
      Order / Return
      User profile
      User ratings
  How  are these data points being collected?
  Is there pre-existing bias in the data? or leakage?

  Is the data related to what we want to predict?
Define business objectives and
conversion metrics judiciously
  Definingcorrect conversion metrics can be a
  competitive advantage.


                     mitigate
                                   increase
                    navigation
                   inefficiency
                                  # of orders




                     up-sell       increase
       Web                                        more
                     within        average
     property      the context    item price
                                                 revenue




                    cross-sell    increase
                     within         items
                   the context    per order




                 Recommendation
                     Engine
Understand context and user
intent
  Context
        should be considered when RE making
  recommendation.

             Month: December
             Temperature: 45oF
Apply adaptive reinforcement
learning

                      C1

                      C2

                      C3



                                                                Incorporating clickstream
                                                                 adaptive reinforcement


                 ∑
                 j ∈ purchasedItems(u)
                                         itemSim(i, j) * ruj
     pred(u,i) =                                               + W iC i
                  ∑   j ∈ purchasedItems(u)
                                              itemSim(i, j)



 €
Optimize RE using cost-based
models
  Cost-based     engine optimization

        Metrics




                                        Parameter
Be aware of the data-shift issue
      Data    collection UI changes will influence data
         significantly, creating artificial data shifting




Netflix prize data set
Y. Koren, "Collaborative Filtering with Temporal Dynamics," Proc. 15th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data
Mining (KDD 09), ACM Press, 2009, pp. 447-455.12
Optimize marketing message
delivered with recommendation result
  It's   How You Say It




                           Screenshots from Amazon.com
RECOMMENDATION
ENGINE
DEMYSTIFIED
Alex Lin
Intelligent Mining
Email: alin@intelligentmining.com
Twitter: DKALab

Recommendation Engine Demystified

  • 1.
  • 2.
    Outline  Introduction  User-oriented CollaborativeFiltering  Item-oriented Collaborative Filtering  Challenges  Best Practices
  • 3.
    Recommendation Engine What isa Recommendation Engine (RE)?   RE takes “observation” data and uses machine learning / statistical algorithms to predict outcomes or levels of interest.   “Recommender systems form a specific type of information filtering (IF) technique that attempts to present information items (movies, music, books, news, images, web pages, etc.) that are likely of interest to the user.” – Wikipedia
  • 4.
    Recommendation Engine Demographic based Clustering Social Knowledge based Graph based based Content Collaborative based Filtering Category Mixture Models based   This presentation will focus on Neighborhood-based Collaborative Filtering   User-oriented method   Item-oriented method
  • 5.
    Neighborhood-based Collaborative Filtering Data Normalization Data Input Data (Observations) Representation User-user similarity vs. Item-item similarity Neighborhood Formation Make Recommendation Recommendation Generation Web Page, Email, Direct Mail, Marketing Campaign, Delivering etc. Mechanism
  • 6.
    Outline  Introduction  User-oriented CollaborativeFiltering  Item-oriented Collaborative Filtering  Challenges  Best Practices
  • 7.
    User-oriented CF   Input Data Representation: Users / Items matrix. users 1 2 3 4 5 6 7 8 9 10 … n 1 1 1 1 1 2 1 1 1 items 3 1 1 1 1 1 4 1 1 1 1 1 1 … m   Cell value “1” means user purchased the item.   Data Normalization is not shown on this slide.
  • 8.
    User-oriented CF   Neighborhood Formation:   Find the k most like-minded users in the system. users 1 2 3 4 5 6 7 8 9 10 … n 1 1 1 1 1 1 2 1 1 1 items 3 1 1 1 1 1 1 4 1 1 1 1 1 1 … m 1 1
  • 9.
    User-oriented CF   Neighborhood Formation:   Find the k most like-minded users in the system. users 1 2 3 4 5 6 7 8 9 10 … n 1 1 1 1 1 1 2 1 1 1 items 3 1 1 1 1 1 1 4 1 1 1 1 1 1 … m 1 1   Identify U9 and U2 are similar to U8
  • 10.
    User-oriented CF   Recommendation Generation: users 1 2 3 4 5 6 7 8 9 10 … n 1 1 1 1 1 1 2 1 1 1 items 3 1 1 1 1 1 1 4 1 1 1 1 1 1 … m 1 1   Identify I1 and I9 are not yet purchased by U8
  • 11.
    User-oriented CF   Recommendation Generation: users 1 2 3 4 5 6 7 8 9 10 … n 1 1 1 1 1 0.7 1 2 1 1 1 items 3 1 1 1 1 1 1 4 1 1 1 1 1 1 … m 1 0.9 1   Predict by taking weighted sum
  • 12.
    User-oriented CF Practical Implementation   Compute and store all user-user similarities. u•v   Cosine similarity: sim(u,v) = cos(u,v) = u ∗ v 2 2   Find N items that will be most likely purchased by user u.   Find k most similar users to u, save to Usim   Get all items purchased by Usim, save to Icandidate €   Remove unavailable items in Icandidate   Get all items purchased by u, save to Ipurchased   Take Icandidate – Ipurchased = Irecmd   Re-order items in Irecmd based on sum of user-user similarity ∑ v ∈k−similarUser(u) userSim(u,v) * rvi pred(u,i) = ∑ v ∈k−similarUser( u) userSim(u,v) €
  • 13.
    Outline  Introduction  User-oriented CollaborativeFiltering  Item-oriented Collaborative Filtering  Challenges  Best Practices
  • 14.
    Item-oriented CF   Input Data Representation: Users / Items matrix. users 1 2 3 4 5 6 7 8 9 10 … n 1 1 1 1 1 2 1 1 1 items 3 1 1 1 1 1 4 1 1 1 1 1 1 … m   Cell value “1” means user purchased the item.   Data Normalization is not shown on this slide.
  • 15.
    Item-oriented CF   Neighborhood Formation:   Find the k items that have the most similar user vectors users 1 2 3 4 5 6 7 8 9 10 … n 1 1 1 1 1 2 1 1 1 1 items 3 1 1 1 1 4 1 1 1 1 1 1 1 … m 1 1
  • 16.
    Item-oriented CF –cont.   Recommendation Generation 1 W2-1 2 W2-3 3 Predict by taking weighted sum 4 TopN Recmd. for U8 : {1,9,3} U8 4 W4-1 1 8 8 purchased W8-9 9 items item-item similarity
  • 17.
    Item-oriented CF Practical Implementation   Compute and store all item-item similarities. a•b   Cosine similarity: sim(a,b) = cos(a,b) = a ∗ b 2 2   Find N items that will be most likely purchased by user u.   Get all items purchased by u, save to Ipurchased €   For each item in Ipurchased, find k most similar items save them to Icandidate   Remove unavailable items in Icandidate   Get all items purchased by u, save to Ipurchased   Take Icandidate – Ipurchased = Irecmd   Re-order items in Irecmd based on ∑ j ∈ purchasedItems(u) itemSim(i, j) * ruj pred(u,i) = ∑ j ∈ purchasedItems(u) itemSim(i, j)
  • 18.
    Outline  Introduction  User-oriented CollaborativeFiltering  Item-oriented Collaborative Filtering  Challenges  Best Practices
  • 19.
    The Challenges youwill face   Data Sparsity Issue   Cold Start Problem   Curse of Dimensionality   Scalability
  • 20.
    Data Sparsity Issue  Missing values in the Users / Items matrix. users Not 1 2 3 4 5 6 7 8 9 10 … n Reliable Mostly Unknown 1 2 ∑ userSim(u,v) * rvi items v ∈k−similarUser(u) 3 1 pred(u,i) = 1 ∑ v ∈k−similarUser( u) userSim(u,v) 4 1 1 … m €   NetflixPrize data set: 98.82% of cells are blank   Typical e-commerce txn data set can be 10-100 time more sparse than Netflix Prize data set !!
  • 21.
    Cold Start Problem   It occurs when new item or new user is added to the data matrix. users 1 2 3 4 5 6 7 8 9 10 … n 1 1 1 1 1 2 1 1 1 items 3 1 1 1 1 1 1 4 1 1 1 1 1 1 … m   RE does not have enough knowledge about this new user or this new item yet.   Content-based REs can be incorporated to alleviate cold start problem.
  • 22.
    Curse of Dimensionality  Adding more features (items or users) can increase the noise, and hence the error.   There aren’t enough observations to get good estimates. users items
  • 23.
    Scalability   User neighborhood formation: O(n2) for n users   Item neighborhood formation: O(m2) for m items   When m (# of items) << n (# of users), item-based CF will be more efficient than user-based CF   Ability to update neighborhood incrementally
  • 24.
    Outline  Introduction  User-oriented CollaborativeFiltering  Item-oriented Collaborative Filtering  Challenges  Best Practices
  • 25.
    Best Practices   Understand the data thoroughly   Define business objectives and conversion metrics judiciously   Understand context and user intent   Apply adaptive reinforcement learning   Optimize RE using cost-based methods   Be aware of data-shift issue   Optimize marketing messages delivered with recommendation results
  • 26.
    Understand the datathoroughly   What data are available?   E-commerce data set typically contains:   Clickstream   Shopping cart / Saved Items / Wish list / Shared Item   Order / Return   User profile   User ratings   How are these data points being collected?   Is there pre-existing bias in the data? or leakage?   Is the data related to what we want to predict?
  • 27.
    Define business objectivesand conversion metrics judiciously   Definingcorrect conversion metrics can be a competitive advantage. mitigate increase navigation inefficiency # of orders up-sell increase Web more within average property the context item price revenue cross-sell increase within items the context per order Recommendation Engine
  • 28.
    Understand context anduser intent   Context should be considered when RE making recommendation. Month: December Temperature: 45oF
  • 29.
    Apply adaptive reinforcement learning C1 C2 C3 Incorporating clickstream adaptive reinforcement ∑ j ∈ purchasedItems(u) itemSim(i, j) * ruj pred(u,i) = + W iC i ∑ j ∈ purchasedItems(u) itemSim(i, j) €
  • 30.
    Optimize RE usingcost-based models   Cost-based engine optimization Metrics Parameter
  • 31.
    Be aware ofthe data-shift issue   Data collection UI changes will influence data significantly, creating artificial data shifting Netflix prize data set Y. Koren, "Collaborative Filtering with Temporal Dynamics," Proc. 15th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 09), ACM Press, 2009, pp. 447-455.12
  • 32.
    Optimize marketing message deliveredwith recommendation result   It's How You Say It Screenshots from Amazon.com
  • 33.