Benchmarking NODE against Collaborative Filtering
              Albert Azout                          Giri Iyengar, PhD
        Sociocast Networks LLC                  Sociocast Networks LLC
          New York, New York                      New York, New York
       albert.azout@sociocast.com              giri.iyengar@sociocast.com

                                  January 4, 2013


                                      Abstract
    We benchmark Sociocast’s proprietary NODE algorithm against the popular
collaborative filtering algorithm on the task of predicting the social bookmarking
activity of internet users. Our results indicate that NODE performs between 4 and 10
times better than collaborative filtering in precision, recall, and F1 score. This
performance holds across varying levels of the prediction window.


1 Introduction
Recommender systems have become widely used across many industries - e.g. Netflix
for movies, Pandora for music, and Amazon for consumer products. This technology
has also been studied in statistics and machine learning research communities. The
underlying problem is to form product recommendations based on previously recorded
data [1, 11]. Better recommendations can improve customer loyalty by helping cus-
tomers find products of interest they were previously unaware of [2].
    Ansari, Essegaier, and Kohli (2000) categorize recommendation systems into two
types: collaborative filtering and content-based approaches [3]. In this short study,
we benchmark the performance of Sociocast’s NODE algorithm against collaborative
filtering. NODE incorporates the time dimension into its core similarity function
between users. In collaborative filtering, by contrast, temporal dynamics are usually
introduced through additional parameters. The resulting parameter count can become
very large, which makes the algorithm hard to scale and impractical for most use
cases (cf. the Netflix Prize winning algorithm).


2 Testing Methodology
2.1 Delicious Dataset
We use the Delicious (a Yahoo! company) dataset that is publicly available. This
dataset represents bookmarking activity by 210,000 users on the www.delicious.com
website over a period of 10 days (Sept 5th, 2009 - Sept 14th, 2009). The first eight
days are given to both algorithms for training, and the last two days are withheld as
the ground truth for testing. We restrict the dataset to only those users who had at
least 10 bookmarks in this period. This represents 14,337 users and 600,752 bookmarks
over the 8-day training period and another 136,164 bookmarks over the test period.
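The filtering and train/test split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(user_id, url, timestamp)`, the function name `split_bookmarks`, and the choice to count bookmarks over the full ten days are all assumptions, since the text does not describe the dataset schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical record layout: (user_id, url, timestamp).
# The first eight days run Sept 5 through Sept 12, so training ends
# (exclusively) at midnight on Sept 13, 2009.
TRAIN_END = datetime(2009, 9, 13)
MIN_BOOKMARKS = 10


def split_bookmarks(records, train_end=TRAIN_END, min_bookmarks=MIN_BOOKMARKS):
    """Keep users with enough bookmarks, then split by timestamp."""
    # Assumption: activity is counted over the whole 10-day period.
    counts = Counter(user for user, _, _ in records)
    active = {user for user, n in counts.items() if n >= min_bookmarks}

    train, test = [], []
    for user, url, ts in records:
        if user not in active:
            continue
        (train if ts < train_end else test).append((user, url, ts))
    return train, test
```

Using an exclusive cutoff timestamp keeps the two withheld days (Sept 13 and 14) entirely out of the training set.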

2.2 Bookmark Classification
Each user-provided bookmark corresponds to a live URL. We classify each URL into
a space of 434 classes using a proprietary machine-learning-based classifier trained
on a custom-curated corpus. These classes correspond to the 2nd level of the IAB
standard taxonomy. An example classification of a URL could be “Sports, Basketball”
or “Technology Products, Laptops”. Each URL is allowed up to three classifications.
The prediction task can then be thought of as predicting which classes or topics each
user will bookmark next, based on their previous bookmarking activity.

2.3 Collaborative Filtering
User-based collaborative filtering [4, 5, 6] is a memory-based algorithm which mimics
the word-of-mouth behavior for rating data. The intuition is that users with similar
preferences will rate items similarly. Missing ratings for a user can be predicted by
finding a neighborhood of similar users and then aggregating the ratings of these users
to form a prediction. A neighborhood of similar users can be defined with either the
Pearson correlation coefficient or cosine similarity:

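\[
\mathrm{sim}_{\mathrm{Pearson}}(x, y) = \frac{\sum_{i \in I} (x_i - \bar{x})(y_i - \bar{y})}{(|I| - 1)\,\mathrm{sd}(x)\,\mathrm{sd}(y)} \tag{1}
\]
\[
\mathrm{sim}_{\mathrm{cosine}}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} \tag{2}
\]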

where I is the set of items, x and y are the rows of the rating matrix R that hold the
two users’ profile vectors, sd(·) is the standard deviation, and || · || is the l2 norm
of a vector. Once the users in the neighborhood N(a) ⊂ U of an active user are found,
either by thresholding the similarity or by taking the k nearest neighbors, the
easiest way to form predicted ratings is to take a similarity-weighted average of the
ratings in the neighborhood:
\[
\hat{r}_{aj} = \frac{1}{\sum_{i \in N(a)} s_{ai}} \sum_{i \in N(a)} s_{ai}\, r_{ij} \tag{3}
\]


    where s_{ai} is the similarity between the active user u_a and user u_i in the
neighborhood.
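As an illustration of equations (2) and (3), the sketch below computes cosine similarity between rows of a small rating matrix and forms the similarity-weighted prediction over the k nearest neighbors. It assumes non-negative ratings stored in a dense NumPy array; the function names and the default `k=2` are hypothetical and not taken from the benchmark.

```python
import numpy as np


def cosine_similarity(x, y):
    """Equation (2): dot product divided by the product of the l2 norms."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0


def predict_ratings(R, a, k=2):
    """Equation (3): similarity-weighted average over the k nearest
    neighbors of the active user a (a row index into R)."""
    n_users = R.shape[0]
    k = min(k, n_users - 1)
    # The active user is excluded from its own neighborhood.
    sims = np.array([
        cosine_similarity(R[a], R[i]) if i != a else -np.inf
        for i in range(n_users)
    ])
    neighbors = np.argsort(sims)[::-1][:k]  # indices of the k most similar users
    weights = sims[neighbors]
    total = weights.sum()
    if total <= 0:
        # No neighbor shares anything with the active user.
        return np.zeros(R.shape[1])
    return weights @ R[neighbors] / total
```

A thresholded neighborhood, the other option mentioned above, would replace the `argsort` step with a filter such as `sims >= threshold`.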
    In some data sets where numeric ratings are not appropriate or only binary data
is available, a version of CF using 0-1 data is available [7, 8]. The Delicious dataset is
best represented this way, where each rating rjk ∈ {0, 1} can be defined as:

\[
r_{jk} = \begin{cases} 1 & \text{if user } u_j \text{ bookmarked item } i_k \\ 0 & \text{otherwise.} \end{cases}
\]
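A minimal sketch of how such a 0-1 rating matrix could be assembled from (user, item) pairs is shown below. The pair format and the name `binary_rating_matrix` are assumptions for illustration; in this study the items would be the classified topics rather than raw URLs.

```python
import numpy as np


def binary_rating_matrix(pairs):
    """Build R with R[j, k] = 1 if user j bookmarked item k, else 0."""
    users = sorted({u for u, _ in pairs})
    items = sorted({i for _, i in pairs})
    u_idx = {u: n for n, u in enumerate(users)}
    i_idx = {i: n for n, i in enumerate(items)}

    R = np.zeros((len(users), len(items)), dtype=np.int8)
    for u, i in pairs:
        R[u_idx[u], i_idx[i]] = 1  # repeated bookmarks still yield 1
    return R, users, items
```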


A similarity measure which only focuses on matching ones and avoids the ambiguity
of zeroes representing either missing ratings or negative examples is the Jaccard index:

\[
\mathrm{sim}_{\mathrm{Jaccard}}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \tag{4}
\]

where X and Y are the sets of the items with a 1 in user profiles ua and ub , respectively.
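The Jaccard index of equation (4) maps directly onto set operations. The sketch below also shows one plausible way to turn it into a top-N list, by scoring each item with the summed similarity of the nearest neighbors that hold it. That scoring rule, the function names, and the default `k=2` are assumptions; the text does not specify how the CF baseline ranks items.

```python
def jaccard(X, Y):
    """Equation (4): size of the intersection over size of the union."""
    union = X | Y
    return len(X & Y) / len(union) if union else 0.0


def top_n(profiles, a, n, k=2):
    """Rank items by the summed Jaccard similarity of the k nearest
    neighbors of user a that hold them; return the n best items."""
    sims = {u: jaccard(profiles[a], items)
            for u, items in profiles.items() if u != a}
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]

    scores = {}
    for u in neighbors:
        for item in profiles[u]:
            scores[item] = scores.get(item, 0.0) + sims[u]
    # Ties are broken alphabetically so the output is deterministic.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:n]
```

Items already in the active user's profile are not filtered out here, because the prediction task concerns which topics a user will bookmark next, and those may repeat.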


3 Evaluation and Results
We ask each algorithm to generate the top-N recommended items for each user (where
N can vary), based on the training period. Each recommended item is then checked
against the withheld ground truth period to see whether it appears there. The results
can be summarized with the classical binary classification confusion matrix. Precision,
recall, and F1 are popular metrics used in information retrieval [9, 10]:

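\[
\mathrm{Precision} = \frac{\text{correctly recommended items}}{\text{total recommended items}}
\]
\[
\mathrm{Recall} = \frac{\text{correctly recommended items}}{\text{total useful recommendations}}
\]
\[
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]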
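The three metrics can be computed as in the sketch below. It assumes counts are pooled over all users before dividing (micro-averaging), which the text does not state explicitly; the function name `evaluate` and the dictionary inputs are hypothetical.

```python
def evaluate(recommended, truth):
    """Pooled precision, recall, and F1.

    recommended: dict mapping user -> list of recommended items
    truth:       dict mapping user -> set of items seen in the test period
    """
    correct = sum(len(set(recs) & truth.get(user, set()))
                  for user, recs in recommended.items())
    total_recommended = sum(len(recs) for recs in recommended.values())
    total_useful = sum(len(items) for items in truth.values())

    precision = correct / total_recommended if total_recommended else 0.0
    recall = correct / total_useful if total_useful else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, with two users who each receive two recommendations, one of which is correct, and four ground truth items in total, precision, recall, and F1 all come out to 0.5.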
    The tables below summarize the performance of the two algorithms for different
levels of N , where N is the number of recommendations each algorithm is forced to
make for each user. Each recommendation is evaluated against the ground truth set,
and the outcomes are tallied as precision, recall, and F1.

                                        Precision
                    N     NODE      CF       Factor of Improvement
                    1     35.31%    4.22%    8.37
                    2     31.01%    3.11%    9.93
                    5     23.50%    3.66%    6.42
                    10    18.19%    3.03%    6.01
                    15    15.05%    3.87%    3.89

                                         Recall
                    N     NODE      CF       Factor of Improvement
                    1     5.43%     0.65%    8.37
                    2     9.53%     0.96%    9.93
                    5     18.06%    2.81%    6.42
                    10    27.97%    4.65%    6.01
                    15    34.68%    8.92%    3.89




                                        F1 score
                    N     NODE      CF       Factor of Improvement
                    1     9.41%     1.12%    8.37
                    2     14.58%    1.46%    9.93
                    5     20.43%    3.18%    6.42
                    10    22.04%    3.67%    6.01
                    15    20.98%    5.39%    3.89
    NODE consistently outperforms CF by a factor of 3.89 to 9.93 in precision, recall,
and F1 score. Note that the factor of improvement is identical across all three
metrics, since both algorithms are forced to make the same number of predictions, and
the ground truth set is also the same for both algorithms.
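The reason the factor is the same for every metric can be made explicit with a short derivation. It assumes the metrics are computed from counts pooled over all users; the symbols c, M, and T are introduced here for illustration and do not appear in the original.

```latex
% Assumed notation (not from the original):
%   c = correctly recommended items, pooled over all users
%   M = N |U| = total recommended items (identical for both algorithms)
%   T = total useful recommendations in the ground truth set (identical for both)
\begin{align*}
\mathrm{Precision} &= \frac{c}{M}, \qquad \mathrm{Recall} = \frac{c}{T}, \\
F_1 &= 2 \cdot \frac{\frac{c}{M} \cdot \frac{c}{T}}{\frac{c}{M} + \frac{c}{T}} = \frac{2c}{M + T}, \\
\frac{\mathrm{Precision}_{\mathrm{NODE}}}{\mathrm{Precision}_{\mathrm{CF}}}
  &= \frac{\mathrm{Recall}_{\mathrm{NODE}}}{\mathrm{Recall}_{\mathrm{CF}}}
   = \frac{F_{1,\mathrm{NODE}}}{F_{1,\mathrm{CF}}}
   = \frac{c_{\mathrm{NODE}}}{c_{\mathrm{CF}}}.
\end{align*}
```

Because M and T cancel in each ratio, only the counts of correct recommendations matter. Recomputing the ratios from the rounded percentages in the tables gives slightly different values (for N = 1, 5.43 / 0.65 ≈ 8.35 and 9.41 / 1.12 ≈ 8.40), which is consistent with the reported 8.37 up to rounding.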


References
[1] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Analysis of recommendation
    algorithms for e-commerce. In EC ’00: Proceedings of the 2nd ACM conference on
    Electronic commerce, pages 158–167. ACM, 2000. ISBN 1-58113-272-7.
[2] J. B. Schafer, J. A. Konstan, and J. Riedl. E-commerce recommendation applica-
    tions. Data Mining and Knowledge Discovery, 5(1/2):115–153, 2001.
[3] A. Ansari, S. Essegaier, and R. Kohli. Internet recommendation systems. Journal
    of Marketing Research, 37:363–375, 2000.
[4] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to
    weave an information tapestry. Communications of the ACM, 35(12):61–70, 1992.
    ISSN 0001-0782. doi: http://doi.acm.org/10.1145/138859.138867.
[5] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: an open
    architecture for collaborative filtering of netnews. In CSCW ’94: Proceedings of
    the 1994 ACM conference on Computer supported cooperative work, pages 175–186.
    ACM, 1994. ISBN 0-89791-689-1. doi: http://doi.acm.org/10.1145/192844.192905.
[6] U. Shardanand and P. Maes. Social information filtering: Algorithms for automat-
    ing ’word of mouth’. In Conference proceedings on Human factors in computing
    systems (CHI’95), pages 210–217, Denver, CO, May 1995. ACM Press/Addison-
    Wesley Publishing Co.
[7] A. Mild and T. Reutterer. An improved collaborative filtering approach for
    predicting cross-category purchases based on binary market basket data. Journal of
    Retailing and Consumer Services, 10(3):123–133, 2003.
[8] J.-S. Lee, C.-H. Jun, J. Lee, and S. Kim. Classification-based collaborative filter-
    ing using market basket data. Expert Systems with Applications, 29(3):700–704,
    October 2005.
[9] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-
    Hill, New York, 1983.
[10] C. van Rijsbergen. Information retrieval. Butterworth, London, 1979.
[11] M. Hahsler. recommenderlab: A Framework for Developing and Testing Recom-
    mendation Algorithms. 2011.



GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 

Recently uploaded (20)

Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 

Sociocast CF Benchmark

2.1 Delicious Dataset

We use the publicly available Delicious (a Yahoo! company) dataset. It records bookmarking activity by 210,000 users on the www.delicious.com website over a period of 10 days (Sept 5th, 2009 - Sept 14th, 2009). The first eight days are given to both algorithms for training, and the last two days are withheld as the ground truth for testing. We restrict the dataset to those users who had at least 10 bookmarks in this period. This leaves 14,337 users, with 600,752 bookmarks over the 8-day training period and another 136,164 bookmarks over the test period.

2.2 Bookmark Classification

Each user-provided bookmark corresponds to a live URL. We classify each URL into a space of 434 classes using a proprietary machine-learning classifier trained on a custom-curated corpus. These classes correspond to the 2nd level of the IAB standard taxonomy. An example classification of a URL could be “Sports, Basketball” or “Technology Products, Laptops”. Each URL is allowed up to three classifications. The prediction task can then be thought of as predicting which classes or topics each user will bookmark next, based on their previous bookmarking activity.

2.3 Collaborative Filtering

User-based collaborative filtering [4, 5, 6] is a memory-based algorithm which mimics word-of-mouth behavior for rating data. The intuition is that users with similar preferences will rate items similarly. Missing ratings for a user can be predicted by finding a neighborhood of similar users and then aggregating the ratings of these users to form a prediction.

A neighborhood of similar users can be defined with either the Pearson correlation coefficient or cosine similarity:

\[ \mathrm{sim}_{Pearson}(x, y) = \frac{\sum_{i \in I} (x_i - \bar{x})(y_i - \bar{y})}{(|I| - 1)\, \mathrm{sd}(x)\, \mathrm{sd}(y)} \tag{1} \]

\[ \mathrm{sim}_{cosine}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} \tag{2} \]

where $I$ is the set of items, $x$ and $y$ are the rows of the rating matrix $R$ holding the two users' profile vectors, $\bar{x}$ and $\bar{y}$ are their means, $\mathrm{sd}(\cdot)$ is the standard deviation and $\|\cdot\|$ is the $l_2$ norm of a vector.
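The two similarity measures in Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustration, not the code used in the benchmark: the function names and the toy rating vectors are hypothetical, both vectors are assumed to be fully observed over the same item set, and a user with constant ratings (zero standard deviation) or an all-zero profile would need separate handling.

```python
import math
from statistics import mean, stdev


def sim_pearson(x, y):
    """Eq. (1): sample Pearson correlation between two users' rating vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mx, my = mean(x), mean(y)
    # Numerator: sum of co-deviations over the item set I.
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # statistics.stdev is the sample standard deviation (n - 1 denominator),
    # which matches the (|I| - 1) factor in Eq. (1).
    return cov / ((len(x) - 1) * stdev(x) * stdev(y))


def sim_cosine(x, y):
    """Eq. (2): cosine of the angle between two users' rating vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)


# Hypothetical ratings of two users over the same four items.
user_x = [5, 3, 4, 4]
user_y = [3, 1, 2, 2]
print(round(sim_pearson(user_x, user_y), 4))
print(round(sim_cosine(user_x, user_y), 4))
```

In this toy example the second user rates every item exactly two points lower than the first, so the Pearson correlation is 1 while the cosine similarity is slightly below 1. Pearson removes each user's mean before comparing; cosine does not.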

Once the users in a neighborhood $N(a) \subset U$ of an active user are found, either by taking a threshold on the similarity or by taking the $k$ nearest neighbors, the easiest way to form predicted ratings is to take the similarity-weighted average of the ratings in the neighborhood:

\[ \hat{r}_{aj} = \frac{1}{\sum_{i \in N(a)} s_{ai}} \sum_{i \in N(a)} s_{ai} \, r_{ij} \tag{3} \]

where $s_{ai}$ is the similarity between the active user $u_a$ and user $u_i$ in the neighborhood.

In some data sets where numeric ratings are not appropriate or only binary data is available, a version of CF using 0-1 data is available [7, 8]. The Delicious dataset is best represented this way, where each rating $r_{jk} \in \{0, 1\}$ can be defined as:

\[ r_{jk} = \begin{cases} 1 & \text{if user } u_j \text{ bookmarked item } i_k \\ 0 & \text{otherwise.} \end{cases} \]
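To make Eq. (3) concrete on 0-1 data, here is a minimal Python sketch. It assumes the neighborhood and its similarities have already been computed; the helper names `predict_scores` and `top_n`, the toy data, and the tie-breaking rule (lower item index first) are illustrative assumptions rather than details taken from the benchmark.

```python
def predict_scores(neighbor_ratings, neighbor_sims):
    """Eq. (3): similarity-weighted average of the neighbors' ratings.

    neighbor_ratings: one 0-1 rating row per neighbor i in N(a), indexed by item j.
    neighbor_sims:    the similarities s_ai, in the same neighbor order.
    """
    total = sum(neighbor_sims)
    if total == 0:
        raise ValueError("similarities must not sum to zero")
    n_items = len(neighbor_ratings[0])
    return [
        sum(s * row[j] for s, row in zip(neighbor_sims, neighbor_ratings)) / total
        for j in range(n_items)
    ]


def top_n(scores, n):
    """Indices of the n highest-scoring items (ties broken by item index)."""
    return sorted(range(len(scores)), key=lambda j: (-scores[j], j))[:n]


# Hypothetical neighborhood: two neighbors, four items (classes).
ratings = [
    [1, 0, 1, 0],  # neighbor 1 bookmarked items 0 and 2
    [1, 1, 0, 0],  # neighbor 2 bookmarked items 0 and 1
]
sims = [0.75, 0.25]

scores = predict_scores(ratings, sims)
print(scores)
print(top_n(scores, 2))
```

With binary ratings the predicted score of an item is simply the share of total neighborhood similarity held by the neighbors who bookmarked it, so an item bookmarked by every neighbor scores 1 and an item bookmarked by none scores 0.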

A similarity measure which focuses only on matching ones, and so avoids the ambiguity of zeroes representing either missing ratings or negative examples, is the Jaccard index:

\[ \mathrm{sim}_{Jaccard}(\mathcal{X}, \mathcal{Y}) = \frac{|\mathcal{X} \cap \mathcal{Y}|}{|\mathcal{X} \cup \mathcal{Y}|} \tag{4} \]

where $\mathcal{X}$ and $\mathcal{Y}$ are the sets of the items with a 1 in user profiles $u_a$ and $u_b$, respectively.

3 Evaluation and Results

We ask each algorithm to generate the top-N recommended items for each user (where N can vary), based on the training period. Each recommended item can then be checked for whether or not it appears in the withheld ground truth period. The results can be summarized with the classical binary classification confusion matrix. Precision, recall, and F1 are popular metrics used in information retrieval [9, 10]:

\[ \mathrm{Precision} = \frac{\text{correctly recommended items}}{\text{total recommended items}} \]

\[ \mathrm{Recall} = \frac{\text{correctly recommended items}}{\text{total useful recommendations}} \]

\[ F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]

The tables below summarize the performance of the two algorithms for different levels of N, where N is the number of recommendations each algorithm is forced to make for each user. Each recommendation is evaluated against the ground truth set, and the outcomes are then tallied using precision, recall, and F1.

    Precision
    N     NODE      CF       Factor of Improvement
    1     35.31%    4.22%    8.37
    2     31.01%    3.11%    9.93
    5     23.50%    3.66%    6.42
    10    18.19%    3.03%    6.01
    15    15.05%    3.87%    3.89

    Recall
    N     NODE      CF       Factor of Improvement
    1      5.43%    0.65%    8.37
    2      9.53%    0.96%    9.93
    5     18.06%    2.81%    6.42
    10    27.97%    4.65%    6.01
    15    34.68%    8.92%    3.89
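As an illustration of how such figures are tallied, the following Python sketch computes precision, recall, and F1 from top-N lists and a withheld ground truth. It assumes the counts are pooled over all users before dividing, which the text does not state explicitly; the function name and the toy data are hypothetical.

```python
def precision_recall_f1(recommended, ground_truth):
    """Pooled precision, recall and F1 for top-N recommendations.

    recommended:  dict mapping user -> list of recommended items (top-N).
    ground_truth: dict mapping user -> set of items seen in the test period.
    Assumption: hits and totals are summed over all users before dividing.
    """
    hits = sum(
        len(set(items) & ground_truth.get(user, set()))
        for user, items in recommended.items()
    )
    total_recommended = sum(len(set(items)) for items in recommended.values())
    total_relevant = sum(len(items) for items in ground_truth.values())

    precision = hits / total_recommended if total_recommended else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical top-2 recommendations and test-period bookmarks for two users.
recommended = {
    "u1": ["Sports", "Technology"],
    "u2": ["News", "Technology"],
}
ground_truth = {
    "u1": {"Sports", "Travel", "Food", "Music"},
    "u2": {"Technology"},
}

p, r, f1 = precision_recall_f1(recommended, ground_truth)
print(f"precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
```

Here two of the four recommendations are correct, and two of the five ground truth items are recovered.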

    F1 score
    N     NODE      CF       Factor of Improvement
    1      9.41%    1.12%    8.37
    2     14.58%    1.46%    9.93
    5     20.43%    3.18%    6.42
    10    22.04%    3.67%    6.01
    15    20.98%    5.39%    3.89

NODE consistently outperforms CF by a factor of 3.89 to 9.93 in precision, recall, and F1. Note that the factor of improvement is the same across all three metrics for a given N, since both algorithms are forced to make the same number of predictions and the ground truth set is also the same for both algorithms. The denominators therefore cancel, and each factor reduces to the ratio of correctly recommended items.

References

[1] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Analysis of recommendation algorithms for e-commerce. In EC ’00: Proceedings of the 2nd ACM conference on Electronic commerce, pages 158–167. ACM, 2000. ISBN 1-58113-272-7.
[2] J. B. Schafer, J. A. Konstan, and J. Riedl. E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1/2):115–153, 2001.
[3] A. Ansari, S. Essegaier, and R. Kohli. Internet recommendation systems. Journal of Marketing Research, 37:363–375, 2000.
[4] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61–70, 1992. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/138859.138867.
[5] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. Grouplens: an open architecture for collaborative filtering of netnews. In CSCW ’94: Proceedings of the 1994 ACM conference on Computer supported cooperative work, pages 175–186. ACM, 1994. ISBN 0-89791-689-1. doi: http://doi.acm.org/10.1145/192844.192905.
[6] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating ’word of mouth’. In Conference proceedings on Human factors in computing systems (CHI’95), pages 210–217, Denver, CO, May 1995. ACM Press/Addison-Wesley Publishing Co.
[7] A. Mild and T. Reutterer. An improved collaborative filtering approach for predicting cross-category purchases based on binary market basket data. Journal of Retailing and Consumer Services, 10(3):123–133, 2003.
[8] J.-S. Lee, C.-H. Jun, J. Lee, and S. Kim. Classification-based collaborative filtering using market basket data. Expert Systems with Applications, 29(3):700–704, October 2005.
[9] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[10] C. van Rijsbergen. Information retrieval. Butterworth, London, 1979.
[11] M. Hahsler. recommenderlab: A Framework for Developing and Testing Recommendation Algorithms. 2011.