How to Develop Online Recommendation Systems that Deliver Superior Business Performance


Supported by artificial intelligence, intelligent recommendation systems are a boon for e-commerce and other business activities. This paper examines how clustering, data classification, and collaborative filtering as well as a number of algorithms are harnessed to make recommendation systems powerful tools.

Published in: Business

Cognizant 20-20 Insights | January 2012

Executive Summary

Over the past two decades, the Internet has emerged as the mainstream medium for online shopping, social networking, e-mail and more. Corporations also view the Web as a potential business accelerator. They see the huge volume of transactional and interaction data generated by the Internet as R&D that informs the creation of new and more competitive services and products.

Several “e-movement” crusaders have discovered that customers spend significant amounts of time researching products they seek before purchasing. In a bid to assist customers in these efforts, and conserve precious time, these organizations offer users suggestions of products they may be interested in. This serves the dual purpose of not just attracting browsers but converting them into buyers. For instance, an online bookstore may know that a customer has interest in mobile technology based on previous site visits and suggest relevant titles to purchase.

An uninitiated user may be impressed by such suggestions. Suggestions (or “recommendations” as they are popularly known) predict likes and dislikes of users. To offer meaningful recommendations to site visitors, these companies need to store huge amounts of data pertaining to different user profiles and their corresponding interests. This eventually culminates in information overload, or difficulty in understanding and making informed decisions. One solution to combating this issue is what is known as a recommendation system.

Many major e-commerce Websites are already using recommendation systems to provide relevant suggestions to their customers. The recommendations could be based on various parameters, such as items popular on the company’s Website; user characteristics such as geographical location or other demographic information; or past buying behavior of top customers.

This white paper presents an overview of how we are helping a Fortune 500 organization to implement a recommendation system. Moreover, this paper also sheds light on key challenges that may be encountered during implementation of a recommendation system built on the open source Apache Mahout,[1] a large-data library of statistical and analytical algorithms.

Recommendation Systems

Recommendation systems can be considered as a valuable extension of traditional information systems used in industries such as travel and hospitality. However, recommendation systems have mathematical roots and are more akin to artificial intelligence (AI) than any other IT discipline. A recommendation system learns from a customer’s behavior and recommends a product in which users may be interested. At the heart of recommendation systems are machine-learning constructs. Leading e-commerce players use recommendation engines that sift users’ past purchase histories to recommend products such as magazine articles, books, goods, etc. Here is how major e-commerce companies use recommendation engines to improve their sales and their customers’ shopping experience.

• Amazon.com: Depending on past purchases and user activity, the site recommends products of user interest.[2]
• Netflix: Recommends DVDs in which a user may be interested by category like drama, comedy, action, etc. Netflix went so far as to offer a $1 million[3] prize to researchers who could improve its recommendation engine.
• eBay: Collects user feedback about its products which is then used to recommend products to users who have exhibited similar behaviors.[4]

Online companies that leverage recommendation systems can increase sales by 8% to 12%.[5]

Companies that succeed with recommendation engines are those that can quickly and efficiently turn vast amounts of data into actionable information.

Anatomy of a Recommendation Engine

The key component of a recommendation system is data. This data may be garnered by a variety of means such as customer ratings of products, feedback/reviews from purchasers, etc. This data will serve as the basis for recommendations to users. After data collection, recommendation systems use machine-learning algorithms to find similarities and affinities between products and users. Recommender logic programs are then used to build suggestions for specific user profiles. This technique of filtering the input data and giving recommendations to users is also known as “collaborative filtering.”[6]

Along with collaborative filtering, recommendation systems also use other machine-learning techniques such as clustering and classification of data. Clustering is a technique which is used to bundle large amounts of data together into similar categories. It is also used to see data patterns and render huge amounts of data simpler to manage. For instance, Google News[7] creates clusters of similar news information when grouping diverse arrays of news articles. Many other search engines use clustering to group results for similar search terms.

Classification is a technique used to decide whether new input or a search term matches a previously observed pattern. It is also used to detect suspicious network activity. Yahoo! Mail[8] uses classification to decide if an incoming message is spam. Image sharing sites like Picasa[9] use classification techniques to determine whether photos contain human faces. They then offer recommendations of people that are identified in the user contacts list.[10]

A Robust System to Counter Information Overload

We are working with a leading multinational manufacturing company that has numerous product research labs with many scientists and researchers in numerous countries working on different technologies. To help facilitate scientific research, and to buy the latest technology information, this client partnered with information providers such as Scopus, Knovel, etc. But despite these data sources, scientists and researchers were often unable to find the right information to improve their research. Also, scientists across the globe were unable to collaborate and share technical information with each other. This situation is typical of companies dealing with information overload.

To increase the informational awareness of scientists and other employees, the client wanted to create a system to recommend resources like patents, articles and journals from paid content providers. A successful system needed to learn from user searches and be intelligent enough to recommend popular resources similar to the ones that a user is currently working from. The system was also expected to provide scientists with useful insights on information other scientists across the globe are using. Finally, the system was to serve as a platform to connect scientists working on similar technologies.

We helped to design and develop the system, which was dubbed “intelligent recommendation system” (IRS). Many of the problems the organization faced originated from the multiple preferences and needs of users pertaining to their individual research topics. To make the system adaptive to specific user requirements, the solution proposed was to use a recommendation system. As a first step towards the solution, the large information base possessed by the client was categorized/grouped by specific criteria. After much contemplation of data and size of the user base, we decided to implement the system using the Apache Mahout framework.
The Mahout framework is highly flexible and lets developers customize outcomes according to their ad hoc requirements. We then developed a customized algorithm to recommend relevant resources to scientists and researchers.

Building an Intelligent Recommendation System

The Solution

The most important purpose of an intelligent recommendation system (IRS) is to increase awareness among scientists about the areas which they are exploring, technologies on which their colleagues are working, and information about experts and their views on that particular discipline. Despite the possession of huge amounts of data, there was very little insight on the information scientists were seeking. Designing a recommendation system differs fundamentally from traditional software design: the overall system design depends heavily on the choice of algorithms and the architecture employed. By using Apache Mahout, the team selected a conventional open source framework that implements machine-learning algorithms.

Apache Mahout is a new Apache Software Foundation (ASF) project whose primary goal is to create scalable machine-learning algorithms that are free to use under the Apache license. The term “Mahout” is derived from the Hindi word that means elephant driver. The Mahout project started in 2008 as a subproject of Apache’s Lucene project, which is a popular search engine. Given the amount of data that the client possessed, it was imperative that the IRS be highly scalable. Mahout is considered a superior way of building recommendation systems because it implements all three machine-learning techniques: collaborative filtering, clustering and classification. Collaborative filtering is the primary technique used by Apache Mahout to provide recommendations. Given rating data along with a set of users and items, collaborative filtering generates recommendations in one of the following four ways.

• User-based: Recommendations are made based on users with similar characteristics.
• Item-based: Recommendations are based on similar items.
• Slope-one: A fast technique that offers recommendations based on previous user-rated items.
• Model-based: This approach compares the profile of an active user to aggregate user clusters, rather than the concrete profiles.

There are many algorithms that are used to calculate similarities between two entities. The choice of algorithm plays a vital role in deciding the quality of the recommendation that is most suited for a given scenario. Since the IRS for this client needed to offer recommendations to scientists based on their multiple preferences, we adopted user-based collaborative filtering (see Figure 1).

Figure 1: Recommendation System Execution Flow (data model, similarity algorithms, neighborhoods, recommender logic program, evaluator program, user recommendations).
Providing Recommendations

Creating an Input

Within the process of generating recommendations, a crucial step is for the IRS to generate rating and usage data. Such data can indicate which articles are popular and which ones had the most read counts. To collect and consolidate this information, our IRS applied the Boolean data model. This technique tracks which article/product was visited by each particular user. This strategy is widely used in “social bookmarking” sites. In the IRS developed for this client, articles were rated in a range of 1 to 2, depending on the number of times an article was visited or read. A decimal number closer to 1 signifies fewer hits, while numbers closer to 2 signify higher usage. This rated data is then loaded into one of the following data models. In this context, data model refers to the abstraction layer which encapsulates product data and corresponding user ratings.

The following are the data models that can be used to create an IRS depending on input data.

• In-memory data model: Data is prepared using programs which make objects of the data with product properties like product id, user id and ratings.
• File-based data: Data is provided in a comma-separated file along with a preference file. Implementation of a data refresh module will be required to reload data into the comma-separated file.
• Database-based data: Relational databases will be a good choice if the data is in the order of gigabytes. However, the implication is that a recommendation engine based on this model will be much slower than in-memory data representations.

For our client, an IRS was created using a file-based data model. This approach delivers superior performance and execution time compared with a database-based model. Also, to sustain the freshness of the file and thereby return newer articles and resources with reduced latency, a scheduled refresh module was used which runs daily to create the rating data for new resources.

Finding Similarity Between Users/Items

After creating a data model, the next step in building an IRS is finding similarities and patterns between users and items. There are many algorithms which can be used to find similarities: GenericUserSimilarity, TanimotoCoefficientSimilarity, PearsonCorrelationSimilarity, SpearmanCorrelationSimilarity and EuclideanDistanceSimilarity, to name a few. By far the most common algorithm is TanimotoCoefficientSimilarity as it takes all types of input data, such as binary or numbers. TanimotoCoefficient provides similarity based on the ratings of each item. In the IRS we were building, due to the lack of rating data for all items, our team decided to customize the similarity algorithm based on user demographic properties. The demographic properties such as country, division, job title, department, etc. were used to establish similarity among users. The output of this algorithm is pivotal as it is the base on top of which recommendations are built.

Generating Recommendations

Once similarity is established, special recommender logic programs are used to recommend items to users. This is very similar to user preferences. Our IRS used a customized recommender logic program which analyzes and matches the ratings of one user with similar user ratings and recommends articles/items which match the user interest. The output of this recommender logic program is the actual recommendations for the user. In the case of multiple domains such as fields of science (ceramics, nanotechnology, etc.), recommendations might not be accurate since user interests and expertise vary significantly. For such scenarios, recommendations should be evaluated with user domain or user demographics. Programs which aid in cross-verification of the results of recommendation engines are known as “evaluators.” Our IRS validates recommended items with different relevance parameters such as scientist domain, technology and scientist interest. The output of this program proved beneficial in offering concrete recommendations to scientists and researchers and, hence, conserved time.

Key Challenges

The problems faced during development were compounded by the complexity of the algorithms used, as well as the common problems of software design. Apart from these, issues unique to AI applications, such as the need for good quality input data, also arose. Common algorithms such as Tanimoto and Pearson similarity have several advantages, such as the ability to detect the quality and features of an item at the time of recommendation, especially when a user rating on a scale of 1 to 5 is available. Another advantage is that these algorithms are useful in domains where analyzing data is difficult or costly, such as gathering data about science articles where domain knowledge is required to rate an article.

The first challenge in the development of our IRS was that the system was brand new and lacked user rating information. Having a huge amount of input data but few users and no ratings is referred to as the “cold start problem.” A solution to this problem was to use the TanimotoCoefficient[11] algorithm and a binary dataset to feed the system with the much-sought-after data during the initial use of the system. In the course of time, user rating data will be obtained, after which the similarity algorithm and input data structure can be changed. This initial dataset can be prepared by collecting data such as maximum hits of a particular product or tracking visits of articles. If a binary dataset is adopted, then the most suitable algorithms are TanimotoCoefficientSimilarity or LogLikelihoodSimilarity,[12] since both are defined for binary data; the Tanimoto coefficient is computed with the Jaccard[13] coefficient formula. As the usage of the recommendation system grows, huge amounts of data will be collected. It is advisable at this point to use clustering mechanisms to sustain performance. A solution to deal with massive data by means of data clustering is Hadoop.[14] Data clustering requires multiple machines to handle massive data sets and stores all resources that are required to provide recommendations.

The next challenge is providing recommendations to a user who is not part of a clear group; these users can be called “grey sheep.” The challenge of giving recommendations to new or unknown users is called the “grey sheep problem.” This problem mostly arises in Internet applications with large and diverse customer bases that access the IRS application. A solution to this problem is to look for similar users based on similar demographic information or characteristics. These users can be technically called “neighbors.” To determine similarity, a “classification” machine-learning mechanism can be used. Similar neighbors can be found by implementing different similarity algorithms or by checking similar user properties. The probability of this problem is significantly lower in intranet applications. The classification algorithm trains the computer to classify similar users in a particular group and teaches the IRS to find the specific group when a user is new to the system.

Mahout currently supports two types of classifier approaches: Naive Bayes classifier and Complementary Naive Bayes classifier. The Naive Bayes classifier is the faster method for problems such as grey sheep recommendations. This classifier works in two parts. First, it analyzes the existing documents or data and finds the common features among them. This step is also called preparing training examples. Second, classification is performed using the model that was created in the first step, applying the Bayes theorem[15] to predict the category of a new item (see Figure 2).

Figure 2: Diagram of Classification Technique (training examples built from user demographic properties feed a training algorithm such as Bayes; the resulting model predicts the category, i.e., the group of similar users, for new items or users).
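The two-part Naive Bayes procedure described above (prepare training examples, then predict a category with the Bayes theorem) can be sketched as follows. This is a minimal illustration in plain Python, not Mahout's classifier: it assumes categorical demographic properties and add-one (Laplace) smoothing, and the names `train` and `classify`, the property names and the group labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Part 1: count property values per group from the training examples.

    examples is a list of (properties dict, group label) pairs.
    """
    group_counts = Counter()
    value_counts = defaultdict(Counter)  # (group, property) -> value counts
    values_seen = defaultdict(set)       # property -> distinct values
    for props, group in examples:
        group_counts[group] += 1
        for prop, value in props.items():
            value_counts[(group, prop)][value] += 1
            values_seen[prop].add(value)
    return group_counts, value_counts, values_seen


def classify(model, props):
    """Part 2: pick the group with the highest posterior log-probability."""
    group_counts, value_counts, values_seen = model
    total = sum(group_counts.values())
    best_group, best_score = None, -math.inf
    for group in sorted(group_counts):
        n = group_counts[group]
        score = math.log(n / total)  # log prior of the group
        for prop, value in props.items():
            count = value_counts[(group, prop)][value]
            vocab = len(values_seen[prop]) + 1  # extra slot for unseen values
            score += math.log((count + 1) / (n + vocab))  # add-one smoothing
        if score > best_score:
            best_group, best_score = group, score
    return best_group


# Hypothetical training examples: demographic properties -> interest group.
examples = [
    ({"department": "materials", "job_title": "scientist", "country": "US"}, "ceramics"),
    ({"department": "materials", "job_title": "engineer", "country": "DE"}, "ceramics"),
    ({"department": "physics", "job_title": "scientist", "country": "US"}, "nanotechnology"),
    ({"department": "physics", "job_title": "researcher", "country": "IN"}, "nanotechnology"),
]
model = train(examples)
new_user = {"department": "physics", "job_title": "engineer", "country": "IN"}
print(classify(model, new_user))  # prints nanotechnology
```

A new user with no visit history can then be served recommendations drawn from the predicted group, which is one way to approach the grey sheep problem described above.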
Finally, sometimes providing recommendations to users only on product ratings will not be accurate; problems may arise when the recommended items originate from another domain to which the user is not related. This problem is known as a “blind recommendation” problem. This problem is mostly prevalent in the post-production phase of recommendation systems development. A solution to this problem is to use evaluators that continuously verify and validate recommendations with actual user details.

Conclusion

Online businesses are working very hard to increase their customer bases by providing competitive prices, products and services. In this frantic effort, recommendation systems can be an important means of bolstering sales and increasing information awareness among customers.

In this paper, we examined how recommendation systems can be applied to the scientific research domain. The successful implementation of an IRS provided our client with flexibility in using pre-existing information resources. It solved the problem of information overload by enabling scientists and researchers to use information resources more efficiently for their research. Our IRS offers testimony that a recommendation system that provides customizable recommendations can enable online companies to perform business more effectively.

Footnotes

[1] http://mahout.apache.org/
[2] https://cwiki.apache.org/MAHOUT/mahout-on-elastic-mapreduce.html as of August 2011
[3] http://www.netflixprize.com/
[4] http://developer.ebay.com/DevZone/xml/docs/Reference/ebay/AddSellingManagerTemplate.html as of August 2011
[5] http://www.practicalecommerce.com/articles/1942-10-Questions-on-Product-Recommendations as of August 2011
[6] http://en.wikipedia.org/wiki/Collaborative_filtering
[7] http://www2007.org/papers/paper570.pdf
[8] http://www.manning.com/owen/Mahout_MEAP_CH01.pdf as of August 2011
[9] http://blogoscoped.com/archive/2007-03-12-n67.html
[10] http://news.cnet.com/8301-27076_3-10358662-248.html
[11] http://en.wikipedia.org/wiki/Jaccard_index
[12] http://en.wikipedia.org/wiki/Likelihood-ratio_test
[13] http://en.wikipedia.org/wiki/Jaccard_index
[14] http://hadoop.apache.org/
[15] http://en.wikipedia.org/wiki/Naive_Bayes_classifier

References

Mahout: http://www.ibm.com/developerworks/opensource/library/j-mahout/index.html
http://ilk.uvt.nl/~toine/phd-thesis/phd-thesis.pdf
GroupLens Research Project: http://www.grouplens.org/papers/pdf/ec-99.pdf
http://en.wikipedia.org/wiki/Recommender_system
About the Authors

Anup Prakash Warade is a Cognizant Associate. He develops and supports multiple projects for key clients in the manufacturing and logistics industry. His interests include designing and developing Java EE Web applications. Anup holds a bachelor of engineering degree in electronics from the University of Pune in India. He can be reached at Anup.Warade@cognizant.com.

Vignesh Murali Natarajan is a Cognizant Senior Associate. He is a technical trainer and a Java architect who is currently managing several Java EE applications for a manufacturing client. Vignesh holds a bachelor’s degree in computer science and engineering from the University of Madras. He can be reached at Vigneshmurali.Natarajan@cognizant.com.

Siddharth Sharad Chandak is a Cognizant Project Manager overseeing numerous business-critical projects for clients in the manufacturing and logistics industry. His interests include architecting and designing enterprise software applications, software performance tuning and project management. Siddharth holds a master of science degree in computer science from Kansas State University. He can be reached at Siddharth.Chandak@cognizant.com.

About Cognizant

Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world’s leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 50 delivery centers worldwide and approximately 130,000 employees as of September 30, 2011, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.

World Headquarters
500 Frank W. Burr Blvd.
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com

European Headquarters
1 Kingdom Street
Paddington Central
London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email: infouk@cognizant.com

India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com

© Copyright 2012, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.