Data Mining and
Recommendation
Systems
- Salil Navgire
Introduction
• Data mining is the discovery of models for data

• Example: if the data is a set of numbers, we can assume
it comes from a Gaussian distribution and estimate the
parameters (mean and variance) that define it completely
• Recognizing meaningful patterns in data -> data mining
• Predicting outcomes from known patterns -> ML
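The Gaussian example above can be sketched in a few lines. This is a minimal illustration, assuming the maximum-likelihood estimates (sample mean, and variance divided by n); the function name `fit_gaussian` and the sample data are made up for the example.

```python
import math

def fit_gaussian(values):
    """Estimate the mean and standard deviation of a Gaussian from a list of numbers.

    Uses the maximum-likelihood estimates: the sample mean and the
    variance computed with a divisor of n (not n - 1).
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)

# Hypothetical data, chosen only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_gaussian(data)
print(f"mean={mean}, std={std}")
```

Once the two parameters are known, the model of the data is complete: any further question about the data is answered from the fitted distribution rather than from the raw numbers.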
Data Mining Techniques
• Classification
• Predicting the class of a new item, given a set of classes
and past instances whose class is known
• Example: loan approval based on a decision tree classifier
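• Decision tree (figure): split first on Job, then on Income
• Engineer: Income <30K -> Bad, Income >50K -> Good
• Carpenter: Income <40K -> Bad, Income >90K -> Good
• Doctor: Income <50K -> Bad, Income >100K -> Good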
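The loan approval tree can be written as a small rule function. This is a sketch under stated assumptions: incomes are in thousands, the Doctor branch is read as low income -> Bad and high income -> Good (matching the other two branches), and incomes between the two thresholds, which the figure does not cover, are reported as "Undetermined". The function name `loan_decision` is hypothetical.

```python
def loan_decision(job, income_k):
    """Classify a loan applicant as 'Good' or 'Bad' using the tree from the slide.

    income_k is the income in thousands. The slide only gives a lower
    and an upper threshold per job, so incomes in between are returned
    as 'Undetermined' (an assumption, not part of the slide).
    """
    # job -> (income below this is Bad, income above this is Good)
    thresholds = {
        "Engineer": (30, 50),
        "Carpenter": (40, 90),
        "Doctor": (50, 100),
    }
    if job not in thresholds:
        return "Unknown"  # job does not appear in the tree
    low, high = thresholds[job]
    if income_k < low:
        return "Bad"
    if income_k > high:
        return "Good"
    return "Undetermined"

print(loan_decision("Engineer", 25))
print(loan_decision("Carpenter", 95))
```

A real decision tree classifier learns these splits from past instances; here they are hard-coded to show how a learned tree is applied to a new item.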
• Clustering
• Clustering algorithms find groups of items that are similar
• They divide a dataset so that records with similar
content are in the same group and groups are as different as
possible from each other
• K-Nearest Neighbor: a related classification method that
classifies a point based on its distances to the other
points in the training dataset
• Example: car sales
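The K-Nearest Neighbor idea can be sketched as follows. The slide gives no details for the car sales example, so the features (two numeric attributes per car), the labels, and the data below are all hypothetical; the sketch uses Euclidean distance and a majority vote among the k closest training points.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training points closest to query.

    train is a list of (point, label) pairs, where point is a tuple of numbers.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: (price score, engine size score) -> segment.
train = [
    ((1.0, 1.0), "economy"),
    ((1.5, 2.0), "economy"),
    ((2.0, 1.5), "economy"),
    ((8.0, 8.0), "luxury"),
    ((8.5, 9.0), "luxury"),
    ((9.0, 8.5), "luxury"),
]

print(knn_classify(train, (2.0, 2.0)))
print(knn_classify(train, (8.0, 9.0)))
```

Note that this requires labeled training points, which is what makes K-Nearest Neighbor a classification method rather than a clustering algorithm.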
• Regression
• Deals with predicting a value rather than a class
• Given x1, x2, x3, ..., predict Y
• Use linear regression to estimate the coefficients a0, a1, a2, ... in
Y = a0 + a1x1 + a2x2 + ...
• Use line fitting and curve fitting methods
• Example: find a relationship between smoking and
cancer-related illness in patients
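For a single predictor, the coefficients a0 and a1 have a closed-form least-squares solution. The sketch below assumes ordinary least squares with one variable x1 (the multi-variable case is usually solved with a linear algebra library); the function name `fit_line` and the data are illustrative only.

```python
def fit_line(xs, ys):
    """Fit Y = a0 + a1 * x by ordinary least squares and return (a0, a1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Hypothetical data lying exactly on Y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```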
• Association Rules
• These algorithms create rules that describe how often
events occur together
• Example: when a customer buys a hammer, 90% of the
time they also buy nails

• Spam classification based on conditional probability
• Support is the fraction of the population that
satisfies both the antecedent and the consequent of the
rule
• Confidence is how often the consequent is
true when the antecedent is true
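Support and confidence follow directly from their definitions. The sketch below assumes each transaction is a set of items; the function names and the five transactions are made up for illustration and do not reproduce the 90% figure from the slide.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions that contain both antecedent and consequent."""
    both = antecedent | consequent
    return sum(1 for t in transactions if both <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    both = antecedent | consequent
    return sum(1 for t in with_antecedent if both <= t) / len(with_antecedent)

# Hypothetical shopping baskets.
transactions = [
    {"hammer", "nails"},
    {"hammer", "nails", "glue"},
    {"hammer"},
    {"nails"},
    {"paint"},
]

s = support(transactions, {"hammer"}, {"nails"})
c = confidence(transactions, {"hammer"}, {"nails"})
print(f"support={s:.2f}, confidence={c:.2f}")
```

Here the rule "hammer -> nails" holds in 2 of 5 baskets (support) and in 2 of the 3 baskets that contain a hammer (confidence).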

• Outlier Analysis
• Most data mining methods discard outliers as noise or
exceptions

• However, in some applications such as fraud detection,
these rare events can be the most interesting part of the data
Knowledge Discovery Process
• Data Collection

• Data Cleaning
• Data Integration
• Data Selection

• Data Transformation
• Data Mining
• Evaluation

• Knowledge Presentation
Applications of Data Mining
• Marketing
• Analysis of consumer behavior
• Advertising campaigns
• Targeted mailings
• Segmentation of customers, stores, or products

• Manufacturing
• Optimization of resources
• Optimization of manufacturing processes
• Product design based on customer requirements

• Finance
• Creditworthiness of clients
• Performance analysis of financial investments
• Fraud detection

• Health Care
• Discovering patterns in X-ray images
• Analyzing side effects of drugs
• Effectiveness of treatments
Privacy Concerns
• Effective data mining requires large sources of data

• To cover a wide spectrum of data, multiple data
sources are linked
• Linking sources can be problematic for privacy. Consider
what happens if the following histories of a customer are
linked:
• Shopping history
• Credit history
• Bank history
• Employment history

• The user's life story can be pieced together from the collected data
Recommendation systems
• Definition: RS are a subclass of information filtering
systems that seek to predict the rating or preference
that a user would give to an item
• Enhance user experience by helping the user find
information and reducing search and navigation time
• Increase productivity and credibility

• Reduce the long tail phenomenon
• Types of RS
• Content based RS
• Collaborative filtering RS
• Hybrid RS
• Content based RS
• Recommend items similar to those the user preferred in
the past
• User profiling is the key
• Items/content are usually described by keywords

• Limitations
• Not all content is well represented by keywords (e.g. images)
• Unrated items are not shown
• Users with thousands of purchases are a problem

• Example: Pandora uses the properties of a song from the Music
Genome Project to play similar songs
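A content based recommender can be sketched with keyword sets. This is a minimal illustration, not how Pandora works: it assumes each item is described by a set of keywords, builds the user profile as the union of keywords of liked items, and ranks unseen items by Jaccard similarity to that profile. The item names, keywords, and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword sets: size of overlap over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(items, liked, top_n=2):
    """Rank items the user has not liked yet by similarity to the user's keyword profile."""
    # User profile: every keyword from the items the user liked.
    profile = set().union(*(items[name] for name in liked))
    scored = [
        (jaccard(profile, keywords), name)
        for name, keywords in items.items()
        if name not in liked
    ]
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_n] if score > 0]

# Hypothetical catalogue of songs and their keywords.
items = {
    "song_a": {"rock", "guitar", "fast"},
    "song_b": {"rock", "guitar", "slow"},
    "song_c": {"jazz", "piano", "slow"},
    "song_d": {"rock", "drums", "fast"},
}

print(recommend(items, ["song_a"]))
```

The sketch also shows the limitation noted above: an item whose content is not captured by keywords can never match the profile.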
• Collaborative filtering method
• Uses other users' ratings for recommendation
• Key is to find users or user groups whose interests match those of the
current user
• More users and more ratings give better results

• Limitations
• Cold start problem
• Large computation power required
• Sparsity

• Example: Last.fm and Spotify recommend songs based on a
user's listening history compared with that of other users.
Facebook and LinkedIn use collaborative filtering to
recommend new friends and connections
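User-based collaborative filtering can be sketched as a similarity-weighted average of other users' ratings. The choices below are assumptions made for illustration: cosine similarity computed over co-rated items, positive ratings on a 1 to 5 scale, and `None` returned when no other user has rated the item. User names, songs, and ratings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two users' ratings, restricted to items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Predict user's rating of item as a similarity-weighted average of other users' ratings."""
    weighted_sum = 0.0
    total_similarity = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        total_similarity += sim
    # No usable neighbour has rated the item: no prediction is possible.
    return weighted_sum / total_similarity if total_similarity else None

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"song1": 5, "song2": 1},
    "bob": {"song1": 5, "song2": 1, "song3": 4},
    "carol": {"song1": 1, "song2": 5, "song3": 2},
}

p = predict(ratings, "alice", "song3")
print(round(p, 2))
```

The prediction leans toward bob's rating because his tastes match alice's. An item nobody has rated yields `None`, which is the cold start problem in miniature, and the loop over all users hints at the computational cost.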
• Hybrid RS
• In some cases, combining content based and
collaborative filtering is more effective than either alone
• Can overcome the sparsity and cold start problems
• Netflix Prize: offered a prize of $1 million to the team that could
improve the accuracy of Netflix's rating predictions by 10%. The
competition ran from 2006 to 2009 and was won by BellKor's Pragmatic
Chaos, who used an ensemble of 107 algorithms for a single
prediction!

• Amazon item-to-item collaborative filtering
• Compute similarity between item pairs
• Combine the similar items into a recommendation list
• Each vector corresponds to an item, and its dimensions correspond
to the customers who have purchased that item
• Similar-items table is built offline
• Measuring similarity
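One common way to measure similarity between item vectors is the cosine of the angle between them. The sketch below assumes binary purchase data (a customer either bought an item or did not), so each item vector is represented as the set of customers who bought it; the function names and purchase data are hypothetical, and this is an illustration rather than Amazon's actual implementation.

```python
import math

def item_vectors(purchases):
    """Turn customer -> items into item -> set of customers who bought it."""
    vectors = {}
    for customer, items in purchases.items():
        for item in items:
            vectors.setdefault(item, set()).add(customer)
    return vectors

def cosine(a, b):
    """Cosine similarity of two binary item vectors given as customer sets."""
    if not a or not b:
        return 0.0
    # For 0/1 vectors the dot product is the number of shared customers
    # and each norm is the square root of the number of buyers.
    return len(a & b) / math.sqrt(len(a) * len(b))

def similar_items_table(purchases):
    """Build, for every item, a list of other items ordered by decreasing similarity."""
    vectors = item_vectors(purchases)
    table = {}
    for item, customers in vectors.items():
        scored = [
            (cosine(customers, other_customers), other)
            for other, other_customers in vectors.items()
            if other != item
        ]
        scored = [(score, other) for score, other in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        table[item] = [other for _, other in scored]
    return table

# Hypothetical purchase histories.
purchases = {
    "c1": {"hammer", "nails"},
    "c2": {"hammer", "nails", "glue"},
    "c3": {"hammer", "paint"},
}

table = similar_items_table(purchases)
print(table["hammer"])
```

Because the table depends only on purchase histories, it can be computed offline and then looked up cheaply when a recommendation list is needed.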
Examples
• E-Commerce: Amazon.com, eBay, Etsy.

• Music: Spotify, Pandora.
• Movie: Netflix.com, IMDb.
• News: Digg, Summly.

• Social Networks: LinkedIn, Facebook, Quora, YouTube
• Apps: Play Store, Cover
