2. Introduction
• Discovery of models for data
• Example: if the data is a set of numbers, we can
assume that it comes from a Gaussian distribution and
estimate the parameters that define it completely
• Recognize meaningful patterns in data -> data
mining
• Predict outcomes from known patterns -> ML
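The Gaussian example above can be sketched in a few lines. This is a minimal illustration, assuming the two parameters are the mean and standard deviation estimated from the sample; the numbers in `data` are made up.

```python
import statistics

# Hypothetical sample; any list of numbers would do.
data = [4.0, 5.0, 6.0, 5.0, 5.0]

# A Gaussian is fully defined by two parameters: mean and standard deviation.
mu = statistics.mean(data)
sigma = statistics.pstdev(data)  # population (maximum-likelihood) estimate

# The fitted model can then score how plausible a new value is.
model = statistics.NormalDist(mu, sigma)
print(mu, sigma)
print(model.pdf(5.0), model.pdf(7.0))
```

Values near the estimated mean get a higher density than values far from it, which is what "modelling the data" means here.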
3. Data Mining Techniques
• Classification
• Predicting the class of a new item, given a set of classes
and past labeled instances
• Example: loan approval based on decision tree classifiers
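[Figure: decision tree for loan approval. The root node splits on Job
(Engineer, Carpenter, Doctor); each job branch then splits on Income,
and each leaf is labeled Good or Bad. Income splits shown in the figure:
<30K Bad / >50K Good; <40K Bad / >90K Good; <50K Bad / >100K Good.]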
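A decision tree of this kind amounts to nested tests on the attributes. The sketch below is only illustrative: the function name `classify_loan` is hypothetical, and the assignment of income thresholds to jobs is an assumption, since the figure does not survive well enough to confirm it.

```python
# Assumed income thresholds per job; the mapping of thresholds to jobs
# is a placeholder, not taken from the original figure.
INCOME_THRESHOLD = {
    "Engineer": 50_000,
    "Carpenter": 90_000,
    "Doctor": 100_000,
}

def classify_loan(job: str, income: float) -> str:
    """Walk a two-level tree: split on job first, then on income."""
    threshold = INCOME_THRESHOLD.get(job)
    if threshold is None:
        return "Unknown"  # job not covered by the tree
    return "Good" if income > threshold else "Bad"

print(classify_loan("Engineer", 60_000))
print(classify_loan("Doctor", 40_000))
```

In practice the splits and thresholds are learned from past loan records rather than written by hand.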
4. • Clustering
• Clustering algorithms find groups of items that are similar
• They divide a dataset so that records with similar
content are in the same group and groups are as different as
possible from each other
• K-Nearest Neighbor: a classification method that classifies
a point based on the distances between it and the
other points in the training dataset
• Example: car sales
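The distance-based idea behind K-Nearest Neighbor can be sketched as follows. The function name `knn_classify`, the two features (price and age), and the training points are all assumptions chosen to echo the car sales example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of ((x, y), label) pairs; query: an (x, y) point."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Majority vote among the labels of the k closest points.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical car data: (price in $K, age in years) -> market segment
train = [((10, 8), "budget"), ((12, 7), "budget"), ((11, 9), "budget"),
         ((40, 1), "premium"), ((45, 2), "premium"), ((42, 1), "premium")]

print(knn_classify(train, (41, 2)))
```

The choice of `k` and of the distance measure both affect the result; Euclidean distance with `k=3` is just one common starting point.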
5. • Regression
• Deals with predicting a numeric value rather than a class
• Given x1, x2, x3, ..., predict Y
• Use linear regression to estimate the coefficients a0, a1, a2, ... in
Y = a0 + a1x1 + a2x2 + ...
• Use line-fitting and curve-fitting methods
• Example: find a relationship between smoking and
cancer-related illness in patients
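For a single input variable, the coefficients in Y = a0 + a1x1 have a closed-form least-squares solution. The sketch below assumes that one-variable case; `fit_line` is a hypothetical helper and the points are made up so that they lie exactly on a line.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a0 (intercept) and a1 (slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Made-up points lying exactly on Y = 1 + 2x
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```

With several input variables the same idea is solved as a linear system, usually with a numerical library.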
6. • Association Rules
• These algorithms create rules that describe how often
events have occurred together
• Example: when a customer buys a hammer, 90% of the
time they also buy nails
• Spam classification based on conditional probability
• Support is a measure of what fraction of the population
satisfies both the antecedent and the consequent of the
rule
• Confidence is the measure of how often the consequent is
true when the antecedent is true
• Outlier Analysis
• Most data mining methods discard outliers as noise or
exceptions
• However, in some applications, such as fraud detection,
these rare events can be more interesting than the regular ones
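The support and confidence measures defined for association rules can be computed directly from a list of transactions. This is a minimal sketch; the function names and the baskets are hypothetical.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)

# Hypothetical shopping baskets
baskets = [{"hammer", "nails"}, {"hammer", "nails", "glue"},
           {"hammer"}, {"nails"}, {"paint"}]

s = support(baskets, {"hammer"}, {"nails"})
c = confidence(baskets, {"hammer"}, {"nails"})
print(s, c)
```

Here the rule "hammer -> nails" holds in 2 of 5 baskets (support) and in 2 of the 3 baskets that contain a hammer (confidence).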
7. Knowledge Discovery Process
• Data Collection
• Data Cleaning
• Data Integration
• Data Selection
• Data Transformation
• Data Mining
• Evaluation
• Knowledge Presentation
8. Applications of Data Mining
• Marketing
• Analysis of consumer behavior
• Advertising campaigns
• Targeted mailings
• Segmentation of customers, stores, or products
• Manufacturing
• Optimization of resources
• Optimization of manufacturing processes
• Product design based on customer requirements
• Finance
• Creditworthiness of clients
• Performance analysis of finance investments
• Fraud detection
• Health Care
• Discovering patterns in X-ray images
• Analyzing side effects of drugs
• Effectiveness of treatments
9. Privacy Concerns
• Effective data mining requires large sources of data
• To cover a wide spectrum of data, multiple data
sources are linked
• Linking sources can be problematic for privacy.
For example, if the following histories of a customer were
linked:
• Shopping History
• Credit History
• Bank History
• Employment History
• The user's life story could be pieced together from the collected data
10. Recommendation systems
• Definition: recommendation systems (RS) are a subclass of
information filtering systems that seek to predict the rating
or preference that a user would give to an item
• Enhance user experience by helping the user find
information and reducing search and navigation time
• Increase productivity and credibility
• Decrease the long-tail phenomenon
• Types of RS
• Content based RS
• Collaborative filtering RS
• Hybrid RS
11. • Content based RS
• Recommend items similar to those the user preferred in
the past
• User profiling is the key
• Items/content are usually described by keywords
• Limitations
• Not all content is well represented by keywords (e.g., images)
• Unrated items are not shown
• Users with thousands of purchases are a problem
• Example: Pandora uses the properties of a song from the Music
Genome Project to play similar songs
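Content-based recommendation can be sketched as keyword matching between a user profile and the catalog. The example below assumes Jaccard overlap as the similarity measure, which is one simple choice among several; the catalog, keywords, and function names are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two keyword sets: 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(profile, catalog, top_n=2):
    """Rank catalog items by keyword similarity to the user's profile."""
    scored = sorted(catalog.items(),
                    key=lambda kv: jaccard(profile, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_n]]

# Hypothetical catalog: item -> descriptive keywords
catalog = {
    "song_a": {"rock", "guitar", "fast"},
    "song_b": {"jazz", "piano", "slow"},
    "song_c": {"rock", "guitar", "slow"},
}
# Profile built from keywords of items the user liked in the past
profile = {"rock", "guitar", "fast"}

print(recommend(profile, catalog))
```

The keyword limitation noted above shows up directly: an item with no descriptive keywords scores 0 against every profile and is never recommended.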
12. • Collaborative Filtering method
• Uses other users' ratings for recommendation
• Key is to find users or user groups whose interests match those of the
current user
• More users, more ratings: better results
• Limitations
• Cold Start problem
• Large computation power required
• Sparsity
• Example: Last.fm and Spotify recommend songs by comparing a
user's listening history with that of other users.
Facebook and LinkedIn use collaborative filtering to
recommend new friends and connections
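User-based collaborative filtering can be sketched as a similarity-weighted average of other users' ratings. The code below assumes cosine similarity over co-rated items; the names `cosine` and `predict` and the ratings are hypothetical, and production systems use considerably more elaborate models.

```python
import math

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else None  # None: no usable neighbors

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}
ratings = {
    "ann": {"s1": 5, "s2": 4},
    "bob": {"s1": 5, "s2": 4, "s3": 5},
    "cat": {"s1": 1, "s2": 5, "s3": 1},
    "dan": {},  # new user with no ratings
}

print(predict(ratings, "ann", "s3"))
print(predict(ratings, "dan", "s3"))
```

The prediction for "ann" leans toward "bob", whose ratings match hers, while "dan" gets no prediction at all, which is the cold start problem in miniature.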
13. • Hybrid RS
• In some cases, combining content-based and
collaborative filtering is more effective
• Can overcome the sparsity and cold start problems
• Netflix Prize: offered a prize of $1 million to the team that could
improve the accuracy of Netflix's rating predictions by 10%. The
competition ran from 2006 to 2009 and was won by BellKor's Pragmatic
Chaos, who used an ensemble of 107 algorithms for a single
prediction!
• Amazon item-to-item collaborative filtering
• Compute similarity between item pairs
• Combine the similar items into a recommendation list
• Each vector corresponds to an item, and its dimensions correspond
to the customers who have purchased it
• The similar-items table is built offline
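The item-to-item steps above can be sketched as follows. This is a simplified illustration, not Amazon's implementation: it assumes binary purchase vectors stored as customer sets and cosine similarity between them, and the purchase data and function names are made up.

```python
import math
from itertools import combinations

# Hypothetical purchase data: item -> set of customers who bought it.
# Each set is a binary item vector with one dimension per customer.
purchases = {
    "hammer": {"c1", "c2", "c3"},
    "nails": {"c1", "c2"},
    "paint": {"c3", "c4"},
}

def cosine(a, b):
    """Cosine similarity of two binary vectors stored as customer sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def build_similar_items(purchases):
    """Offline step: score every item pair, keep matches per item, best first."""
    table = {item: [] for item in purchases}
    for x, y in combinations(purchases, 2):
        sim = cosine(purchases[x], purchases[y])
        if sim > 0:
            table[x].append((y, sim))
            table[y].append((x, sim))
    for item in table:
        table[item].sort(key=lambda pair: pair[1], reverse=True)
    return table

similar = build_similar_items(purchases)
print(similar["hammer"])
```

Because the table is computed ahead of time, serving a recommendation is only a lookup of the items a customer has already bought.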