Data Mining
Prepared by
R. Abhinav Bharadwaj
Overview
 Introduction
 Explanation of Data Mining Techniques
 Advantages
 Applications
 Privacy
Data Mining
 What is Data Mining?
 “The process of semi-automatically analyzing large
databases to find useful patterns” (Silberschatz)
 KDD – “Knowledge Discovery in Databases” (3)
 “Attempts to discover rules and patterns from data”
 Discover Rules → Make Predictions
 Areas of Use
 Internet – Discover needs of customers
 Economics – Predict stock prices
 Science – Predict environmental change
 Medicine – Match patients with similar problems → cure
Example of Data Mining
 A credit card company wants to discover
information about its clients from its databases. It wants to
find:
 Clients who respond to promotions in “Junk Mail”
 Clients who are likely to switch to a
competitor
 Clients who are likely not to pay
 Services that clients use, so that the company can promote
services affiliated with it
 Anything else that may help the company
provide or promote services that help its clients
and ultimately make more money.
Data Mining & Data Warehousing
 Data Warehouse: “is a repository (or archive) of
information gathered from multiple sources, stored
under a unified schema, at a single site.”
(Silberschatz)
 Collect data → Store in single repository
 Allows for easier query development as a single
repository can be queried.
 Data Mining:
 Analyzing databases or Data Warehouses to discover
patterns about the data to gain knowledge.
 Knowledge is power.
Discovery of Knowledge
Data Mining Techniques
 Classification
 Clustering
 Regression
 Association Rules
Classification
 Classification: Given a set of items that each belong to one of
several classes, and given past instances (training
instances) with their known classes, classification
is the process of predicting the class of a new item.
 The goal is therefore to classify the new item, that is, to
identify the class to which it belongs.
 Example: A bank wants to classify its Home Loan
Customers into groups according to their response to
bank advertisements. The bank might use the
classifications “Responds Rarely, Responds
Sometimes, Responds Frequently”.
 The bank will then attempt to find rules about the
customers that respond Frequently and Sometimes.
 The rules could be used to predict needs of potential
customers.
Technique for Classification
 Decision-Tree Classifiers
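[Figure: decision tree. The root node splits on Job (Carpenter, Engineer, Doctor); each branch then splits on Income, and each leaf is labeled Bad or Good. Income thresholds shown in the figure: <30K, <40K, <50K, >50K, >90K, >100K.]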
Predicting credit risk of a person with the jobs specified.
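
The tree described above can be sketched as a small function. This is only an illustrative sketch: the exact cut-offs of the original figure cannot be recovered from the slide text, so the thresholds in `INCOME_THRESHOLDS` and the function name `credit_risk` are assumptions made for the example.

```python
# Hypothetical income cut-offs per job; the real figure's values are not recoverable.
INCOME_THRESHOLDS = {
    "carpenter": 30_000,
    "engineer": 50_000,
    "doctor": 100_000,
}


def credit_risk(job: str, income: float) -> str:
    """Walk a two-level decision tree: split on job first, then on income."""
    threshold = INCOME_THRESHOLDS.get(job.lower())
    if threshold is None:
        raise ValueError(f"no branch in the tree for job {job!r}")
    # Leaf node: income at or above the job's cut-off is classed as a good risk.
    return "good" if income >= threshold else "bad"


print(credit_risk("Engineer", 65_000))  # good
print(credit_risk("Doctor", 80_000))    # bad
```

In practice the splits would be learned from training instances rather than written by hand; the hand-coded version only shows how a finished tree is used to classify a new item.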
Clustering
 “Clustering algorithms find groups of items
that are similar. … It divides a data set so that
records with similar content are in the same
group, and groups are as different as possible
from each other. ” (2)
 Example: Insurance company could use
clustering to group clients by their age,
location and types of insurance purchased.
 The categories are not specified in advance; this is
referred to as ‘unsupervised learning’
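
As a rough illustration of grouping clients by age without predefined categories, the sketch below repeatedly merges the two closest groups until a chosen number of groups remains (single-linkage agglomerative clustering). The technique, the function name `cluster_ages`, and the sample ages are assumptions chosen for the example, not something the slides prescribe.

```python
def cluster_ages(ages, k):
    """Merge the two closest groups repeatedly until only k groups remain."""
    clusters = [[age] for age in sorted(ages)]
    while len(clusters) > k:
        # For sorted one-dimensional data the closest pair of groups is always
        # adjacent, so only the gaps between neighbouring groups are compared.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


# Hypothetical client ages.
print(cluster_ages([22, 25, 24, 41, 43, 67, 70], k=3))
# [[22, 24, 25], [41, 43], [67, 70]]
```

No class labels are supplied anywhere; the groups emerge only from how close the values are to each other.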
Clustering
 Group Data into Clusters
 Similar data is grouped in the same cluster
 Dissimilar data is grouped in different clusters
 How is this achieved?
 K-Nearest Neighbor
 A classification method that classifies a point
by calculating the distances between the
point and points in the training data set. Then
it assigns the point to the class that is most
common among its k-nearest neighbors
(where k is an integer).(2)
 Hierarchical
 Group data into trees
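
The k-nearest-neighbor description above can be sketched in a few lines. This is a minimal sketch assuming Euclidean distance and a simple majority vote; the function name `knn_classify` and the training data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Return the most common class among the k training points nearest to `point`.

    `training` is a list of (features, label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (age, income in thousands) -> response class.
training = [
    ((25, 30), "rarely"),
    ((27, 32), "rarely"),
    ((45, 80), "frequently"),
    ((50, 85), "frequently"),
    ((48, 78), "frequently"),
]

print(knn_classify((47, 82), training, k=3))  # frequently
```

Unlike clustering, this method needs labeled training instances, which is why the definition above calls it a classification method.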
Regression
 “Regression deals with the prediction of a value,
rather than a class.” (1, P747)
 Example: Find out if there is a relationship
between smoking and cancer-related
illness in patients.
 Given values: X1, X2, ..., Xn
 Objective: predict the variable Y
 One way is to estimate coefficients a0, a1, ..., an such that
 Y = a0 + a1X1 + a2X2 + … + anXn
 Linear Regression
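
For the simplest case of a single input X1, the coefficients a0 and a1 can be estimated by ordinary least squares. The sketch below assumes that method; the function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate a0 and a1 in Y = a0 + a1*X1 by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical observations lying exactly on Y = 1 + 2*X1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a0, a1 = fit_line(xs, ys)
print(a0, a1)  # 1.0 2.0

# Predict Y for a new value of X1.
print(a0 + a1 * 5)  # 11.0
```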
Regression
 Example graph:
 Line of Best Fit
 Curve Fitting
Association Rules
 “An association algorithm creates rules that
describe how often events have occurred
together.” (2)
 Example: When a customer buys a hammer,
then 90% of the time they will buy nails.
Association Rules
 Support: “is a measure of what fraction of the
population satisfies both the antecedent and the
consequent of the rule”(1, p748)
 Example:
 Hotdog buns and hotdog sausages are bought together in a large
fraction of all purchases = high support
 Hotdog buns and hangers are bought together in only 0.005% of
all purchases = low support
 Rules with high support
are worth careful attention
 E.g. hotdog sausages should be placed near hotdog buns
in supermarkets if there is also high confidence.
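
Support can be computed directly from the quoted definition: count the purchases that contain every item in the rule and divide by the total number of purchases. The baskets and the function name `support` below are hypothetical.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


# Hypothetical shopping baskets.
baskets = [
    {"buns", "sausages"},
    {"buns", "sausages", "mustard"},
    {"sausages"},
    {"buns", "hangers"},
    {"milk"},
]

print(support(baskets, {"buns", "sausages"}))  # 0.4
print(support(baskets, {"buns", "hangers"}))   # 0.2
```

Note that the denominator is the whole population of purchases, not only the purchases that include buns.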
Association Rules
 Confidence: “is a measure of how often the consequent
is true when the antecedent is true.” (1, p748)
 Example:
 90% of Hotdog bun purchases are accompanied by hotdog
sausages.
 High confidence is meaningful as we can derive rules.
 Hotdog bun → Hotdog sausage
 Two rules may have different confidence levels and
have the same support.
 E.g. Hotdog sausage → Hotdog bun may have a
much lower confidence than Hotdog bun → Hotdog
sausage, yet both have the same support.
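
The asymmetry between the two rules can be checked on a small example. In the sketch below the baskets and the function name `confidence` are hypothetical; confidence is computed as the fraction of purchases containing the antecedent that also contain the consequent.

```python
def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    both = sum(consequent <= set(t) for t in with_antecedent)
    return both / len(with_antecedent)


# Hypothetical baskets: buns always come with sausages, but not the reverse.
baskets = [
    {"buns", "sausages"},
    {"buns", "sausages"},
    {"sausages"},
    {"sausages"},
    {"sausages", "mustard"},
]

print(confidence(baskets, {"buns"}, {"sausages"}))  # 1.0
print(confidence(baskets, {"sausages"}, {"buns"}))  # 0.4
```

Both rules involve the same two baskets out of five, so their support is identical (0.4), while their confidence differs.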
Advantages of Data Mining
 Provides new knowledge from existing data
 Public databases
 Government sources
 Company Databases
 Old data can be used to develop new knowledge
 New knowledge can be used to improve services or
products
 Improvements lead to:
 Bigger profits
 More efficient service
Uses of Data Mining
 Sales/ Marketing
 Diversify target market
 Identify clients’ needs to increase response rates
 Risk Assessment
 Identify Customers that pose high credit risk
 Fraud Detection
 Identify people misusing the system. E.g. People
who have two Social Security Numbers
 Customer Care
 Identify customers likely to change providers
 Identify customer needs
Applications of Data Mining
[Figure: applications of data mining. Source: IDC, 1998 (4)]
Privacy Concerns
 Effective Data Mining requires large sources of data
 To achieve a wide spectrum of data, link multiple data
sources
 Linking sources can be problematic for privacy. If the
following histories of a customer were
linked:
 Shopping History
 Credit History
 Bank History
 Employment History
 The customer’s life story could be pieced together from the collected
data
References
1. Silberschatz, Korth, Sudarshan, “Database System
Concepts”, 5th Edition, McGraw-Hill, 2005
2. http://www.twocrows.com/glossary.htm, “Two Crows,
Data Mining Glossary”
3. http://en.wikipedia.org/wiki/Data_mining, “Wikipedia”
4. http://phoenix.phys.clemson.edu/tutorials/excel/regression.html
5. http://wwwmaths.anu.edu.au/~steve/pdcn.pdf
