Data mining and its concepts

Data Mining
Prepared by
R. Abhinav Bharadwaj

Overview
 Introduction
 Explanation of Data Mining Techniques
 Advantages
 Applications
 Privacy

Data Mining
 What is Data Mining?
 “The process of semi automatically analyzing large
databases to find useful patterns” (Silberschatz)
 KDD – “Knowledge Discovery in Databases” (3)
 “Attempts to discover rules and patterns from data”
 Discover Rules → Make Predictions
 Areas of Use
 Internet – Discover needs of customers
 Economics – Predict stock prices
 Science – Predict environmental change
 Medicine – Match patients with similar problems →
cure

Example of Data Mining
 A credit card company wants to discover
information about its clients from its databases. It wants to
find:
 Clients who respond to promotions sent as “junk mail”
 Clients who are likely to switch to a
competitor
 Clients who are likely not to pay
 Services that clients use, so that it can promote
services affiliated with the credit card company
 Anything else that may help the company
provide or promote services to its clients
and ultimately make more money.

Data Mining & Data Warehousing
 Data Warehouse: “is a repository (or archive) of
information gathered from multiple sources, stored
under a unified schema, at a single site.”
(Silberschatz)
 Collect data → Store in single repository
 Allows for easier query development as a single
repository can be queried.
 Data Mining:
 Analyzing databases or Data Warehouses to discover
patterns about the data to gain knowledge.
 Knowledge is power.

Data Mining Techniques
 Classification
 Clustering
 Regression
 Association Rules

Classification
 Classification: Given a set of classes and past
instances (training instances) whose classes are
known, classification is the process of predicting
the class of a new item.
 That is, the new item is assigned to the class it
most likely belongs to.
 Example: A bank wants to classify its Home Loan
Customers into groups according to their response to
bank advertisements. The bank might use the
classifications “Responds Rarely, Responds
Sometimes, Responds Frequently”.
 The bank will then attempt to find rules about the
customers that respond Frequently and Sometimes.
 The rules could be used to predict needs of potential
customers.

Technique for Classification
 Decision-Tree Classifiers
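[Figure: decision tree. The root node splits on Job (Carpenter, Engineer, Doctor); each branch then splits on Income, and each leaf is labelled Bad or Good. Income cut-offs shown: <30K, <40K, <50K, >50K, >90K, >100K.]
Caption: Predicting credit risk of a person with the jobs specified.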
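
A decision tree of this shape (split on job first, then on income) can be written as a small lookup plus one comparison. The sketch below is only illustrative: the exact income cut-offs of the original figure are not recoverable, so the thresholds, the function name `credit_risk`, and the lower-case job keys are all assumptions.

```python
# Hypothetical income cut-offs in thousands, one per job branch.
# These values are assumptions for illustration, not the slide's figures.
GOOD_INCOME_THRESHOLD = {
    "carpenter": 50,
    "engineer": 90,
    "doctor": 100,
}


def credit_risk(job: str, income_k: float) -> str:
    """Walk a two-level decision tree: split on job, then on income."""
    threshold = GOOD_INCOME_THRESHOLD.get(job.lower())
    if threshold is None:
        # The tree has no branch for jobs it was not built for.
        raise ValueError(f"no branch for job {job!r}")
    return "good" if income_k > threshold else "bad"


print(credit_risk("Doctor", 120))
print(credit_risk("Engineer", 60))
```

In practice the tree would be learned from training instances rather than written by hand; the hand-coded form only shows how a finished tree classifies a new item.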

Clustering
 “Clustering algorithms find groups of items
that are similar. … It divides a data set so that
records with similar content are in the same
group, and groups are as different as possible
from each other. ” (2)
 Example: Insurance company could use
clustering to group clients by their age,
location and types of insurance purchased.
 The categories are not specified in advance; this is
referred to as ‘unsupervised learning’

Clustering
 Group Data into Clusters
 Similar data is grouped in the same cluster
 Dissimilar data is grouped in different clusters
 How is this achieved?
 K-Nearest Neighbor
 A classification method that classifies a point
by calculating the distances between the
point and points in the training data set. Then
it assigns the point to the class that is most
common among its k-nearest neighbors
(where k is an integer).(2)
 Hierarchical
 Group data into trees (nested clusters)
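
The k-nearest-neighbor procedure described above can be sketched in a few lines. This is a minimal illustration, assuming numeric coordinates and Euclidean distance; the function name `knn_classify` and the toy training set are made up for the example.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_classify(point, training, k=3):
    """Assign `point` the most common class among its k nearest neighbors.

    `training` is a list of (coordinates, label) pairs.
    """
    # Sort training instances by distance to the point and keep the k closest.
    nearest = sorted(training, key=lambda item: dist(point, item[0]))[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training data (hypothetical): two well-separated groups.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_classify((1.2, 1.4), training, k=3))
```

Because the method needs labelled training instances, it is a classification technique; clustering proper works without predefined labels.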

Regression
 “Regression deals with the prediction of a value,
rather than a class.” (1, p747)
 Example: Find out if there is a relationship
between smoking and cancer-related
illness in patients.
 Given values: X1, X2, ..., Xn
 Objective: predict variable Y
 One way is to estimate coefficients a0, a1, ..., an such that
 Y = a0 + a1X1 + a2X2 + … + anXn
 Linear Regression
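
For the simplest case of a single predictor (Y = a0 + a1X1), the coefficients can be estimated with ordinary least squares. The sketch below assumes that one-variable case; the function name `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Estimate a0 and a1 in Y = a0 + a1*X by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Made-up data lying exactly on Y = 1 + 2X.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
a0, a1 = fit_line(xs, ys)
print(a0, a1)

# Predict Y for a new X value using the fitted coefficients.
print(a0 + a1 * 4)
```

With several predictors X1 ... Xn the same idea applies, but the coefficients are usually found by solving a system of linear equations rather than with this closed form.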

Regression
 Example graphs (not reproduced here):
 Line of Best Fit
 Curve Fitting

Association Rules
 “An association algorithm creates rules that
describe how often events have occurred
together.” (2)
 Example: When a customer buys a hammer,
then 90% of the time they will buy nails.

Association Rules
 Support: “is a measure of what fraction of the
population satisfies both the antecedent and the
consequent of the rule”(1, p748)
 Example:
 Hotdog buns and hotdog sausages appear together in a large
fraction of all purchases = high support
 Hotdog buns and hangers appear together in only 0.005% of
all purchases = low support
 Rules with high support are
worth careful attention
 E.g. hotdog sausages should be placed near hotdog buns
in supermarkets if there is also high confidence.

Association Rules
 Confidence: “is a measure of how often the consequent
is true when the antecedent is true.” (1, p748)
 Example:
 90% of Hotdog bun purchases are accompanied by hotdog
sausages.
 High confidence is meaningful because it lets us derive rules.
 Hotdog bun → Hotdog sausage
 Two rules may have different confidence levels and
still have the same support.
 E.g. Hotdog sausage → Hotdog bun may have a
much lower confidence than Hotdog bun → Hotdog
sausage, yet both have the same support, because support
counts the purchases that contain both items.
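
Support and confidence can both be computed by counting transactions. The sketch below uses a made-up list of five purchases; the function names `support` and `confidence` and the item names are assumptions for illustration.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """How often the consequent holds when the antecedent holds.

    Assumes the antecedent occurs at least once (otherwise division by zero).
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical purchases: every bun buyer also buys sausages,
# but not every sausage buyer buys buns.
transactions = [
    {"bun", "sausage"},
    {"bun", "sausage", "mustard"},
    {"sausage"},
    {"sausage", "mustard"},
    {"bun", "sausage"},
]

print(support(transactions, {"bun", "sausage"}))       # same for both rules
print(confidence(transactions, {"bun"}, {"sausage"}))  # bun -> sausage
print(confidence(transactions, {"sausage"}, {"bun"}))  # sausage -> bun
```

In this toy data the two rules share one support value but differ in confidence, which is the situation described above.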

Advantages of Data Mining
 Provides new knowledge from existing data
 Public databases
 Government sources
 Company Databases
 Old data can be used to develop new knowledge
 New knowledge can be used to improve services or
products
 Improvements lead to:
 Bigger profits
 More efficient service

Uses of Data Mining
 Sales/ Marketing
 Diversify target market
 Identify clients' needs to increase response rates
 Risk Assessment
 Identify Customers that pose high credit risk
 Fraud Detection
 Identify people misusing the system. E.g. People
who have two Social Security Numbers
 Customer Care
 Identify customers likely to change providers
 Identify customer needs

Applications of Data Mining
[Chart not reproduced] (4)
Source: IDC 1998

Privacy Concerns
 Effective data mining requires large sources of data
 To achieve a wide spectrum of data, multiple data
sources are linked
 Linking sources can be problematic for privacy.
For example, if the following histories of a customer were
linked:
 Shopping History
 Credit History
 Bank History
 Employment History
 the user's life story could be pieced together from the collected
data

References
1. Silberschatz, Korth, Sudarshan, “Database System Concepts”, 5th Edition, McGraw-Hill, 2005
2. http://www.twocrows.com/glossary.htm, “Two Crows, Data Mining Glossary”
3. http://en.wikipedia.org/wiki/Data_mining, “Wikipedia”
4. http://phoenix.phys.clemson.edu/tutorials/excel/regression.html
5. http://wwwmaths.anu.edu.au/~steve/pdcn.pdf
