Data mining and its concepts

Data Mining
Prepared by
R. Abhinav Bharadwaj

Overview
 Introduction
 Explanation of Data Mining Techniques
 Advantages
 Applications
 Privacy

Data Mining
 What is Data Mining?
 “The process of semi-automatically analyzing large
databases to find useful patterns” (Silberschatz)
 KDD – “Knowledge Discovery in Databases” (3)
 “Attempts to discover rules and patterns from data”
 Discover Rules → Make Predictions
 Areas of Use
 Internet – Discover needs of customers
 Economics – Predict stock prices
 Science – Predict environmental change
 Medicine – Match patients with similar problems → cure

Example of Data Mining
 A credit card company wants to discover
information about its clients from its databases. It wants to
find:
 Clients who respond to promotions sent as “junk mail”
 Clients who are likely to switch to a
competitor
 Clients who are likely not to pay
 Services that clients use, in order to promote
services affiliated with the credit card company
 Anything else that may help the company
provide or promote services that help its clients
and ultimately make more money.

Data Mining & Data Warehousing
 Data Warehouse: “is a repository (or archive) of
information gathered from multiple sources, stored
under a unified schema, at a single site.”
(Silberschatz)
 Collect data → Store in single repository
 Allows for easier query development as a single
repository can be queried.
 Data Mining:
 Analyzing databases or Data Warehouses to discover
patterns about the data to gain knowledge.
 Knowledge is power.

Data Mining Techniques
 Classification
 Clustering
 Regression
 Association Rules

Classification
 Classification: Given a set of classes and past
instances (training instances) labeled with their
class, classification is the process of predicting
the class of a new item.
 The goal is to identify which class the new item
belongs to.
 Example: A bank wants to classify its Home Loan
Customers into groups according to their response to
bank advertisements. The bank might use the
classifications “Responds Rarely, Responds
Sometimes, Responds Frequently”.
 The bank will then attempt to find rules about the
customers that respond Frequently and Sometimes.
 The rules could be used to predict needs of potential
customers.

Technique for Classification
 Decision-Tree Classifiers
[Figure: decision tree. The root splits on Job (Carpenter, Engineer, Doctor); each job branch then splits on Income, with thresholds ranging from <30K to >100K, and ends in a Bad or Good leaf.]
Predicting credit risk of a person with the jobs specified.
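
A decision tree of this shape can be read as nested tests: first on job, then on income. The Python sketch below is a minimal hand-written version. The income thresholds are hypothetical placeholders, not values read from the figure, and `credit_risk` is an illustrative name.

```python
# Hypothetical income threshold per job (illustrative values only,
# not taken from the figure).
GOOD_INCOME_THRESHOLD = {
    "carpenter": 30_000,
    "engineer": 40_000,
    "doctor": 50_000,
}


def credit_risk(job: str, income: float) -> str:
    """Walk the tree: split on job first, then on income."""
    threshold = GOOD_INCOME_THRESHOLD.get(job.lower())
    if threshold is None:
        raise ValueError(f"no branch for job {job!r}")
    return "good" if income >= threshold else "bad"


print(credit_risk("Engineer", 55_000))  # good
print(credit_risk("Doctor", 45_000))    # bad
```

In practice the splits and thresholds would be learned from training instances rather than written by hand.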

Clustering
 “Clustering algorithms find groups of items
that are similar. … It divides a data set so that
records with similar content are in the same
group, and groups are as different as possible
from each other. ” (2)
 Example: An insurance company could use
clustering to group clients by their age,
location and types of insurance purchased.
 The categories are not specified in advance; this is
referred to as ‘unsupervised learning’.
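
As a rough illustration of grouping without predefined categories, the sketch below clusters clients by age alone, which is a simplification of the insurance example. It uses a simple bottom-up (hierarchical) merge, which is only one of many possible clustering methods. The function name `cluster_ages` and the sample ages are made up for this example.

```python
def cluster_ages(ages, k):
    """Merge the two closest groups repeatedly until k groups remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[a] for a in sorted(ages)]
    while len(clusters) > k:
        # With sorted 1-D data, the closest pair of clusters is always
        # adjacent, so only gaps between neighbours need checking.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


ages = [22, 25, 24, 47, 51, 49, 70, 73]
print(cluster_ages(ages, 3))
# [[22, 24, 25], [47, 49, 51], [70, 73]]
```

No labels are supplied; the three age groups emerge only from how close the values are to each other.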

Clustering
 Group Data into Clusters
 Similar data is grouped in the same cluster
 Dissimilar data is grouped in different clusters
 How is this achieved?
 K-Nearest Neighbor
 A classification method that classifies a point
by calculating the distances between the
point and points in the training data set. Then
it assigns the point to the class that is most
common among its k nearest neighbors
(where k is an integer). (2)
 Hierarchical
 Group data into t-trees
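
The k-nearest-neighbor description above can be sketched in a few lines of Python. This is a minimal illustration that assumes Euclidean distance and a simple majority vote. The function name `knn_classify`, the labels "A" and "B", and the sample points are invented for the example.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Classify `point` by majority vote among its k nearest neighbours.

    `training` is a list of (features, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"),
    ((6.5, 7.0), "B"),
    ((7.0, 6.5), "B"),
]
print(knn_classify((2.0, 2.0), training))  # A
print(knn_classify((6.0, 7.0), training))  # B
```

Note that this method needs training points that already carry class labels, as the definition states.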

Regression
 “Regression deals with the prediction of a value,
rather than a class.” (1, p747)
 Example: Find out if there is a relationship
between smoking and cancer-related illness
in patients.
 Given values: X1, X2, …, Xn
 Objective: predict variable Y
 One way is to estimate coefficients a0, a1, …, an
 Y = a0 + a1X1 + a2X2 + … + anXn
 Linear Regression
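
For the simplest case of a single predictor, Y = a0 + a1X1, the coefficients can be estimated by ordinary least squares. The sketch below assumes that case only; `fit_line` and the sample data are illustrative, and the data were chosen to lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a0 and a1 for Y = a0 + a1*X1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # lies exactly on Y = 1 + 2*X1
a0, a1 = fit_line(xs, ys)
print(a0, a1)  # 1.0 2.0
```

With several predictors X1 … Xn the same idea applies, but the coefficients are usually found with a linear algebra routine rather than by hand.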

Regression
 Example graph:
 Line of Best Fit
 Curve Fitting

Association Rules
 “An association algorithm creates rules that
describe how often events have occurred
together.” (2)
 Example: When a customer buys a hammer,
then 90% of the time they will buy nails.

Association Rules
 Support: “is a measure of what fraction of the
population satisfies both the antecedent and the
consequent of the rule” (1, p748)
 Example:
 Hotdog buns and hotdog sausages are bought together in
99% of cases. = High support
 Hotdog buns and hangers are bought together in 0.005% of
cases. = Low support
 Rules with high support are worth careful
attention
 E.g. Hotdog sausages should be placed near hotdog buns
in supermarkets if there is also high confidence.

Association Rules
 Confidence: “is a measure of how often the consequent
is true when the antecedent is true.” (1, p748)
 Example:
 90% of Hotdog bun purchases are accompanied by hotdog
sausages.
 High confidence is meaningful as we can derive rules.
 Hotdog bun → Hotdog sausage
 Two rules may have different confidence levels and
have the same support.
 E.g. Hotdog sausage → Hotdog bun may have a
much lower confidence than Hotdog bun → Hotdog
sausage, yet they both can have the same support.
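
Support and confidence, as defined in (1), can both be computed directly from a list of transactions. The sketch below uses a small made-up set of purchases. The function names and the data are illustrative, not taken from any source.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also
    contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Made-up purchases: buns appear 3 times, sausages 6 times, both 3 times.
transactions = [
    {"bun", "sausage"},
    {"bun", "sausage"},
    {"bun", "sausage", "mustard"},
    {"sausage"},
    {"sausage", "bread"},
    {"sausage", "mustard"},
    {"milk"},
    {"bread"},
]

print(support(transactions, {"bun", "sausage"}))       # 0.375
print(confidence(transactions, {"bun"}, {"sausage"}))  # 1.0
print(confidence(transactions, {"sausage"}, {"bun"}))  # 0.5
```

Both rules share the same support (0.375) because support counts transactions containing both items, while the confidence differs depending on which item is the antecedent.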

Advantages of Data Mining
 Provides new knowledge from existing data
 Public databases
 Government sources
 Company Databases
 Old data can be used to develop new knowledge
 New knowledge can be used to improve services or
products
 Improvements lead to:
 Bigger profits
 More efficient service

Uses of Data Mining
 Sales/ Marketing
 Diversify target market
 Identify clients' needs to increase response rates
 Risk Assessment
 Identify customers who pose high credit risk
 Fraud Detection
 Identify people misusing the system, e.g. people
who have two Social Security numbers
 Customer Care
 Identify customers likely to change providers
 Identify customer needs

Applications of Data Mining
(4)
Source IDC 1998

Privacy Concerns
 Effective Data Mining requires large sources of data
 To cover a wide spectrum of data, multiple data
sources must be linked
 Linking sources can be problematic for privacy.
For example, if the following histories of a customer were
linked:
 Shopping History
 Credit History
 Bank History
 Employment History
 The user's life story could be pieced together from the
collected data

References
1. Silberschatz, Korth, Sudarshan, “Database System
Concepts”, 5th Edition, McGraw-Hill, 2005
2. http://www.twocrows.com/glossary.htm, “Two Crows,
Data Mining Glossary”
3. http://en.wikipedia.org/wiki/Data_mining, “Wikipedia”
4. http://phoenix.phys.clemson.edu/tutorials/excel/regression.html
5. http://wwwmaths.anu.edu.au/~steve/pdcn.pdf
