Data mining draws on a data warehouse to support decisions; the data warehouse itself exists to support decision making.
Data mining can also be applied to operational databases that hold individual transactions.
Data mining helps in extracting meaningful new patterns.
Data mining applications should be considered during the design of a data warehouse. The successful use of data mining applications depends on how the data warehouse is constructed.
5.
Define Data Mining
Data mining is sorting through data to identify patterns and establish relationships.
Data mining parameters include:
Association - looking for patterns where one event is connected to another event
Sequence or path analysis - looking for patterns where one event leads to another later event
Classification - looking for new patterns (this may change the way the data is organized, which is acceptable)
Clustering - finding and visually documenting groups of facts not previously known
6.
Part 2: Association Rules
7.
Association Rules
Association rules describe relationships between sets of items in a large database.
8.
Why Association Rules?
Bread, milk
Milk, sugar
Pen, ink
9.
The general form of an association rule is
X → Y
X: a set of items {x1, x2, …, xn}
Y: a set of items {y1, y2, y3, …, yn}
The rule states that database tuples that satisfy the condition in X are also likely to satisfy the condition in Y.
10.
Consider the Purchase Table
Retail shops are often interested in associations between the different items that people buy. Referring to the table given above, it is clear that
People who buy a pen also buy ink.
People who buy bread also buy milk.
11.
Association rule measures
Support
Confidence
12.
Support
Support is the percentage of transactions that contain all the items in the union of the LHS and RHS.
Suppose the rule Pen → Ink has a support of 75%. That is, the items in LHS ∪ RHS occur together in 75% of transactions. A higher support means the rule applies to a larger share of the transactions.
13.
Confidence
Confidence is the percentage of transactions containing the items in the LHS that also contain the items in the RHS.
Confidence is a measure of how often the rule is true.
Bread → Milk
A confidence of 80% means that 80% of the purchases that include bread also include milk.
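The two measures can be sketched in a few lines of Python. This is a minimal illustration, not a full rule-mining algorithm: the function names and the four transactions are hypothetical, invented so that the Pen → Ink rule has 75% support.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions containing `lhs`, the fraction that also contain `rhs`."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical purchases, invented for illustration only.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "ink", "bread"},
    {"bread", "milk"},
]

print(support(transactions, {"pen", "ink"}))          # 0.75
print(confidence(transactions, {"pen"}, {"ink"}))     # 1.0
print(confidence(transactions, {"bread"}, {"milk"}))  # 0.5
```

Support is computed over LHS ∪ RHS, while confidence divides that figure by the support of the LHS alone, so a rule can have high confidence and still low support.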
14.
Part 3: Classification
Classification rules
Decision trees
Mathematical formula
Neural network
15.
Some basic operations
Predictive:
Regression
Classification
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
16.
Classification
Given old data about customers and payments, predict new applicant’s loan eligibility.
[Figure: records of previous customers (age, salary, profession, location, customer type) are fed to a classifier, which produces decision rules such as Salary > 5 L and Prof. = Exec; the rules label a new applicant's data as good/bad.]
17.
Classification
Classification is a data mining (machine learning) technique used to predict group membership for data instances.
18.
Why Data Mining
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection:
Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
19.
Classification
Classification is defined as a process of finding a set of functions that describe and distinguish data classes.
Training data → Classification algorithm → Classification rules
Example rule: If age = “31…40” and income = high then rating = good.
Training data:
Name   Age     Income   Rating
abc    20      low      fair
xyz    31…40   Medium   Good
mny    40…50   High     Excellent
20.
Classification
Using this function, we can determine the classes of objects whose class labels are not known, based on a set of training data.
Training data is data whose class labels are known.
The following are the different forms of classification:
Classification Rules
Decision trees
Mathematical formula
Neural network
21.
Classification methods
Goal: predict class Ci = f(x1, x2, ..., xn)
Regression: (linear or any other polynomial)
a*x1 + b*x2 + c = Ci.
Decision tree classifier: divide decision space into piecewise constant regions.
Neural networks: partition by non-linear boundaries
22.
Decision trees
A tree in which internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
[Figure: example decision tree with internal nodes Salary < 1 M, Prof = teacher, and Age < 30, and leaf labels Good and Bad.]
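As a rough sketch, a decision tree can be represented as nested nodes and applied by walking from the root to a leaf. The tests below reuse the ones named in the figure, but how they are arranged and which leaf each branch reaches are assumptions, since the original layout is not recoverable; "1 M" is read here as 1,000,000, and the field names are hypothetical.

```python
# A node is either a class label (a leaf, stored as a string) or a tuple
# (test, yes_branch, no_branch). The arrangement below is an assumption.
tree = (
    lambda r: r["salary"] < 1_000_000,
    (lambda r: r["prof"] == "teacher", "Good", "Bad"),
    (lambda r: r["age"] < 30, "Bad", "Good"),
)


def classify(node, record):
    """Follow decision rules from the root until a leaf label is reached."""
    while not isinstance(node, str):
        test, yes_branch, no_branch = node
        node = yes_branch if test(record) else no_branch
    return node


applicant = {"salary": 500_000, "prof": "teacher", "age": 40}
print(classify(tree, applicant))  # Good
```

Each internal node tests a single attribute, which is why the resulting decision regions are piecewise constant.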
23.
Pros and Cons of decision trees
Cons
Cannot handle complicated relationships between features
Simple decision boundaries
Problems with lots of missing data
Pros
Reasonable training time
Fast application
Easy to interpret
Easy to implement
Can handle large number of features
24.
Neural network
Set of nodes connected by directed weighted edges
[Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3; a more typical NN with inputs x1, x2, x3, hidden nodes, and output nodes.]
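A basic unit of this kind can be sketched as a weighted sum of its inputs followed by an activation function. The sigmoid activation, the bias term, and the sample weights are assumptions made for illustration; the slides only show inputs x1, x2, x3 and weights w1, w2, w3.

```python
import math


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through a sigmoid activation (assumed)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs and weights: the weighted sum is 0, so the output is 0.5.
print(neuron([1.0, 0.0, 1.0], [0.5, -0.3, -0.5]))  # 0.5
```

A more typical network feeds the outputs of several such units (the hidden nodes) into further units (the output nodes).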
25.
Pros and Cons of Neural Network
Cons
Slow training time
Hard to interpret
Hard to implement: trial and error for choosing number of nodes
Pros
Can learn more complicated class boundaries
Fast application
Can handle large number of features
Conclusion: use neural nets only if decision trees/NN fail.
26.
Part 4: Clustering
Partitioning clustering algorithm
Hierarchical clustering algorithm
27.
Clustering
Unsupervised learning, used when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on the time series of their payment history, so that similar customers fall in the same cluster.
Key requirement: Need a good measure of similarity between instances.
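One possible similarity measure, shown here only as an example, is based on the Euclidean distance between two payment histories of equal length. The choice of Euclidean distance and the mapping from distance to a similarity score are assumptions; the slides do not prescribe a particular measure.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Map distance into (0, 1]; identical series score 1.0."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(similarity([1, 2, 3], [1, 2, 3]))    # 1.0
```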
28.
clustering
29.
Similarity
30.
Prevalent vs. Interesting
Analysts already know about prevalent rules
Interesting rules are those that deviate from prior expectation
Mining’s payoff is in finding surprising phenomena
[Figure: in 1995, "Milk and cereal sell together!" is news; by 1998 the same finding draws "Zzzz...".]
31.
Clustering Algorithm
Partition clustering Algorithm
Hierarchical clustering algorithm
32.
Partition clustering algorithm
A partition clustering algorithm divides the records into a flat set of k clusters, rather than a tree of clusters.
The number of clusters, k, is given by the user.
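The slides do not name a specific partitioning algorithm. As one common example, the sketch below uses k-means: it assigns each point to its nearest centroid, recomputes the centroids, and repeats. The initialisation from the first k points, the fixed iteration count, and the sample points are assumptions chosen to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Partition `points` (tuples of numbers) into k flat clusters."""
    centroids = [points[i] for i in range(k)]  # simple deterministic start
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign the point to the nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep the old
        # centroid if a cluster ended up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]
clusters, centroids = kmeans(points, k=2)
print(clusters)
```

With these points the two groups near (1, 1) and (8, 8) end up in separate clusters after a couple of iterations.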
33.
Hierarchical clustering algorithm
Hierarchical clustering algorithm generates a tree of clusters.
That is, in the first step each cluster consists of a single record.
In the second step, two clusters are merged.
In the final step there is a single cluster containing all the records.
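The bottom-up process can be sketched as follows. The merge criterion is an assumption: this version merges the two clusters whose closest members are nearest to each other (single linkage) and works on one-dimensional values to stay short. The sample values are hypothetical.

```python
def agglomerative(values):
    """Merge 1-D values bottom-up; return the partition after every step."""
    clusters = [[v] for v in values]  # first step: one record per cluster
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        history.append([list(c) for c in clusters])
    return history


history = agglomerative([1, 2, 9, 10, 25])
for step in history:
    print(step)
```

The recorded history is the tree of clusters: reading it from first to last shows which clusters were merged at each level.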
34.
Part 6: Approaches to data mining problems
Discovery of sequential patterns
Discovery of patterns in time series
Discovery of classification rules
Regression
35.
Discovery of sequential patterns
Suppose a customer visits the shop three times and purchases the following sequence of item sets:
{ milk, bread, juice } { bread, eggs } { cookies, milk, coffee }
The problem of discovering sequential patterns is to find all subsequences, from the given set of sequences, that have a user-defined minimum support.
Trans_id   Time   Item_Purchased
101        6.35   Milk, bread, juice
792        7.38   Milk, juice
1130       8.05   Milk, eggs
1735       8.40   Bread, cookies, coffee
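A minimal sketch of the support check for one candidate pattern is shown below; it does not enumerate all frequent subsequences. The first sequence is the customer's three visits from the example above, while the other two customers and the function names are hypothetical.

```python
def contains_subsequence(sequence, pattern):
    """True if each itemset in `pattern` is a subset of some itemset in
    `sequence`, with the matches occurring in the same order."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def sequence_support(sequences, pattern):
    """Fraction of customer sequences that contain `pattern`."""
    return sum(contains_subsequence(s, pattern) for s in sequences) / len(sequences)


sequences = [
    [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"cookies", "milk", "coffee"}],
    [{"milk"}, {"eggs"}],              # hypothetical customer
    [{"bread", "juice"}, {"coffee"}],  # hypothetical customer
]

pattern = [{"milk"}, {"eggs"}]  # milk on one visit, eggs on a later visit
print(sequence_support(sequences, pattern))
```

A pattern would be reported only if its support meets the user-defined minimum.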
36.
Discovery of patterns in time series
Time series are sequences of events, each event being a fixed type of transaction.
Examples of patterns of interest in a stock's closing prices:
A period during which the stock rose steadily for n days.
The longest period over which the stock changed by no more than 1% over the previous closing price.
The quarter of a year during which the stock had the largest percentage gain or loss.
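One of these patterns, the longest period in which the price changes by no more than 1% from the previous close, can be sketched as a single pass over the closing prices. The function name and the sample prices are hypothetical, and prices are assumed to be positive.

```python
def longest_stable_run(closes, max_change=0.01):
    """Length in days of the longest run in which each close moves by at most
    `max_change` (as a fraction) relative to the previous close."""
    best = current = 1 if closes else 0
    for prev, cur in zip(closes, closes[1:]):
        if abs(cur - prev) / prev <= max_change:
            current += 1
        else:
            current = 1  # the run is broken; start counting again from this day
        best = max(best, current)
    return best


# Hypothetical closing prices, invented for illustration.
closes = [100, 100.5, 101, 105, 105.2, 105.1, 105.0, 110]
print(longest_stable_run(closes))  # 4
```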
37.
Discovery of classification rules
Classification is the process of defining a function that assigns a given object to one of many possible classes.
38.
Example
A bank wishes to classify its loan applicants into two groups or classes.
A group who are loan worthy (eligible)
Another group who are not loan worthy (not eligible)
To do the above classification, the bank can use the classification rule given below:
If monthly income is greater than 30,000 then
they are worthy
Else they are not worthy
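The rule above translates directly into code. This is a small sketch: the function name and the sample incomes are hypothetical, and only the 30,000 threshold comes from the rule.

```python
INCOME_THRESHOLD = 30_000  # threshold taken from the bank's rule above


def classify_applicant(monthly_income):
    """Apply the single-attribute classification rule to one applicant."""
    return "worthy" if monthly_income > INCOME_THRESHOLD else "not worthy"


# Hypothetical applicants' monthly incomes.
for income in (45_000, 30_000, 12_000):
    print(income, classify_applicant(income))
```

Because the rule says "greater than", an income of exactly 30,000 is classified as not worthy.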
39.
Regression
Regression defines a function over a set of variables that yields the value of a target variable.
40.
Example
Labtest(Patient id, test1, test2, …, testn)
This contains the values of n tests for one patient.
The target variable that we wish to predict is p, the probability of survival of the patient.