1. DATA CLASSIFICATION
Priyabrata Satapathy
SIC No.: MCS12121
Roll No.: 05
2. Project Planning
Introduction to Data Mining.
Data Warehouse.
Data Preprocessing.
Data Cleaning.
Classification Techniques.
Problem Identification.
3. Contents
1. Introduction to Data Classification.
2. Applications of Data Classification.
3. Steps of Data Classification.
4. Decision Tree Induction.
5. Attributes of Decision Tree.
6. Algorithm for Decision Tree.
7. Example of Decision Tree.
8. Future Work & References.
4. Data Classification
Definition:
Data classification is a form of data analysis that
extracts models describing important data
classes. Such models are called classifiers.
Classification is a two-step process:
1. Learning step: a classifier is built from a set of training data.
2. Classification step: the model is used to predict class labels for new data.
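A minimal sketch of these two steps using scikit-learn (the toy data and the model choice here are illustrative assumptions, not from the slides):

    # Learning step: build a classifier from class-labeled training tuples.
    from sklearn.tree import DecisionTreeClassifier

    X_train = [[0, 1], [1, 0], [1, 1], [0, 0]]  # attribute values (toy data)
    y_train = ["yes", "no", "yes", "no"]        # class labels
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)

    # Classification step: use the learned model to label a new tuple.
    print(clf.predict([[0, 1]]))  # -> ['yes']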
5. Applications
Data classification has numerous
applications:
1. Fraud detection.
2. Target marketing.
3. Performance prediction.
4. Manufacturing.
5. Medical diagnosis.
6. Decision Tree Induction
Decision tree induction is the learning of decision trees
from class-labeled training tuples.
A decision tree is a flowchart-like tree structure:
Each internal node denotes a test on an attribute.
Each branch represents an outcome of the test.
Each leaf node holds a class label.
The topmost node in the tree is the root node.
The induction algorithm takes three parameters:
1. D (data partition)
2. attribute_list
3. attribute_selection_method
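One hypothetical way to represent such a tree in Python (the Node class below is an illustrative sketch, not part of the slides):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Node:
        attribute: Optional[str] = None               # test attribute at an internal node
        branches: dict = field(default_factory=dict)  # outcome -> child Node
        label: Optional[str] = None                   # class label at a leaf node

    # A leaf holds a class label; an internal node holds a test and its branches.
    root = Node(attribute="student",
                branches={"no": Node(label="no"), "yes": Node(label="yes")})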
7. Decision Tree Induction
D (data partition):
The complete set of training tuples and their
associated class labels.
attribute_list:
A list of attributes describing the tuples.
attribute_selection_method:
A procedure to determine the splitting criterion
that "best" partitions the data tuples into individual
classes. This criterion consists of a splitting attribute
and, possibly, a split point.
8. Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive
divide-and-conquer manner.
At start, all the training examples are at the root.
Examples are partitioned recursively based on
selected attributes.
Test attributes are selected on the basis of a
heuristic or statistical measure (e.g., information
gain).
9. Algorithm for Decision Tree Induction
Conditions for stopping the partitioning:
All samples for a given node belong to the same
class.
There are no remaining attributes for further
partitioning (majority voting is used to label the leaf).
There are no samples left.
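A compact sketch of this greedy recursion with its stopping conditions, reusing the hypothetical Node class above (helper names such as select_attribute are assumptions; the heuristic measure, e.g. information gain, is passed in):

    from collections import Counter

    def generate_decision_tree(D, attribute_list, select_attribute):
        """D is a list of (tuple_as_dict, class_label) pairs."""
        labels = [label for _, label in D]
        # Stop: all samples at this node belong to the same class.
        if len(set(labels)) == 1:
            return Node(label=labels[0])
        # Stop: no remaining attributes -> label with the majority class.
        if not attribute_list:
            return Node(label=Counter(labels).most_common(1)[0][0])
        # Select the splitting attribute by the heuristic measure.
        A = select_attribute(D, attribute_list)
        node = Node(attribute=A)
        remaining = [a for a in attribute_list if a != A]
        # Partition recursively on the values of A that occur in D; a branch
        # with no samples left would get a majority-class leaf instead.
        for value in {t[A] for t, _ in D}:
            Dj = [(t, label) for t, label in D if t[A] == value]
            node.branches[value] = generate_decision_tree(Dj, remaining,
                                                          select_attribute)
        return node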
10. Formulae for Decision Tree Induction
The following formulae are used in the decision
tree algorithm.
The expected information needed to classify a
tuple in D is given by
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
Here $p_i$ is the nonzero probability that an arbitrary
tuple in D belongs to class $C_i$.
The information still needed after partitioning on an
attribute A, in order to arrive at an exact classification, is
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
11. Formulae for Decision Tree Induction
Information gain is defined as the difference
between the original information requirement and the
new requirement after partitioning:
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
In other words, Gain(A) tells us how much would be
gained by branching on A.
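These formulae translate directly into code; a minimal sketch (the function names are assumptions for illustration):

    from math import log2
    from collections import Counter

    def info(labels):
        """Info(D): expected information (entropy) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_after_split(D, A):
        """Info_A(D): weighted information of the partitions induced by A."""
        n = len(D)
        total = 0.0
        for v in {t[A] for t, _ in D}:
            Dj = [label for t, label in D if t[A] == v]
            total += (len(Dj) / n) * info(Dj)
        return total

    def gain(D, A):
        """Gain(A) = Info(D) - Info_A(D)."""
        return info([label for _, label in D]) - info_after_split(D, A)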
12. Decision Tree Induction: Training Dataset

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no
13. Attribute Selection: Information Gain
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
$$\mathrm{Info}(D) = I(9,5) = -\frac{9}{14}\log_2\!\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\!\left(\frac{5}{14}\right) = 0.940$$
$$\mathrm{Info}_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$
age   | pi | ni | I(pi, ni)
<=30  | 2  | 3  | 0.971
31…40 | 4  | 0  | 0
>40   | 3  | 2  | 0.971
14. Attribute Selection: Information Gain
$\frac{5}{14} I(2,3)$ means "age <=30" has 5 out of 14
samples, with 2 yes's and 3 no's.
Hence
$$\mathrm{Gain}(age) = \mathrm{Info}(D) - \mathrm{Info}_{age}(D) = 0.940 - 0.694 = 0.246$$
15. Decision Tree Construction
Similarly:
$$\mathrm{Gain}(income) = 0.029$$
$$\mathrm{Gain}(student) = 0.151$$
$$\mathrm{Gain}(credit\_rating) = 0.048$$
The attribute with the highest information gain is
chosen as the splitting attribute, so age is selected.
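Running the info-gain sketch from above on this training set reproduces these numbers (rows, attrs, and D encode the slide-12 table):

    rows = [
        ("<=30",  "high",   "no",  "fair",      "no"),
        ("<=30",  "high",   "no",  "excellent", "no"),
        ("31…40", "high",   "no",  "fair",      "yes"),
        (">40",   "medium", "no",  "fair",      "yes"),
        (">40",   "low",    "yes", "fair",      "yes"),
        (">40",   "low",    "yes", "excellent", "no"),
        ("31…40", "low",    "yes", "excellent", "yes"),
        ("<=30",  "medium", "no",  "fair",      "no"),
        ("<=30",  "low",    "yes", "fair",      "yes"),
        (">40",   "medium", "yes", "fair",      "yes"),
        ("<=30",  "medium", "yes", "excellent", "yes"),
        ("31…40", "medium", "no",  "excellent", "yes"),
        ("31…40", "high",   "yes", "fair",      "yes"),
        (">40",   "medium", "no",  "excellent", "no"),
    ]
    attrs = ["age", "income", "student", "credit_rating"]
    D = [(dict(zip(attrs, r[:4])), r[4]) for r in rows]

    for A in attrs:
        print(A, round(gain(D, A), 3))
    # age 0.246, income 0.029, student 0.151, credit_rating 0.048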
16. A Decision Tree for “buys_computer”
age?
├─ <=30 → student?
│         ├─ no → no
│         └─ yes → yes
├─ 31..40 → yes
└─ >40 → credit_rating?
          ├─ excellent → no
          └─ fair → yes
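For comparison, a library-based sketch (scikit-learn grows binary trees, so the printed structure differs in shape from the multiway tree above, but it classifies this training data the same way; the encoding is an assumption for illustration):

    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [list(r[:4]) for r in rows]   # attribute values from the table above
    y = [r[4] for r in rows]          # buys_computer labels
    X_enc = OrdinalEncoder().fit_transform(X)

    clf = DecisionTreeClassifier(criterion="entropy")  # information gain
    clf.fit(X_enc, y)
    print(export_text(clf, feature_names=attrs))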
18. References
Christopher J. C. Burges. 1998. A Tutorial on Support
Vector Machines for Pattern Recognition. Data Mining
and Knowledge Discovery 2(2): 121-167.
S. T. Dumais. 1998. Using SVMs for text categorization.
IEEE Intelligent Systems 13(4).
S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami.
1998. Inductive learning algorithms and representations
for text categorization. CIKM '98, pp. 148-155.
Yiming Yang and Xin Liu. 1999. A re-examination of text
categorization methods. SIGIR '99.
Tong Zhang and Frank J. Oles. 2001. Text Categorization
Based on Regularized Linear Classification Methods.
Information Retrieval 4(1): 5-31.