1. DATA CLASSIFICATION
Priyabrata Satapathy
SIC No.: MCS12121
Roll No.: 05
2. Project Planning
Introduction to Data Mining.
Data Warehouse.
Data Preprocessing.
Data Cleaning.
Classification Techniques.
Problem Identification.
3. Contents
1. Introduction to Data Classification.
2. Applications of Data Classification.
3. Steps of Data Classification.
4. Decision Tree Induction.
5. Attributes of Decision Tree.
6. Algorithm for Decision Tree.
7. Example of Decision Tree.
8. Future Work & References.
4. Data Classification
Definition:
Data classification is a form of data analysis that
extracts models describing important data
classes. Such models are called classifiers.
Classification is a two-step process:
1. Learning step: a classifier is built from a set of training data.
2. Classification step: the model is used to predict class labels for new data.
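A minimal sketch of these two steps using scikit-learn (the toy data and the model choice here are illustrative assumptions, not from the slides):

    # Learning step: build a classifier from class-labeled training tuples.
    from sklearn.tree import DecisionTreeClassifier

    X_train = [[0, 1], [1, 0], [1, 1], [0, 0]]  # attribute values (toy data)
    y_train = ["yes", "no", "yes", "no"]        # class labels
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)

    # Classification step: use the learned model to label a new tuple.
    print(clf.predict([[0, 1]]))  # -> ['yes']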
5. Applications
Data classification has numerous
applications:
1. Fraud detection.
2. Target marketing.
3. Performance prediction.
4. Manufacturing.
5. Medical diagnosis.
6. Decision Tree Induction
Decision tree induction is the learning of decision trees
from class-labeled training tuples.
A decision tree is a flowchart-like tree structure:
Each internal node denotes a test on an attribute.
Each branch represents an outcome of the test.
Each leaf node holds a class label.
The topmost node in the tree is the root node.
The induction algorithm takes three parameters:
1. D (data partition)
2. attribute_list
3. attribute_selection_method
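One hypothetical way to represent such a tree in Python (the Node class below is an illustrative sketch, not part of the slides):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Node:
        attribute: Optional[str] = None               # test attribute at an internal node
        branches: dict = field(default_factory=dict)  # outcome -> child Node
        label: Optional[str] = None                   # class label at a leaf node

    # A leaf holds a class label; an internal node holds a test and its branches.
    root = Node(attribute="student",
                branches={"no": Node(label="no"), "yes": Node(label="yes")})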
7. Decision Tree Induction
D (data partition):
The complete set of training tuples and their
associated class labels.
attribute_list:
A list of attributes describing the tuples.
attribute_selection_method:
A procedure to determine the splitting criterion
that "best" partitions the data tuples into individual
classes. This criterion consists of a splitting attribute
and, possibly, a split point.
8. Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive
divide-and-conquer manner.
At start, all the training examples are at the root.
Examples are partitioned recursively based on
selected attributes.
Test attributes are selected on the basis of a
heuristic or statistical measure (e.g., information
gain).
9. Algorithm for Decision Tree Induction
Conditions for stopping the partitioning:
All samples for a given node belong to the same
class.
There are no remaining attributes for further
partitioning (majority voting is used to label the leaf).
There are no samples left.
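A compact sketch of this greedy recursion with its stopping conditions, reusing the hypothetical Node class above (helper names such as select_attribute are assumptions; the heuristic measure, e.g. information gain, is passed in):

    from collections import Counter

    def generate_decision_tree(D, attribute_list, select_attribute):
        """D is a list of (tuple_as_dict, class_label) pairs."""
        labels = [label for _, label in D]
        # Stop: all samples at this node belong to the same class.
        if len(set(labels)) == 1:
            return Node(label=labels[0])
        # Stop: no remaining attributes -> label with the majority class.
        if not attribute_list:
            return Node(label=Counter(labels).most_common(1)[0][0])
        # Select the splitting attribute by the heuristic measure.
        A = select_attribute(D, attribute_list)
        node = Node(attribute=A)
        remaining = [a for a in attribute_list if a != A]
        # Partition recursively on the values of A that occur in D; a branch
        # with no samples left would get a majority-class leaf instead.
        for value in {t[A] for t, _ in D}:
            Dj = [(t, label) for t, label in D if t[A] == value]
            node.branches[value] = generate_decision_tree(Dj, remaining,
                                                          select_attribute)
        return node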
10. Formulae for Decision Tree Induction
The following formulae are used in the decision
tree algorithm.
The expected information needed to classify a
tuple in D is given by
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
Here $p_i$ is the nonzero probability that an arbitrary
tuple in D belongs to class $C_i$.
The information still needed after partitioning on an
attribute A, in order to arrive at an exact classification, is
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
11. Formulae for Decision Tree Induction
Information gain is defined as the difference
between the original information requirement and the
new requirement after partitioning:
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
In other words, Gain(A) tells us how much would be
gained by branching on A.
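These formulae translate directly into code; a minimal sketch (the function names are assumptions for illustration):

    from math import log2
    from collections import Counter

    def info(labels):
        """Info(D): expected information (entropy) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_after_split(D, A):
        """Info_A(D): weighted information of the partitions induced by A."""
        n = len(D)
        total = 0.0
        for v in {t[A] for t, _ in D}:
            Dj = [label for t, label in D if t[A] == v]
            total += (len(Dj) / n) * info(Dj)
        return total

    def gain(D, A):
        """Gain(A) = Info(D) - Info_A(D)."""
        return info([label for _, label in D]) - info_after_split(D, A)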
12. Decision Tree Induction: Training Dataset

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no
13. Attribute Selection: Information Gain
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
$$\mathrm{Info}(D) = I(9,5) = -\frac{9}{14}\log_2\!\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\!\left(\frac{5}{14}\right) = 0.940$$
$$\mathrm{Info}_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$
age   | pi | ni | I(pi, ni)
<=30  | 2  | 3  | 0.971
31…40 | 4  | 0  | 0
>40   | 3  | 2  | 0.971
14. Attribute Selection: Information Gain
$\frac{5}{14} I(2,3)$ means "age <=30" has 5 out of 14
samples, with 2 yes's and 3 no's.
Hence
$$\mathrm{Gain}(age) = \mathrm{Info}(D) - \mathrm{Info}_{age}(D) = 0.940 - 0.694 = 0.246$$
15. Decision Tree Construction
Similarly:
$$\mathrm{Gain}(income) = 0.029$$
$$\mathrm{Gain}(student) = 0.151$$
$$\mathrm{Gain}(credit\_rating) = 0.048$$
The attribute with the highest information gain is
chosen as the splitting attribute, so age is selected.
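Running the info-gain sketch from above on this training set reproduces these numbers (rows, attrs, and D encode the slide-12 table):

    rows = [
        ("<=30",  "high",   "no",  "fair",      "no"),
        ("<=30",  "high",   "no",  "excellent", "no"),
        ("31…40", "high",   "no",  "fair",      "yes"),
        (">40",   "medium", "no",  "fair",      "yes"),
        (">40",   "low",    "yes", "fair",      "yes"),
        (">40",   "low",    "yes", "excellent", "no"),
        ("31…40", "low",    "yes", "excellent", "yes"),
        ("<=30",  "medium", "no",  "fair",      "no"),
        ("<=30",  "low",    "yes", "fair",      "yes"),
        (">40",   "medium", "yes", "fair",      "yes"),
        ("<=30",  "medium", "yes", "excellent", "yes"),
        ("31…40", "medium", "no",  "excellent", "yes"),
        ("31…40", "high",   "yes", "fair",      "yes"),
        (">40",   "medium", "no",  "excellent", "no"),
    ]
    attrs = ["age", "income", "student", "credit_rating"]
    D = [(dict(zip(attrs, r[:4])), r[4]) for r in rows]

    for A in attrs:
        print(A, round(gain(D, A), 3))
    # age 0.246, income 0.029, student 0.151, credit_rating 0.048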
16. A Decision Tree for “buys_computer”
age?
├─ <=30 → student?
│         ├─ no → no
│         └─ yes → yes
├─ 31..40 → yes
└─ >40 → credit_rating?
          ├─ excellent → no
          └─ fair → yes
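For comparison, a library-based sketch (scikit-learn grows binary trees, so the printed structure differs in shape from the multiway tree above, but it classifies this training data the same way; the encoding is an assumption for illustration):

    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [list(r[:4]) for r in rows]   # attribute values from the table above
    y = [r[4] for r in rows]          # buys_computer labels
    X_enc = OrdinalEncoder().fit_transform(X)

    clf = DecisionTreeClassifier(criterion="entropy")  # information gain
    clf.fit(X_enc, y)
    print(export_text(clf, feature_names=attrs))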
18. References
Christopher J. C. Burges. 1998. A Tutorial on Support
Vector Machines for Pattern Recognition. Data Mining
and Knowledge Discovery 2(2): 121-167.
S. T. Dumais. 1998. Using SVMs for text categorization.
IEEE Intelligent Systems 13(4).
S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami.
1998. Inductive learning algorithms and representations
for text categorization. CIKM '98, pp. 148-155.
Yiming Yang and Xin Liu. 1999. A re-examination of text
categorization methods. SIGIR '99.
Tong Zhang and Frank J. Oles. 2001. Text Categorization
Based on Regularized Linear Classification Methods.
Information Retrieval 4(1): 5-31.