DATA CLASSIFICATION


         Priyabrata Satapathy
         SIC NO.- MCS12121
         Roll No-05



Project Planning
   Introduction to Data Mining.
   Data Warehouse.
   Data Preprocessing.
   Data Cleaning.
   Classification Techniques.
   Problem Identification.

Contents
1.   Introduction to Data Classification.
2.   Applications of Data Classification.
3.   Steps of Data Classification.
4.   Decision Tree Induction.
5.   Attributes of Decision Tree.
6.   Algorithm for Decision Tree.
7.   Example of Decision Tree.
8.   Future Work & References.

Data Classification
Definition:-
  Data Classification is a form of data analysis that
  extracts models describing important data classes.
  Such models are called classifiers.

   Classification is a two-step process:
   1. Learning step: a classifier is built from a training set.
   2. Classification step: the classifier predicts class labels for new data.
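
A minimal sketch of the two steps in Python, assuming scikit-learn is available; the integer encoding of the categorical attributes and the tiny training set are purely illustrative:

from sklearn.tree import DecisionTreeClassifier

# Illustrative training tuples: (age, income, student, credit_rating)
# encoded as integers; labels are the buys_computer class.
X = [[0, 2, 0, 0],   # <=30, high, no, fair
     [0, 2, 0, 1],   # <=30, high, no, excellent
     [1, 2, 0, 0],   # 31...40, high, no, fair
     [2, 1, 0, 0]]   # >40, medium, no, fair
y = ["no", "no", "yes", "yes"]

clf = DecisionTreeClassifier(criterion="entropy")  # learning step: build the classifier
clf.fit(X, y)
print(clf.predict([[0, 1, 1, 0]]))                 # classification step: predict a label for new data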




Applications

Data Classification has numerous applications.
It is used for:
1. Fraud detection.
2. Target marketing.
3. Performance prediction.
4. Manufacturing.
5. Medical diagnosis.




Decision Tree Induction
Decision tree induction is the learning of decision trees
from class-labeled training tuples.
 A decision tree is a flowchart-like tree structure.
 Each internal node denotes a test on an attribute.
 Each branch represents an outcome of the test.
 Each leaf node holds a class label.
 The topmost node in the tree is the root node.
The induction algorithm takes three parameters:
1. D (data partition)
2. attribute-list
3. attribute-selection-method



Decision Tree Induction
D (data partition):
The complete set of training tuples and their associated class labels.
attribute-list:
A list of the attributes describing the tuples.
attribute-selection-method:
A procedure that determines the splitting criterion that "best"
partitions the data tuples into individual classes. This criterion
consists of a splitting attribute and a splitting point.
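
One possible (purely illustrative) in-memory representation of these three parameters in Python; the dictionary layout and the placeholder selection method are assumptions, not part of the original algorithm:

# D: training tuples paired with their class labels
D = [({"age": "<=30", "income": "high", "student": "no", "credit_rating": "fair"}, "no"),
     ({"age": ">40", "income": "low", "student": "yes", "credit_rating": "fair"}, "yes")]

# attribute-list: the attributes describing the tuples
attribute_list = ["age", "income", "student", "credit_rating"]

# attribute-selection-method: a procedure returning the "best" splitting attribute;
# a real implementation would rank attributes by information gain (see the formulae
# later in the deck). Here it only returns a placeholder choice.
def attribute_selection_method(D, attribute_list):
    return attribute_list[0]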




Algorithm for Decision Tree Induction
   Basic algorithm (a greedy algorithm)
     Tree is constructed in a top-down recursive
     divide-and-conquer manner.
     At start, all the training examples are at the root.
     Examples are partitioned recursively based on
     selected attributes.
     Test attributes are selected on the basis of a
     heuristic or statistical measure (e.g., information
     gain).




Algorithm for Decision Tree Induction

     Conditions for stopping partitioning (a compact sketch of the whole procedure follows this list):
      All samples for a given node belong to the same class.
      There are no remaining attributes for further partitioning; in that case,
       majority voting is used to label the node.
      There are no samples left.
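
A compact sketch of the top-down recursive procedure described on the last two slides, assuming the (D, attribute-list, attribute-selection-method) representation shown earlier; the dictionary-based node structure and the majority_class helper are assumptions:

from collections import Counter

def majority_class(D):
    # Most common class label among the tuples in D (used for majority voting).
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    labels = [label for _, label in D]
    # Stopping condition 1: all tuples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition 2: no attributes remain -> label by majority voting.
    if not attribute_list:
        return majority_class(D)
    # Greedy step: pick the splitting attribute and partition D on its values.
    A = attribute_selection_method(D, attribute_list)
    remaining = [a for a in attribute_list if a != A]
    tree = {A: {}}
    for value in set(t[A] for t, _ in D):
        Dj = [(t, label) for t, label in D if t[A] == value]
        # (An empty partition cannot occur here, since we only iterate over
        #  attribute values actually present in D.)
        tree[A][value] = generate_decision_tree(Dj, remaining, attribute_selection_method)
    return tree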




Formulae for Decision Tree Induction
The following formulae are implemented in the decision tree algorithm.
 The expected information needed to classify a tuple in D is given by

      Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Here p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i.

 The information we would still need after partitioning D on attribute A
   (into v partitions) to arrive at an exact classification is

      Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Formulae for Decision Tree Induction
The following formulae are implemented in the decision tree algorithm.
  Information gain is defined as the difference between the original
   information requirement and the new requirement after partitioning:

      Gain(A) = Info(D) - Info_A(D)

  In other words, Gain(A) tells us how much would be gained by branching on A.
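
A direct transcription of the three formulae into Python, assuming tuples are represented as (attribute-dictionary, label) pairs as in the earlier sketches; the function names are my own:

from math import log2

def info(D):
    # Info(D) = -sum_i p_i * log2(p_i): expected information needed to classify a tuple in D.
    labels = [label for _, label in D]
    probs = [labels.count(c) / len(labels) for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def info_after_split(D, A):
    # Info_A(D) = sum_j |Dj|/|D| * Info(Dj): information still needed after partitioning D on A.
    total = 0.0
    for v in set(t[A] for t, _ in D):
        Dj = [(t, label) for t, label in D if t[A] == v]
        total += len(Dj) / len(D) * info(Dj)
    return total

def gain(D, A):
    # Gain(A) = Info(D) - Info_A(D): the information gained by branching on A.
    return info(D) - info_after_split(D, A)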




Decision Tree Induction: Training Dataset

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no


Attribute Selection: Information Gain

     Class P: buys_computer = “yes”
     Class N: buys_computer = “no”
Info(D) = I(9,5) = -\frac{9}{14}\log_2(\frac{9}{14}) - \frac{5}{14}\log_2(\frac{5}{14}) = 0.940

Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971
Attribute Selection: Information Gain

   age            pi       ni        I(pi, ni)
  <=30            2        3        0.971
  31…40           4        0        0
  >40             3        2        0.971


The term \frac{5}{14} I(2,3) means that "age <=30" covers 5 of the 14
samples, with 2 yes's and 3 no's.

Hence

Gain(age) = Info(D) - Info_{age}(D) = 0.940 - 0.694 = 0.246



Decision Tree Construction
Similarly:
         Gain(income) = 0.029
         Gain(student) = 0.151
         Gain(credit_rating) = 0.048

The attribute with the highest information gain is taken as the splitting
attribute, so age is chosen (a numerical check follows below).
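
This choice can be checked numerically; the following self-contained Python sketch recomputes the four gains from the training table above (the exact values differ in the third decimal place from the slide figures, which round intermediate results):

from math import log2

ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
ATTRIBUTES = ["age", "income", "student", "credit_rating"]

def entropy(rows):
    # Info of a set of rows, where the last column is the class label.
    labels = [r[-1] for r in rows]
    return -sum((labels.count(c) / len(rows)) * log2(labels.count(c) / len(rows))
                for c in set(labels))

def gain(rows, col):
    # Gain of splitting on column `col`: Info(D) - Info_col(D).
    parts = {}
    for r in rows:
        parts.setdefault(r[col], []).append(r)
    after = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy(rows) - after

for col, name in enumerate(ATTRIBUTES):
    print(name, round(gain(ROWS, col), 3))
# age comes out highest (~0.247 here vs. 0.246 on the slide), so age is the splitting attribute.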




A Decision Tree for “buys_computer”

age?
├─ <=30  → student?
│          ├─ no  → no
│          └─ yes → yes
├─ 31…40 → yes
└─ >40   → credit_rating?
           ├─ excellent → no
           └─ fair      → yes
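
The same tree written out as plain if/else rules in Python (the income attribute is never tested, matching the diagram); the function name is my own:

def buys_computer(age, student, credit_rating):
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    elif age == "31…40":
        return "yes"
    else:  # >40
        return "no" if credit_rating == "excellent" else "yes"

print(buys_computer("<=30", "yes", "fair"))  # -> yes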



Future Works
Implementation of other classification techniques.
 Problem identification.




References
 Christopher J. C. Burges. 1998. A Tutorial on Support Vector Machines for
   Pattern Recognition.
 S. T. Dumais. 1998. Using SVMs for text categorization. IEEE Intelligent
   Systems, 13(4).
 S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. 1998. Inductive
   learning algorithms and representations for text categorization.
   CIKM '98, pp. 148-155.
 Yiming Yang, Xin Liu. 1999. A re-examination of text categorization
   methods. 22nd Annual International SIGIR.
 Tong Zhang, Frank J. Oles. 2001. Text Categorization Based on Regularized
   Linear Classification Methods. Information Retrieval 4(1): 5-31.
Thank you




