DATA CLASSIFICATION


         Priyabrata satapathy
         SIC NO.- MCS12121
         Roll No-05



         3/16/2013   Data Classification   1
Project Planning
   Introduction to Data Mining.
   Data Warehouse.
   Data Preprocessing.
   Data Cleaning.
   Classification Techniques.
   Problem Identification.

Contents
1.   Introduction to Data Classification.
2.   Applications of Data Classification.
3.   Steps of Data Classification.
4.   Decision Tree Induction.
5.   Attributes of Decision Tree.
6.   Algorithm for Decision Tree.
7.   Example of Decision Tree.
8.   Future Work & References.

Data Classification
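Definition:-
  Data classification is a form of data analysis that
  extracts models describing important data
  classes. Such models are called classifiers.

   Classification is a two-step process:-
   1. Learning step: a classifier is built from class-labeled training data.
   2. Classification step: the classifier is used to predict class labels for new data.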




Applications

Data classification has numerous applications:
1. Fraud detection.
2. Target marketing.
3. Performance prediction.
4. Manufacturing.
5. Medical diagnosis.




Decision Tree Induction
Decision tree induction is the learning of decision trees
from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure.
• Each internal node denotes a test on an attribute.
• Each branch represents an outcome of the test.
• Each leaf node holds a class label.
• The topmost node in a tree is the root node.
The induction algorithm takes three parameters:
1. D (data partition)
2. attribute-list
3. attribute-selection-method



Decision Tree Induction
D (data partition):-
Initially, it is the complete set of training tuples and their
   associated class labels.
Attribute-list:-
It is a list of attributes describing the tuples.
Attribute-selection-method:-
It is a procedure that determines the splitting criterion
that “best” partitions the data tuples into individual
classes. This criterion consists of a splitting attribute
and, where applicable, a split point.




Algorithm for Decision Tree Induction
   Basic algorithm (a greedy algorithm)
     Tree is constructed in a top-down recursive
     divide-and-conquer manner.
     At start, all the training examples are at the root.
     Examples are partitioned recursively based on
     selected attributes.
     Test attributes are selected on the basis of a
     heuristic or statistical measure (e.g., information
     gain).




Algorithm for Decision Tree Induction

   Conditions for stopping partitioning
   • All samples for a given node belong to the same class.
   • There are no remaining attributes for further partitioning.
   • There are no samples left.




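The greedy, top-down procedure and its stopping conditions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the presenter's implementation: the function names, the nested-dict tree layout, the majority-vote leaf and the `first_attribute` placeholder are all hypothetical. The three parameters mirror D, attribute-list and attribute-selection-method from the earlier slides.

```python
from collections import Counter

def majority_class(D):
    """Most common class label in partition D."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """D is a list of (tuple_dict, class_label) pairs."""
    labels = {label for _, label in D}
    # Stop 1: all samples at this node belong to the same class.
    if len(labels) == 1:
        return labels.pop()
    # Stop 2: no remaining attributes; label the leaf by majority vote (assumption).
    if not attribute_list:
        return majority_class(D)
    best = attribute_selection_method(D, attribute_list)
    remaining = [a for a in attribute_list if a != best]
    # "default" covers attribute values with no samples (stop 3): branches are
    # only created for values observed in D, so unseen values fall back to it.
    node = {"attribute": best, "branches": {}, "default": majority_class(D)}
    for value in sorted({row[best] for row, _ in D}):
        Dj = [(row, label) for row, label in D if row[best] == value]
        node["branches"][value] = generate_decision_tree(
            Dj, remaining, attribute_selection_method)
    return node

# Placeholder selection method: simply take the first attribute in the list.
first_attribute = lambda D, attrs: attrs[0]

toy = [({"student": "no"}, "no"),
       ({"student": "yes"}, "yes"),
       ({"student": "yes"}, "yes")]
tree = generate_decision_tree(toy, ["student"], first_attribute)
print(tree)
```

Passing the selection method in as a function keeps the recursion independent of the measure used, so information gain or any other heuristic can be plugged in.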
Formulae for Decision Tree Induction
The following formulae are implemented in the decision tree algorithm.
• The expected information needed to classify a tuple in D is given by

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

Here $p_i$ is the nonzero probability that an arbitrary tuple in D belongs to class $C_i$.

• How much more information would we still need, after partitioning D on attribute A into v partitions $D_1, \dots, D_v$, to arrive at an exact classification?

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$

Formulae for Decision Tree Induction
The following formulae are implemented in the decision tree algorithm.
• Information gain is defined as the difference between the original information requirement and the new requirement after partitioning on A:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$

• In other words, Gain(A) tells us how much would be gained by branching on A.
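As a rough sketch, the three formulae translate almost directly into Python when each partition is described by its per-class tuple counts. The function names `info`, `info_after_split` and `gain` are chosen here for illustration only.

```python
from math import log2

def info(counts):
    """Info(D) = -sum(p_i * log2(p_i)), from per-class tuple counts.
    Zero counts are skipped, matching the 'nonzero probability' condition."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_after_split(partitions):
    """Info_A(D): weighted average of Info(D_j) over the partitions D_j.
    `partitions` is a list of per-class count lists, one per D_j."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

def gain(counts, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(counts) - info_after_split(partitions)

# Two classes mixed evenly need 1 bit; a split that separates them
# perfectly recovers that whole bit.
print(info([4, 4]))                     # 1.0
print(gain([4, 4], [[4, 0], [0, 4]]))   # 1.0
```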




Decision Tree Induction: Training Dataset

    age     income   student  credit_rating  buys_computer
    <=30    high     no       fair           no
    <=30    high     no       excellent      no
    31…40   high     no       fair           yes
    >40     medium   no       fair           yes
    >40     low      yes      fair           yes
    >40     low      yes      excellent      no
    31…40   low      yes      excellent      yes
    <=30    medium   no       fair           no
    <=30    low      yes      fair           yes
    >40     medium   yes      fair           yes
    <=30    medium   yes      excellent      yes
    31…40   medium   no       excellent      yes
    31…40   high     yes      fair           yes
    >40     medium   no       excellent      no


Attribute Selection: Information Gain

• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”

$$\mathrm{Info}(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

$$\mathrm{Info}_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

   age      p_i   n_i   I(p_i, n_i)
   <=30     2     3     0.971
   31…40    4     0     0
   >40      3     2     0.971
Attribute Selection: Information Gain

   age      p_i   n_i   I(p_i, n_i)
   <=30     2     3     0.971
   31…40    4     0     0
   >40      3     2     0.971

The term $\frac{5}{14} I(2,3)$ means “age <=30” has 5 out of 14
samples, with 2 yes's and 3 no's.
Hence

$$\mathrm{Gain}(age) = \mathrm{Info}(D) - \mathrm{Info}_{age}(D) = 0.940 - 0.694 = 0.246$$



Decision Tree Construction
Similarly:-
         Gain(income) = 0.029
         Gain(student) = 0.151
         Gain(credit_rating) = 0.048

The attribute with the highest information gain is taken as the splitting attribute.
Gain(age) = 0.246 is the highest, so age is taken as the splitting attribute.
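The gains on these slides can be checked with a short script over the 14 training tuples. This is a sketch rather than the presenter's implementation; `ROWS`, `entropy` and `information_gain` are names chosen for illustration. Because it works with unrounded intermediate values, the third decimal may differ by about 0.001 from the slide figures, which subtract already-rounded numbers.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]
# Training tuples from the dataset slide; the last field is buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Info(D) for a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, index):
    """Gain(A) for the attribute stored at position `index` of each row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - after

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, value in gains.items():
    print(f"{name:14s} {value:.3f}")
print("splitting attribute:", best)
```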




A Decision Tree for “buys_computer”

                          age?
            /               |               \
         <=30             31…40             >40
           |                |                 |
       student?            yes         credit_rating?
        /     \                         /          \
      no      yes                 excellent        fair
       |       |                      |              |
      no      yes                    no             yes
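One way to use such a tree for the classification step is to store it as nested dictionaries and follow the branch matching each attribute value until a leaf is reached. The layout and the names `TREE` and `classify` are assumptions made for this sketch. The "no" leaf under credit_rating = excellent is taken from the training data, where both ">40, excellent" tuples have buys_computer = no.

```python
# Internal nodes are dicts; leaves are plain class-label strings.
TREE = {
    "attribute": "age",
    "branches": {
        "<=30": {"attribute": "student",
                 "branches": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40": {"attribute": "credit_rating",
                "branches": {"excellent": "no", "fair": "yes"}},
    },
}

def classify(tree, sample):
    """Walk from the root to a leaf, testing one attribute per internal node."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node

x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(TREE, x))  # yes
```

Note that income never appears in the tree, so it plays no part in the prediction.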



Future Work
• Implementation of other classification techniques.
• Problem identification.




Thank you



