Text mining: Predictive models for text
Document Classification Document classification (categorization) is a core problem in text mining. The task is to assign a document to one or more categories based on its contents.
Classification Techniques: nearest neighbor, decision rules, probabilistic models, linear models
K-nearest neighbor algorithm Given a distance metric, assign a test document the same class as its nearest neighbor in the training set. All training data is used during operation, and the method extends naturally to a multi-class decision framework.
K-nearest neighbor algorithm The simple algorithm is slow: for each training example xi, if dist(x, xi) < min, set nearest = xi and min = dist(x, xi). Data structures (e.g., search trees) can speed up the search.
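The brute-force search on this slide can be sketched as follows. This is a minimal illustration, assuming documents are already represented as numeric feature vectors; the function names, the Euclidean metric, and the toy labels are assumptions, not from the slides.

```python
import math

def euclidean(x, y):
    """Distance metric: straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_neighbor(x, training_data):
    """Scan every training example and return the label of the closest one."""
    min_dist = float("inf")
    nearest_label = None
    for xi, label in training_data:
        d = euclidean(x, xi)
        if d < min_dist:          # the slide's "if dist(x, xi) < min" test
            min_dist = d
            nearest_label = label
    return nearest_label

# Toy training set: (feature vector, class label) pairs.
train = [([0.0, 0.0], "spam"), ([5.0, 5.0], "ham"), ([0.5, 1.0], "spam")]
print(nearest_neighbor([1.0, 1.0], train))  # closest point is [0.5, 1.0] -> spam
```

The loop visits all training data on every query, which is why the slide calls the simple algorithm slow; k-d trees or similar index structures reduce the search cost.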
Advantages KNN is well suited to multi-modal classes because its classification decision is based on a small neighborhood of similar objects (i.e., the majority class among them). So even if the target class is multi-modal (i.e., consists of objects whose independent variables have different characteristics for different subsets), it can still achieve good accuracy.
Drawbacks A major drawback of the similarity measure used in KNN is that it weights all features equally when computing similarities. This can lead to poor similarity measures and classification errors when only a small subset of the features is useful for classification.
Decision trees Decision trees are popular for pattern recognition because the models they produce are easy to understand. A tree consists of a root node, internal nodes, branches (decision points), and leaves (terminal nodes) that assign class labels.
Binary decision trees Each inequality used to split the input space is based on a single input variable, so each node draws a boundary that can be geometrically interpreted as a hyperplane perpendicular to that variable's axis.
Linear decision trees Linear decision trees are similar to binary decision trees, except that the inequality computed at each node takes an arbitrary linear form that may depend on multiple variables.
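The contrast between the two node tests can be sketched as below. The thresholds, weights, and function names are invented for illustration only.

```python
def binary_node_test(x, feature_index=0, threshold=2.5):
    """Axis-parallel split: compares ONE input variable to a threshold,
    i.e., a hyperplane perpendicular to that variable's axis."""
    return x[feature_index] < threshold

def linear_node_test(x, weights=(0.4, -1.2), bias=0.3):
    """Oblique split: an arbitrary linear form over MULTIPLE variables."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias < 0

point = [1.0, 3.0]
print(binary_node_test(point))  # True: x[0] = 1.0 < 2.5
print(linear_node_test(point))  # 0.4*1.0 - 1.2*3.0 + 0.3 = -2.9 < 0 -> True
```

The binary test inspects one coordinate at a time, while the linear test combines several; this is exactly the geometric difference the two slides describe.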
Probabilistic model: Naive Bayes classifier A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Example: a fruit may be considered an apple if it is red, round, and about 4" in diameter. Even though these features may depend on one another, a naive Bayes classifier treats all of them as contributing independently to the probability that the fruit is an apple.
Parameter estimation All model parameters can be approximated with relative frequencies from the training set. If a given class and feature value never occur together in the training set, the frequency-based probability estimate will be zero. This is problematic because it wipes out the information in all the other probabilities when they are multiplied together. It is therefore desirable to incorporate a small-sample correction into all probability estimates so that no probability is ever exactly zero.
Constructing a classifier from the probability model The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable. This is known as the maximum a posteriori (MAP) decision rule. The corresponding classifier is the function classify defined as follows: classify(f1, f2, …, fn) = argmax_c p(C = c) ∏i p(Fi = fi | C = c)
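The MAP rule above can be sketched directly. The class priors and conditional probabilities below are invented toy numbers; log-probabilities are an assumption added to avoid numeric underflow, not something the slide specifies.

```python
import math

priors = {"spam": 0.4, "ham": 0.6}
# cond[c][i][v] = P(F_i = v | C = c) for two binary features.
cond = {
    "spam": [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],
    "ham":  [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}],
}

def classify(features):
    """argmax over classes of log P(C=c) + sum_i log P(F_i = f_i | C=c),
    which is the same argmax as the product form on the slide."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior) + sum(
            math.log(cond[c][i][v]) for i, v in enumerate(features))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify([1, 0]))  # spam: 0.4*0.8*0.7 = 0.224 vs ham: 0.6*0.1*0.6... -> spam
```

Summing logs instead of multiplying raw probabilities matters in practice: with thousands of word features the product would underflow to zero in floating point.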
Performance Evaluation Given n test documents and m classes under consideration, a classifier makes n × m binary decisions. A two-by-two contingency table can be computed for each class:

             truly yes   truly no
system yes       a           b
system no        c           d
Performance Evaluation Recall = a/(a+c), where a + c > 0 (otherwise undefined): did we find all of the documents that belonged in the class? Precision = a/(a+b), where a + b > 0 (otherwise undefined): of the times we predicted “in class”, how often were we correct?
Performance Evaluation Accuracy = (a + d)/n. When one class is overwhelmingly in the majority, accuracy may not paint an accurate picture. Other measures: miss, false alarm (fallout), error, F-measure, break-even point, ...
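The measures above can be computed from a single per-class contingency table; the sketch below uses the slides' a/b/c/d cells with invented counts, and the F-measure shown is the standard F1 (an assumption, since the slide only names it).

```python
def evaluate(a, b, c, d):
    """Evaluation measures from one class's 2x2 contingency table:
    a = true positives, b = false positives, c = misses, d = true negatives."""
    n = a + b + c + d
    recall = a / (a + c) if a + c > 0 else None      # found all in class?
    precision = a / (a + b) if a + b > 0 else None   # predictions correct?
    accuracy = (a + d) / n
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision and recall else None)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "F": f_measure}

# 8 true positives, 2 false positives, 4 misses, 86 true negatives.
print(evaluate(8, 2, 4, 86))
# Accuracy = 94/100 = 0.94 looks high, but recall is only 8/12 = 0.667:
# exactly the majority-class pitfall the slide warns about.
```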
Applications Web pages organized into category hierarchies; journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE); Census Bureau responses coded by occupation; patents archived using the International Patent Classification; patient records coded using international insurance categories; e-mail message filtering; news events tracked and filtered by topic
Conclusion In this presentation we covered document classification, classification techniques, performance evaluation, and applications.