 Random forest is a classifier
 An ensemble classifier using many decision tree models.
 Can be used for classification and regression
 Accuracy and variable importance information is provided with the result
 A random forest is a collection of unpruned CART-like trees following specific
rules for
 Tree growing
 Tree combination
 Self-testing
 Post-processing
 Trees are grown using binary partitioning
 Similar to a decision tree, with a few differences
 For each split point, the search is not over all variables but only over a random subset of the variables
 No pruning is necessary; trees can be grown until each node contains only a few observations
 Advantages over decision tree
 Better prediction (in general)
 No parameter tuning necessary with RF
 Terminology
 Training size (N)
 Total number of attributes (M)
 Number of attributes used (m)
 Total number of trees (n)
 A random seed is chosen, which pulls out a random collection of samples from the training dataset while maintaining the class distribution
 With this selected dataset, a random set of attributes from the original dataset is chosen based on user-defined values. Not all input variables are considered, because using them all would require enormous computation and carry a high risk of overfitting
 Where M is the total number of input attributes in the dataset, only m attributes are chosen at random for each tree, where m < M
 The attribute in this set that creates the best possible split, measured by the Gini index, is used to grow the decision tree. This process repeats for each branch until the termination condition is met: the leaves are nodes that are too small to split (see the sketch below).
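A minimal sketch of this growing procedure, assuming scikit-learn (not code from the slides); n_estimators, max_features, criterion and random_state play the roles of n, m, the Gini index and the random seed described above, and the synthetic dataset is for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# N = 500 training samples, M = 20 total attributes
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,     # n: total number of trees
    max_features=4,       # m: attributes searched at each split, m < M
    criterion="gini",     # best split chosen by the Gini index
    min_samples_leaf=1,   # grow until leaves are very small; no pruning
    random_state=42,      # the random seed mentioned above
)
rf.fit(X, y)              # each tree is grown on a bootstrap sample of the training data
```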
 Information from a random forest (illustrated in the sketch after this list)
 Classification accuracy
 Variable importance
 Outliers (Classification)
 Missing Data Estimation
 Error Rates for Random Forest Object
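Continuing the same assumed scikit-learn sketch (same X and y as above), the fitted forest reports the self-test (out-of-bag) accuracy and the variable importances directly:

```python
# Out-of-bag samples act as the forest's built-in self-test.
rf = RandomForestClassifier(n_estimators=100, max_features=4,
                            oob_score=True, random_state=42)
rf.fit(X, y)
print("OOB classification accuracy:", rf.oob_score_)      # error rate = 1 - oob_score_
print("Variable importances:", rf.feature_importances_)   # one score per attribute
```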
 Advantages
 No need for pruning trees
 Accuracy and variable importance generated automatically
 Overfitting is not a problem
 Not very sensitive to outliers in training data
 Easy to set parameters
 Limitations
 Regression cannot predict beyond the range of the target values in the training data (see the sketch after this list)
 Extreme values are not predicted accurately
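A small illustration of the regression limitation, again assuming scikit-learn; the one-dimensional training data below is made up for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_train = 3.0 * X_train.ravel()                    # training targets span 0 to about 30

reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
print(reg.predict([[20.0]]))  # true value is 60, but the forest predicts roughly 30
```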
 Applications
 Classification
 Land cover classification
 Cloud screening
 Regression
 Continuous field mapping
 Biomass mapping
 Efficient use of Multi-Core Technology
 Although it is OS dependent, the use of Hadoop ensures efficient use of multiple cores
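The slides refer to a Hadoop-based setup; purely as a single-machine illustration of the same multi-core idea (an assumption, not the approach described above), scikit-learn parallelizes tree construction across cores with n_jobs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_big, y_big = make_classification(n_samples=5000, n_features=20, random_state=0)
# The trees are independent, so building them parallelizes trivially; n_jobs=-1 uses all cores.
RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0).fit(X_big, y_big)
```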
 Winnow is a machine learning technique for learning a linear classifier from labelled examples
 Similar to the perceptron algorithm
 While the perceptron algorithm uses an additive weight-update scheme, Winnow uses a multiplicative weight-update scheme
 Performs well when many of the features given to the learner turn out to be irrelevant
 During training, it is shown a sequence of positive and negative examples. From these it learns a decision hyperplane which can be used to classify novel examples as positive or negative
 Uses a linear threshold function (like the perceptron training algorithm) as its hypothesis and performs incremental updates to this hypothesis
 Initialize the weights w1, …, wn to 1
 Both Winnow and the perceptron algorithm use the same classification scheme
 The Winnow algorithm differs from the perceptron algorithm in its updating scheme:
 When misclassifying a positive training example x (i.e. the prediction was negative because w·x was too small), the weights of the features active in x are increased
 When misclassifying a negative training example x (i.e. the prediction was positive because w·x was too large), the weights of the features active in x are decreased
SPAM Example – each email is represented as a Boolean vector indicating which phrases appear and which don't
An email is SPAM if at least one of the phrases in S is present
 Initialize the weights w1, …, wn = 1 on the n variables (a code sketch of the procedure follows this list)
 Given an example x = (x1, …, xn), output 1 if w1x1 + … + wnxn ≥ n
 Else output 0
 If the algorithm makes a mistake:
 On a positive example – if it predicts 0 when f(x) = 1, then for each xi equal to 1, double the value of wi
 On a negative example – if it predicts 1 when f(x) = 0, then for each xi equal to 1, cut the value of wi in half
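A minimal sketch of the procedure above in plain Python; the helper name winnow_train and the toy examples are made up for illustration:

```python
def winnow_train(examples, n):
    """examples: list of (x, label) pairs, where x is a list of n values in {0, 1}."""
    w = [1.0] * n                                        # initialize w1..wn to 1
    for x, label in examples:
        prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
        if prediction == 0 and label == 1:               # missed a positive example
            w = [wi * 2.0 if xi == 1 else wi for wi, xi in zip(w, x)]
        elif prediction == 1 and label == 0:             # false positive
            w = [wi / 2.0 if xi == 1 else wi for wi, xi in zip(w, x)]
    return w

# Toy SPAM-style run with n = 3 phrases; the target concept is "phrase 0 is present".
examples = [([1, 0, 1], 1), ([0, 1, 0], 0), ([1, 1, 0], 1), ([0, 0, 1], 0)]
print(winnow_train(examples, n=3))
```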
 The principle of maximum entropy states that, subject to precisely stated prior
data, the probability distribution which best represents the current state of
knowledge is the one with the largest entropy.
 Commonly used in Natural Language Processing, speech processing and Information Retrieval
 What is a maximum entropy classifier?
 A probabilistic classifier which belongs to the class of exponential models
 Does not assume that the features are conditionally independent of each other
 Based on the principle of maximum entropy: of all the models that fit our training data, it selects the one which has the largest entropy
 A piece of information is testable if it can be determined whether a given
distribution is consistent with it
 "The expectation of the variable x is 2.87"
 and "p2 + p3 > 0.6"
 are statements of testable information
 The maximum entropy procedure consists of seeking the probability distribution which maximizes the information entropy, subject to the constraints of the information.
 In the simplest case, entropy maximization takes place under a single constraint: the sum of the probabilities must be one (which yields the uniform distribution)
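A numerical sketch of this procedure, assuming scipy; the support {1, 2, 3, 4} for x is an illustrative assumption, and the two constraints are the ones mentioned above (probabilities sum to one, E[x] = 2.87):

```python
import numpy as np
from scipy.optimize import minimize

values = np.array([1.0, 2.0, 3.0, 4.0])          # assumed support of the variable x

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)                    # guard against log(0)
    return np.sum(p * np.log(p))                  # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},           # probabilities sum to one
    {"type": "eq", "fun": lambda p: np.dot(p, values) - 2.87},  # testable information: E[x] = 2.87
]
res = minimize(neg_entropy, x0=np.full(4, 0.25), bounds=[(0.0, 1.0)] * 4,
               constraints=constraints, method="SLSQP")
print(res.x)  # the maximum entropy distribution consistent with the constraints
```

With only the sum-to-one constraint, the same optimization returns the uniform distribution.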
 When to use maximum entropy?
 Since it makes minimal assumptions, we use it when we don't know anything about the prior distribution
 Used when we cannot assume conditional independence of the features
 The principle of maximum entropy is commonly applied in two ways to inferential
problems
 Prior Probabilities: it is often used to obtain a prior probability distribution for Bayesian inference
 Maximum Entropy Models: used in model specifications which are widely applied in natural language processing, e.g. logistic regression
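A minimal sketch of a maximum entropy classifier in its logistic regression form, assuming scikit-learn; the toy documents and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["win a free prize now", "meeting agenda for monday",
        "free credit offer", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words features may overlap and be correlated; the exponential model does not
# require them to be conditionally independent.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(docs, labels)
print(clf.predict(["free prize offer"]))
```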
