DECISION TREE, SOFTMAX 
REGRESSION AND ENSEMBLE 
METHODS IN MACHINE LEARNING 
- Abhishek Vijayvargia
WHAT IS MACHINE LEARNING 
 Formal Approach 
 Field of study that gives computers the ability to learn without being explicitly programmed. 
 Informal Approach
MACHINE LEARNING 
 Supervised Learning 
 Supervised learning is the machine learning task of 
inferring a function from labeled training data. 
 Approximation: supervised learning can be viewed as function approximation. 
 Unsupervised Learning 
 Trying to find hidden structure in unlabeled data. 
 Examples given to the learner are unlabeled, there is no 
error or reward signal to evaluate a potential solution. 
 Shorter description: unsupervised learning can be viewed as finding a shorter description of the data. 
 Reinforcement learning 
 Learning by interacting with an environment
SUPERVISED LEARNING 
 Classification 
 Output variable takes class labels. 
 Ex. Predicting whether a mail is spam or ham. 
 Regression 
 Output variable is numeric or continuous. 
 Ex. Predicting temperature.
DECISION TREES 
 Is this restaurant good? (YES / NO)
DECISION TREES 
 What factors decide whether a restaurant is good for you or not? 
 Type: Italian, South Indian, French 
 Atmosphere: Casual, Fancy 
 How many people are inside it? (10 < people < 30) 
 Cost 
 Weather outside: Rainy, Sunny, Cloudy 
 Hungry: Yes/No
DECISION TREE 
[Example decision tree (diagram): the root tests Hungry (True/False). The True branch tests Rainy and then People > 10, ending in YES/NO leaves; the False branch tests Type (French / South Indian) and Cost (More / Less), ending in YES/NO leaves.]
DECISION TREE LEARNING 
 Pick the best attribute. 
 Make a decision tree node containing that attribute. 
 For each value of the attribute, create a descendant of the node. 
 Sort the training examples to the leaves. 
 Iterate on the subsets using the remaining attributes.
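Below is a minimal, hypothetical sketch of this greedy loop for categorical attributes (ID3-style); `info_gain` stands in for the gain function defined on the following slides, and all names are illustrative, not the speaker's code.

```python
from collections import Counter

def build_tree(examples, labels, attributes, info_gain):
    """examples: list of dicts attribute -> value; labels: class labels."""
    if len(set(labels)) == 1:                  # pure node: stop
        return labels[0]
    if not attributes:                         # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    # Pick the best attribute by information gain.
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    node = {"attribute": best, "children": {}}
    rest = [a for a in attributes if a != best]
    for value in {e[best] for e in examples}:  # one descendant per value
        idx = [i for i, e in enumerate(examples) if e[best] == value]
        node["children"][value] = build_tree(
            [examples[i] for i in idx], [labels[i] for i in idx], rest, info_gain
        )
    return node
```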
DECISION TREE : PICK BEST ATTRIBUTE 
[Diagram: three candidate splits (Graph 1, Graph 2, Graph 3), each sending examples down True/False branches. Graphs 1 and 2 leave a mix of + and - examples on both branches; Graph 3 separates them cleanly (all + on the True branch, all - on the False branch), so it is the best attribute.]
DECISION TREE : PICK BEST ATTRIBUTE 
 Select the attribute which gives MAXIMUM Information 
Gain. 
 Gain measures how well a given attribute separates 
training examples into targeted classes. 
 Entropy is a measure of the amount of uncertainty in the 
(data) set. 
H(S) = -\sum_{x \in X} p(x) \log_2 p(x) 
S: the current data set for which entropy is calculated. 
X: the set of classes in S. 
p(x): the proportion of elements in class x to the total number of elements in S.
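As a quick illustration (a sketch, not from the slides), entropy in Python:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes of p(x) * log2(p(x))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["+", "+", "-", "-"]))  # 1.0: maximally uncertain 50/50 set
print(entropy(["+", "+", "+", "-"]))  # ~0.811: mostly one class
```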
DECISION TREE : INFORMATION GAIN 
 Information gain IG(A, S) measures the difference in entropy before and after the set S is split on an attribute A. 
 In other words, how much uncertainty in S was 
reduced after splitting set S on attribute A. 
IG(A, S) = H(S) - \sum_{t \in T} p(t) H(t) 
H(S): entropy of set S. 
T: the subsets created from splitting set S on attribute A, such that S = \bigcup_{t \in T} t. 
p(t): the proportion of elements in t to the number of elements in S.
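A worked check of the formula on a made-up split (illustrative numbers, not from the slides): let S contain 6 positive and 2 negative examples, and let attribute A split S into t_1 = \{4+, 0-\} and t_2 = \{2+, 2-\}. Then

H(S) = -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811 
H(t_1) = 0, \quad H(t_2) = 1 
IG(A, S) = 0.811 - \tfrac{4}{8} \cdot 0 - \tfrac{4}{8} \cdot 1 \approx 0.311

so splitting on A removes about 0.311 bits of uncertainty.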
DECISION TREE ALGORITHM : BIAS 
 Restriction Bias: the hypothesis space is all possible decision trees. 
 Preference Bias: which trees does the algorithm prefer? 
 Good splits at the top. 
 Correct over incorrect. 
 Shorter trees.
DECISION TREE : CONTINUOUS ATTRIBUTE 
 Branch on every possible value? 
 Include only the ages seen in the training set? 
 Useless when we encounter an age not present in the training set. 
 Instead, represent the attribute as a range, e.g. a node that tests 20 <= Age < 30.
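One common way to realize such range tests (a hypothetical sketch; `gain_of_split` is any partition-scoring function, e.g. information gain):

```python
def best_threshold(values, gain_of_split):
    """Scan candidate thresholds for a continuous attribute and keep the
    one whose binary split (value < t) scores best."""
    candidates = sorted(set(values))
    best_t, best_gain = None, float("-inf")
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2                    # midpoint between adjacent values
        mask = [v < t for v in values]       # True branch: value < t
        gain = gain_of_split(mask)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```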
DECISION TREE : CONTINUOUS ATTRIBUTE 
 Does it make sense to repeat an attribute along a 
path in the tree? 
[Diagram: a small tree in which attribute A appears more than once along a path, interleaved with B. For a discrete attribute this is pointless (its value is already known further down the path), but re-testing a continuous attribute against a different threshold can make sense.]
DECISION TREE : WHEN DO WE STOP? 
 Everything classified correctly? (Noisy data may give two different labels for the same example.) 
 No more attributes? (Not a good criterion for continuous attributes, which allow infinitely many splits.) 
 Pruning.
SOFTMAX REGRESSION 
 Softmax regression (or multinomial logistic regression) is a classification method that generalizes logistic regression to multiclass problems, i.e. those with more than two possible discrete outcomes. 
 Used to predict the probabilities of the different 
possible outcomes of a categorically distributed 
dependent variable, given a set of independent 
variables (which may be real-valued, binary-valued, 
categorical-valued, etc.).
LOGISTIC REGRESSION 
 Logistic regression refers specifically to the problem in which the dependent variable is binary (only two categories). 
 Since the output variable y \in \{0, 1\}, it is natural to choose the Bernoulli family of distributions to model the conditional distribution of y given x. 
 The logistic function (which always takes values between zero and one):

F(t) = \frac{1}{1 + e^{-t}} = \frac{1}{1 + e^{-\theta^T x}} \quad \text{with } t = \theta^T x
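A tiny sketch of this function in Python (illustrative only):

```python
import math

def sigmoid(t):
    """Logistic function: maps any real t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

print(sigmoid(0.0))   # 0.5: maximal uncertainty
print(sigmoid(5.0))   # ~0.993: squashed toward 1
print(sigmoid(-5.0))  # ~0.007: squashed toward 0
```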
SOFTMAX REGRESSION 
 Used in classification problems in which the response variable y can take on any one of k values: 
 y \in \{1, 2, \ldots, k\} 
 Ex. Classify emails into three classes { Primary, 
Social, Promotions } 
 Response variable is still discrete but can take 
more than two values. 
 To derive a Generalized Linear Model for multinomial data, we begin by expressing the multinomial as an exponential-family distribution.
SOFTMAX REGRESSION 
 To parameterize a multinomial over k possible outcomes, we could use k parameters \phi_1, \ldots, \phi_k specifying the probability of each outcome. 
 These parameters are redundant, because \sum_{i=1}^{k} \phi_i = 1. So \phi_i = p(y = i; \phi), and p(y = k; \phi) = 1 - \sum_{i=1}^{k-1} \phi_i. 
 The indicator function 1\{\cdot\} takes the value 1 if its argument is true, and 0 otherwise: 
 1\{\text{True}\} = 1, \quad 1\{\text{False}\} = 0.
SOFTMAX REGRESSION 
 The multinomial is a member of the exponential family:

p(y; \phi) = \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}} 
= \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}} 
= b(y) \exp\left( \omega^T T(y) - a(\omega) \right)

where

\omega = \begin{bmatrix} \log(\phi_1/\phi_k) \\ \log(\phi_2/\phi_k) \\ \vdots \\ \log(\phi_{k-1}/\phi_k) \end{bmatrix}, \quad a(\omega) = -\log \phi_k, \quad b(y) = 1, \quad T(y) \in \mathbb{R}^{k-1}
SOFTMAX REGRESSION 
 The link function is given as

\omega_i = \log \frac{\phi_i}{\phi_k}

 To invert the link function and derive the response function:

e^{\omega_i} = \frac{\phi_i}{\phi_k} 
\phi_k e^{\omega_i} = \phi_i 
\phi_k \sum_{i=1}^{k} e^{\omega_i} = \sum_{i=1}^{k} \phi_i = 1
SOFTMAX REGRESSION 
 So we get \phi_k = 1 / \sum_{i=1}^{k} e^{\omega_i}; substituting back gives the response function

\phi_i = \frac{e^{\omega_i}}{\sum_{j=1}^{k} e^{\omega_j}}

 The conditional distribution of y given x is then

p(y = i \mid x; \theta) = \phi_i = \frac{e^{\omega_i}}{\sum_{j=1}^{k} e^{\omega_j}} = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}}

(with \omega_i = \theta_i^T x)
SOFTMAX REGRESSION 
 Softmax regression is a generalization of logistic 
regression. 
 Our hypothesis will output

h_\theta(x) = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_k \end{bmatrix}

 In other words, our hypothesis will output the estimated probability p(y = i \mid x; \theta) for every value of i = 1, \ldots, k.
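A minimal NumPy sketch of this hypothesis (illustrative; the parameter layout, with one row of `theta` per class, is an assumption):

```python
import numpy as np

def softmax_hypothesis(theta, x):
    """Return [phi_1, ..., phi_k]: estimated p(y = i | x; theta) per class."""
    scores = theta @ x                    # omega_i = theta_i^T x, shape (k,)
    scores = scores - scores.max()        # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # phi_i = e^{omega_i} / sum_j e^{omega_j}

theta = np.array([[0.5, -1.0],            # k = 3 classes, 2 features
                  [0.0,  0.2],
                  [-0.3,  0.8]])
x = np.array([1.0, 2.0])
print(softmax_hypothesis(theta, x))       # three probabilities summing to 1
```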
ENSEMBLE LEARNING 
 Ensemble learning uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. 
 Ensemble learning is primarily used to improve the 
prediction performance of a model, or reduce the 
likelihood of an unfortunate selection of a poor one.
HOW GOOD ARE ENSEMBLES? 
 Let’s look at the Netflix Prize competition…
NETFLIX PRIZE : STARTED IN OCT 2006 
 Supervised Learning Task 
 Training data is a set of users and the ratings (1, 2, 3, 4, or 5 stars) those users have given to movies. 
 Construct a classifier that, given a user and an unrated movie, correctly classifies that movie as either 1, 2, 3, 4, or 5 stars. 
 $1 million prize for a 10% improvement over Netflix's current movie recommender/classifier.
NETFLIX PRIZE : LEADER BOARD
ENSEMBLE LEARNING : GENERAL IDEA
ENSEMBLE LEARNING : BAGGING 
 Given: 
 A training set S of N examples. 
 A class of learning models (decision tree, NB, SVM, RF, etc.). 
 Training: 
 At each iteration i, a training set S_i of N tuples is sampled with replacement from S. 
 A classifier model M_i is learned for each training set S_i. 
 Classification: classify an unknown sample x. 
 Each classifier M_i returns its class prediction. 
 The bagged classifier M* counts the votes and assigns the class with the most votes.
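A hypothetical sketch of this procedure with scikit-learn-style estimators (names and the choice of base learner are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, base_estimator, n_models=25, seed=0):
    """Train n_models classifiers, each on a bootstrap sample S_i of size N."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)       # sample N tuples with replacement
        models.append(clone(base_estimator).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """M*: each M_i votes; return the class with the most votes."""
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]

# usage sketch: models = bagging_fit(X_train, y_train, DecisionTreeClassifier())
#               y_hat  = bagging_predict(models, X_test[0])
```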
ENSEMBLE LEARNING : BAGGING 
 Bagging reduces variance by voting/averaging. 
 Can help a lot when the data is noisy. 
 If the learning algorithm is unstable, bagging almost always improves performance.
ENSEMBLE LEARNING : RANDOM FORESTS 
 A Random Forest grows many classification trees. 
 To classify a new object from an input vector, put 
the input vector down each of the trees in the 
forest. 
 Each tree gives a classification, and we say the tree 
"votes" for that class. 
 The forest chooses the classification having the 
most votes (over all the trees in the forest).
ENSEMBLE LEARNING : RANDOM FORESTS 
 Each tree is grown as follows: 
 If the number of cases in the training set is N, 
sample N cases at random - but with replacement, 
from the original data. This sample will be the 
training set for growing the tree. 
 If there are M input variables, a number m<<M is 
specified such that at each node, m variables are 
selected at random out of the M and the best split 
on these m is used to split the node. The value of m 
is held constant during the forest growing. 
 Each tree is grown to the largest extent possible. 
There is no pruning.
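A minimal sketch with scikit-learn's RandomForestClassifier (an assumed library choice; the talk doesn't name one). `max_features` plays the role of m << M, and trees are grown without pruning by default:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees, each grown on a bootstrap sample
    max_features="sqrt",  # m variables tried at each node, m << M
    random_state=0,
).fit(X_train, y_train)

print(forest.score(X_test, y_test))   # majority vote over all trees
print(forest.feature_importances_)    # estimates of variable importance
```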
FEATURES OF RANDOM FORESTS 
 Among the most accurate of current algorithms. 
 Runs efficiently on large databases. 
 It can handle thousands of input variables without 
variable deletion. 
 It gives estimates of what variables are important in 
the classification. 
 Effective method for estimating missing data and 
maintains accuracy when a large proportion of the 
data are missing. 
 Generated forests can be saved for future use on 
other data.
ENSEMBLE LEARNING : BOOSTING 
 Create a sequence of classifiers, giving higher influence to more accurate classifiers. 
 At each iteration, make the currently misclassified examples more important (they get a larger weight in the construction of the next classifier). 
 Then combine the classifiers by a weighted vote (weights given by classifier accuracy).
ENSEMBLE LEARNING : BOOSTING 
 Suppose there are just 7 training examples {1,2,3,4,5,6,7}. 
 Initially each example has a 1/7 (≈ 0.143) probability of being sampled. 
 The 1st round of boosting samples (with replacement) 7 examples {3,5,5,4,6,7,3} and builds a classifier from them. 
 Suppose examples {2,3,4,6,7} are correctly predicted by this classifier and examples {1,5} are wrongly predicted: 
 Weights of examples {1,5} are increased. 
 Weights of examples {2,3,4,6,7} are decreased. 
 The 2nd round of boosting again samples 7 examples, but now examples {1,5} are more likely to be sampled. 
 And so on, until some convergence is achieved.
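A hypothetical sketch of this reweighting loop in the style of AdaBoost (the talk doesn't name a specific algorithm; labels are assumed to be ±1 and `train_weak` is any weighted weak learner):

```python
import numpy as np

def boost(X, y, train_weak, n_rounds=10):
    """y in {-1, +1}; train_weak(X, y, w) returns h with h(X) -> predictions."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # initially each example: 1/n
    models, alphas = [], []
    for _ in range(n_rounds):
        h = train_weak(X, y, w)                  # e.g. a decision stump
        pred = h(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
        if err >= 0.5:                           # weak learner must beat chance
            break
        alpha = 0.5 * np.log((1 - err) / err)    # accurate rounds get more say
        w *= np.exp(-alpha * y * pred)           # raise weights of mistakes
        w /= w.sum()                             # renormalize to a distribution
        models.append(h)
        alphas.append(alpha)
    # Final strong learner: weighted vote of the weak learners.
    return lambda X_: np.sign(sum(a * h(X_) for a, h in zip(alphas, models)))
```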
ENSEMBLE LEARNING : BOOSTING 
 Weights models according to their performance. 
 Encourages each new model to become an “expert” for instances misclassified by earlier models. 
 Combines “weak learners” to generate a “strong learner”.
ENSEMBLE LEARNING 
 The Netflix Prize winner used gradient boosted decision trees. 
 http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
THANK YOU FOR YOUR ATTENTION
 Ask questions to narrow down possibilities 
 Informatica building example 
 Mango machine learning 
 Cannot look at all trees
