DECISION TREE, SOFTMAX 
REGRESSION AND ENSEMBLE 
METHODS IN MACHINE LEARNING 
- Abhishek Vijayvargia
WHAT IS MACHINE LEARNING 
 Formal Approach 
 Field of study that gives computers the ability to learn without being explicitly programmed.
 Informal Approach
MACHINE LEARNING 
 Supervised Learning 
 Supervised learning is the machine learning task of 
inferring a function from labeled training data. 
 Approximation 
 Unsupervised Learning 
 Trying to find hidden structure in unlabeled data. 
 Examples given to the learner are unlabeled, there is no 
error or reward signal to evaluate a potential solution. 
 Shorter Description 
 Reinforcement learning 
 Learning by interacting with an environment
SUPERVISED LEARNING 
 Classification 
 Output variable takes class labels. 
 Ex. Predicting whether a mail is spam or ham
 Regression 
 Output variable is numeric or continuous. 
 Ex. Predicting temperature
DECISION TREES 
 Is this restaurant good? 
 ( YES/ NO)
DECISION TREES 
 What are the factors that decide whether a restaurant is good for you or not?
 Type : Italian, South Indian, French 
 Atmosphere: Casual, Fancy 
 How many people are inside it? (10 < people < 30)
 Cost 
 Weather outside : Rainy, Sunny, Cloudy 
 Hungry : Yes/No
DECISION TREE 
[Figure: an example decision tree. The root splits on Hungry (True/False); internal nodes test Rainy, People > 10, Type (French / South Indian) and Cost (More / Less); the leaves are YES / No.]
DECISION TREE LEARNING 
 Pick the best attribute 
 Make a decision tree node containing that attribute 
 For each value of that attribute, create a descendant of the node 
 Sort the training examples to the leaves 
 Iterate on the subsets using the remaining attributes (see the sketch below)
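A minimal sketch of this recursive procedure in Python. The list-of-dicts dataset format, the "label" key and the `score` argument are assumptions for illustration; a suitable `score` is the information gain defined on the following slides.

```python
# ID3-style sketch: pick the best attribute, make a node, branch on each value,
# send the examples down, and recurse on the remaining attributes.
def build_tree(examples, attributes, score, target="label"):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                      # pure node: stop
        return labels[0]
    if not attributes:                             # no attributes left: majority class
        return max(set(labels), key=labels.count)
    best = max(attributes, key=lambda a: score(examples, a, target))  # pick best attribute
    node = {best: {}}
    for value in {e[best] for e in examples}:      # one descendant per attribute value
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, rest, score, target)
    return node
```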
DECISION TREE : PICK BEST ATTRIBUTE 
[Figure: three candidate True/False splits of the same +/- examples (Graph 1, Graph 2, Graph 3). Graphs 1 and 2 leave the classes mixed on both branches; Graph 3 sends all + examples down one branch and all - examples down the other.]
DECISION TREE : PICK BEST ATTRIBUTE 
 Select the attribute which gives MAXIMUM Information 
Gain. 
 Gain measures how well a given attribute separates 
training examples into targeted classes. 
 Entropy is a measure of the amount of uncertainty in the 
(data) set. 
H(S) = − Σ_{x∈X} p(x) log_2 p(x) 
S : the current data set for which entropy is calculated 
X : the set of classes in S 
p(x) : the proportion of the number of elements in class x to the number of elements in set S
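A small sketch of this entropy computation, assuming the examples' class labels are passed in as a plain list:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = - sum over classes x of p(x) * log2 p(x), using class proportions."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["+", "+", "-", "-"]))  # 1.0: a 50/50 split is maximally uncertain
print(entropy(["+", "+", "+"]))       # 0.0: a pure set has no uncertainty
```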
DECISION TREE : INFORMATION GAIN 
 Information gain IG(A) is the measure of the 
difference in entropy from before to after the set S 
is split on an attribute A. 
 In other words, how much uncertainty in S was 
reduced after splitting set S on attribute A. 
IG(A, S) = H(S) − Σ_{t∈T} p(t) H(t) 
H(S) : entropy of set S 
T : the subsets created from splitting set S on attribute A, such that S = ⋃_{t∈T} t 
p(t) : the proportion of the number of elements in t to the number of elements in set S
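A sketch of the corresponding information-gain computation, reusing the `entropy` helper above. The list-of-dicts example format and the tiny restaurant-style data set are hypothetical; the function also matches the `score` signature of the earlier `build_tree` sketch.

```python
def information_gain(examples, attribute, target="label"):
    """IG(A, S) = H(S) - sum over subsets t of p(t) * H(t), splitting S on `attribute`."""
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

restaurants = [{"Hungry": "Yes", "label": "YES"}, {"Hungry": "Yes", "label": "YES"},
               {"Hungry": "No",  "label": "No"},  {"Hungry": "No",  "label": "YES"}]
print(information_gain(restaurants, "Hungry"))  # how much splitting on Hungry reduces entropy
```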
DECISION TREE ALGORITHM : BIAS 
 Restriction Bias : the hypothesis space is all possible decision trees. 
 Preference Bias : which decision trees does the algorithm prefer?
 Good split at TOP 
 Correct over Incorrect 
 Shorter tree
DECISION TREE : CONTINUOUS ATTRIBUTE 
 Branch on number of possible values? 
 Branch only on the ages that appear in the training set? 
 Useless when we get an age not present in the training set 
 Represent it as a range (see the threshold sketch below)
[Figure: an age number line partitioned into ranges, e.g. 20 <= Age < 30.]
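One common way to realise this in code is to binarise the continuous attribute with a test `attribute <= t`, choosing t by information gain. A sketch, reusing the `entropy` helper above and assuming the same list-of-dicts example format:

```python
def best_threshold(examples, attribute, target="label"):
    """Pick a split point t for a continuous attribute: try the midpoints of
    consecutive sorted values and keep the most informative one."""
    values = sorted({e[attribute] for e in examples})
    labels = [e[target] for e in examples]
    best_t, best_gain = None, -1.0
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [e[target] for e in examples if e[attribute] <= t]
        right = [e[target] for e in examples if e[attribute] > t]
        gain = (entropy(labels)
                - (len(left) / len(labels)) * entropy(left)
                - (len(right) / len(labels)) * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain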
DECISION TREE : CONTINUOUS ATTRIBUTE 
 Does it make sense to repeat an attribute along a 
path in the tree? 
[Figure: a tree path that tests attribute A, then B, then A again.]
DECISION TREE : WHEN DO WE STOP? 
 Everything classified correctly? (noisy data can give two answers for the same example) 
 No more attributes? (not good for continuous attributes, which allow infinitely many splits) 
 Pruning (see the sketch below)
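In practice these stopping and pruning choices are usually exposed as hyperparameters. A hedged sketch assuming scikit-learn is available (not something this deck prescribes): `max_depth` and `min_samples_leaf` stop growth early, while `ccp_alpha` prunes the grown tree by cost-complexity pruning.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, ccp_alpha=0.01)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())  # the tree stays small instead of overfitting
```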
SOFTMAX REGRESSION 
 Softmax Regression ( or multinomial logistic 
regression) is a classification method that 
generalizes logistic regression to multiclass 
problems. (i.e. with more than two possible discrete 
outcomes.) 
 Used to predict the probabilities of the different 
possible outcomes of a categorically distributed 
dependent variable, given a set of independent 
variables (which may be real-valued, binary-valued, 
categorical-valued, etc.).
LOGISTIC REGRESSION 
 Logistic regression is used to refer specifically to 
the problem in which the dependent variable is 
binary ( only two categories). 
 As the output variable y ∈ {0, 1}, it seems natural to choose the Bernoulli family of distributions to model the conditional distribution of y given x. 
 Logistic function (which always takes on values between zero and one): 
F(t) = 1 / (1 + e^{−t}), so with t = θ^T x, F(θ^T x) = 1 / (1 + e^{−θ^T x})
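A minimal sketch of this logistic function; the θ and x values are hypothetical, for illustration only:

```python
import numpy as np

def sigmoid(t):
    """Logistic function F(t) = 1 / (1 + e^(-t)); always between zero and one."""
    return 1.0 / (1.0 + np.exp(-t))

theta = np.array([0.5, -1.0])          # hypothetical parameters
x = np.array([2.0, 1.0])               # hypothetical input
print(sigmoid(theta @ x))              # modelled probability that y = 1 given x
```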
SOFTMAX REGRESSION 
 Used in classification problems in which the response variable y can take on any one of k values: 
 y ∈ {1, 2, …, k}.
 Ex. Classify emails into three classes { Primary, 
Social, Promotions } 
 Response variable is still discrete but can take 
more than two values. 
 To derive a Generalized Linear Model for multinomial data, we begin by expressing the multinomial as an exponential family distribution.
SOFTMAX REGRESSION 
 To parameterize a multinomial over k possible outcomes, we could use k parameters φ_1, …, φ_k specifying the probability of each outcome. 
 These parameters are redundant, because Σ_{i=1}^{k} φ_i = 1. So φ_i = p(y = i; φ) and p(y = k; φ) = 1 − Σ_{i=1}^{k−1} φ_i. 
 The indicator function 1{·} takes a value of 1 if its argument is true, and 0 otherwise. 
 1{True} = 1, 1{False} = 0.
SOFTMAX REGRESSION 
 The multinomial is a member of the exponential family: 
p(y; φ) = φ_1^{1{y=1}} · φ_2^{1{y=2}} · … · φ_k^{1{y=k}} 
        = φ_1^{1{y=1}} · φ_2^{1{y=2}} · … · φ_k^{1 − Σ_{i=1}^{k−1} 1{y=i}} 
        = b(y) exp(ω^T T(y) − a(ω)) 
where 
ω = [ log(φ_1/φ_k), log(φ_2/φ_k), …, log(φ_{k−1}/φ_k) ]^T 
a(ω) = − log φ_k 
b(y) = 1 
T(y) ∈ R^{k−1}
SOFTMAX REGRESSION 
 The link function is given as 
ω_i = log(φ_i / φ_k) 
To invert the link function and derive the response function: 
e^{ω_i} = φ_i / φ_k 
φ_k e^{ω_i} = φ_i 
φ_k Σ_{i=1}^{k} e^{ω_i} = Σ_{i=1}^{k} φ_i = 1
SOFTMAX REGRESSION 
 So we get φ_k = 1 / Σ_{i=1}^{k} e^{ω_i}, and substituting back gives the response function 
φ_i = e^{ω_i} / Σ_{j=1}^{k} e^{ω_j} 
 The conditional distribution of y given x is given by 
p(y = i | x; θ) = φ_i = e^{ω_i} / Σ_{j=1}^{k} e^{ω_j} = e^{θ_i^T x} / Σ_{j=1}^{k} e^{θ_j^T x}
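A minimal sketch of this response (softmax) function; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the derivation above:

```python
import numpy as np

def softmax(omega):
    """Response function: phi_i = e^(omega_i) / sum_j e^(omega_j)."""
    z = np.exp(omega - np.max(omega))  # subtract the max for numerical stability
    return z / z.sum()

phi = softmax(np.array([2.0, 1.0, 0.1]))
print(phi, phi.sum())                  # a valid distribution: entries sum to 1
```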
SOFTMAX REGRESSION 
 Softmax regression is a generalization of logistic 
regression. 
 Our hypothesis will output 
h_θ(x) = [ φ_1, φ_2, …, φ_k ]^T 
 In other words, our hypothesis will output the estimated probability p(y = i | x; θ) for every value of i = 1, …, k.
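A sketch of this hypothesis, assuming the parameters are stacked into a (k × n) matrix θ whose row i is θ_i; the specific numbers are hypothetical:

```python
import numpy as np

def h(theta, x):
    """Softmax regression hypothesis: row i of theta is theta_i; returns
    [phi_1, ..., phi_k], the estimated p(y = i | x; theta) for each class."""
    z = np.exp(theta @ x - np.max(theta @ x))   # stable softmax over theta_i^T x
    return z / z.sum()

theta = np.array([[1.0, 0.0],                   # hypothetical parameters: k = 3 classes,
                  [0.0, 1.0],                   # n = 2 features
                  [0.5, 0.5]])
x = np.array([2.0, 1.0])
phi = h(theta, x)
print(phi, phi.argmax())                        # class probabilities and predicted class
```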
ENSEMBLE LEARNING 
 Ensemble learning uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
 Ensemble learning is primarily used to improve the 
prediction performance of a model, or reduce the 
likelihood of an unfortunate selection of a poor one.
HOW GOOD ARE ENSEMBLES? 
 Let’s look at the Netflix Prize competition…
NETFLIX PRIZE : STARTED IN OCT 2006 
 Supervised Learning Task 
 Training data is a set of users and the ratings (1, 2, 3, 4, 5 stars) those users have given to movies. 
 Construct a classifier that, given a user and an unrated movie, correctly classifies that movie as 1, 2, 3, 4 or 5 stars. 
 $1 million prize for a 10% improvement over Netflix's current movie recommender/classifier.
NETFLIX PRIZE : LEADER BOARD
ENSEMBLE LEARNING : GENERAL IDEA
ENSEMBLE LEARNING : BAGGING 
 Given : 
 Training Set of N examples 
 A class of learning models ( decision tree, NB, SVM,RF 
etc. ) 
 Training : 
 At each iteration i, a training set S_i of N tuples is sampled with replacement from S. 
 A classifier model M_i is learned for each training set S_i. 
 Classification : Classify an unknown sample x 
 Each classifier M_i returns its class prediction. 
 The bagged classifier M* counts the votes and assigns the class with the most votes (see the sketch below).
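A minimal sketch of this train/vote loop; the `learn` callable (any base learner that returns a callable model) and the list-based training set are assumptions for illustration:

```python
import random

def bag_train(S, learn, iterations=10):
    """Each iteration draws N tuples from S with replacement and fits one model M_i."""
    N = len(S)
    return [learn([random.choice(S) for _ in range(N)]) for _ in range(iterations)]

def bag_classify(models, x):
    """M*: every model votes; return the class with the most votes."""
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)
```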
ENSEMBLE LEARNING : BAGGING 
 Bagging reduces variance by voting/averaging. 
 Can help a lot when data is noisy. 
 If the learning algorithm is unstable, bagging almost always improves performance.
ENSEMBLE LEARNING : RANDOM FORESTS 
 A random forest grows many classification trees.
 To classify a new object from an input vector, put 
the input vector down each of the trees in the 
forest. 
 Each tree gives a classification, and we say the tree 
"votes" for that class. 
 The forest chooses the classification having the 
most votes (over all the trees in the forest).
ENSEMBLE LEARNING : RANDOM FORESTS 
 Each tree is grown as follows: 
 If the number of cases in the training set is N, 
sample N cases at random - but with replacement, 
from the original data. This sample will be the 
training set for growing the tree. 
 If there are M input variables, a number m<<M is 
specified such that at each node, m variables are 
selected at random out of the M and the best split 
on these m is used to split the node. The value of m 
is held constant during the forest growing. 
 Each tree is grown to the largest extent possible. 
There is no pruning.
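The same recipe expressed with scikit-learn, assuming it is available (a hedged sketch, not the deck's own implementation): `bootstrap=True` resamples N cases with replacement per tree, `max_features="sqrt"` picks m ≈ √M candidate variables at each node, and trees are grown without pruning by default.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100,    # grow many trees
                                max_features="sqrt", # m ~ sqrt(M) variables per node
                                bootstrap=True)      # N cases sampled with replacement
forest.fit(X, y)
print(forest.predict(X[:3]))  # each tree votes; the forest returns the majority class
```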
FEATURES OF RANDOM FORESTS 
 It is among the most accurate of current algorithms. 
 Runs efficiently on large databases.
 It can handle thousands of input variables without 
variable deletion. 
 It gives estimates of what variables are important in 
the classification. 
 Effective method for estimating missing data and 
maintains accuracy when a large proportion of the 
data are missing. 
 Generated forests can be saved for future use on 
other data.
ENSEMBLE LEARNING : BOOSTING 
 Create a sequence of classifiers, giving higher 
influence to more accurate classifiers. 
 At each iteration, make the examples currently misclassified more important (they get a larger weight in the construction of the next classifier). 
 Then combine the classifiers by weighted vote (weights given by classifier accuracy).
ENSEMBLE LEARNING : BOOSTING 
 Suppose there are just 7 training examples 
{1,2,3,4,5,6,7} 
 Initially each example has a 1/7 (≈ 0.143) probability of being sampled. 
 The 1st round of boosting samples (with replacement) 7 examples {3, 5, 5, 4, 6, 7, 3} and builds a classifier from them. 
 Suppose examples {2, 3, 4, 6, 7} are correctly predicted by this classifier and examples {1, 5} are wrongly predicted: 
 The weights of examples {1, 5} are increased. 
 The weights of examples {2, 3, 4, 6, 7} are decreased. 
 The 2nd round of boosting again takes 7 examples, but now examples {1, 5} are more likely to be sampled. 
 And so on, until some convergence is achieved (see the weight-update sketch below).
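A sketch of one such weight update in the AdaBoost style (the exact rule depends on the boosting variant); the numbers reproduce the 7-example scenario above:

```python
import math

def reweight(weights, correct, error):
    """One AdaBoost-style round: up-weight misclassified examples, down-weight the
    rest, renormalize; alpha is the classifier's influence in the weighted vote."""
    alpha = 0.5 * math.log((1 - error) / error)
    new = {i: w * math.exp(alpha if not correct[i] else -alpha)
           for i, w in weights.items()}
    total = sum(new.values())
    return {i: w / total for i, w in new.items()}, alpha

w = {i: 1 / 7 for i in range(1, 8)}                    # 7 examples, uniform weights
correct = {i: i not in (1, 5) for i in range(1, 8)}    # {1, 5} were misclassified
w, alpha = reweight(w, correct, error=2 / 7)
print(alpha, w)                                        # {1, 5} now carry more weight
```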
ENSEMBLE LEARNING : BOOSTING 
 Models are weighted according to their performance. 
 Each new model is encouraged to become an “expert” for instances misclassified by earlier models. 
 Combines “weak learners” to generate a “strong learner”.
ENSEMBLE LEARNING 
 The Netflix Prize (1st place) winner used gradient boosted decision trees.
 http://www.netflixprize.com/assets/GrandPrize2009 
_BPC_BellKor.pdf
THANK YOU FOR YOUR ATTENTION
 Ask questions to narrow down the possibilities 
 Informatica building example 
 Mango machine learning 
 Cannot look at all possible trees
