Random Forest and K Nearest Neighbor
K Nearest Neighbor (KNN)
Logic of KNN
• Find the records in the historical data that look as similar as possible to the new record.
• Which group will I be classified into?
KNN instances and distance measure
• Each instance/sample is represented as a vector of numbers, so all instances correspond to points in an n-dimensional Euclidean space.
North Carolina state bird: p = (p1, p2, ..., pn)
Dinosaur: q = (q1, q2, ..., qn)
• How do we measure the distance between instances? Euclidean distance:
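For instances p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn), the Euclidean distance is

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$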
K nearest neighbor
• You look at the k nearest neighbors, and you need to pick k to get the classification – 1, 3, and 5 are values people often pick.
Question: Why is the number of nearest neighbors often an odd number?
Answer: Because the classification is decided by majority vote!
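To make the voting concrete, here is a minimal KNN sketch in Python; the training points and labels are made up purely for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training instance
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels (k is usually odd to avoid ties)
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two features, two classes
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.5], [9.0, 9.2]])
y_train = np.array(["bird", "bird", "dinosaur", "dinosaur"])
print(knn_predict(X_train, y_train, np.array([1.2, 2.1]), k=3))  # -> "bird"
```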
Random Forest
Random Forest is an ensemble of many decision
trees.
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Decision Tree (splitting attributes: Refund, Marital Status, Taxable Income):
Refund = Yes -> NO
Refund = No  -> Marital Status
  Married            -> NO
  Single or Divorced -> Taxable Income
    < 80K -> NO
    > 80K -> YES

http://www.scribd.com/doc/56167859/7/Decision-Tree-Classification-Task
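As a rough illustration of how such a tree could be built from the training data above, here is a sketch using pandas and scikit-learn (assuming both are installed); the learned splits may not exactly match the hand-drawn tree on the slide:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The training data from the slide (taxable income in thousands)
df = pd.DataFrame({
    "Refund":        ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":         ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# One-hot encode the categorical predictors so the tree can split on them
X = pd.get_dummies(df[["Refund", "MaritalStatus", "TaxableIncome"]])
y = df["Cheat"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```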
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the test record at each split: Refund = No, so move to the Marital Status node; Marital Status = Married, so reach the leaf labeled NO.
Assign Cheat to "No".
http://www.scribd.com/doc/56167859/7/Decision-Tree-Classification-Task
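A minimal sketch of this traversal in Python; the rules are hard-coded from the tree above, purely for illustration:

```python
def classify(record):
    """Walk the example decision tree for one test record (a dict of attribute values)."""
    if record["Refund"] == "Yes":
        return "No"                                  # Refund = Yes -> leaf NO
    if record["MaritalStatus"] == "Married":
        return "No"                                  # Married -> leaf NO
    # Single or Divorced: split on taxable income (in thousands), as on the slide
    return "No" if record["TaxableIncome"] < 80 else "Yes"

test = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}
print(classify(test))  # -> "No", i.e. assign Cheat to "No"
```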
Special feature of the decision trees in a random forest
• Trees should not be pruned.
• Each individual tree overfits (does not generalize well), but that is okay after taking the majority vote (which will be explained later).
Pruning a tree is NOT allowed in the random forest world!
Logic of ensemble
• A high-dimensional pattern recognition problem is as complicated as an elephant is to a blind man – too many perspectives to touch and to know!
A single decision tree is like a single blind man: it is subject to overfitting and is unstable.
"Unstable" means that small changes in the training set lead to large changes in predictions.
The logic of ensemble – continued
A single blind man is limited. Why not send many blind men, let them investigate the elephant from different perspectives, and then aggregate their opinions?
The MANY-blind-men approach is like a random forest, an ensemble of many trees!
In a random forest, each tree is like a blind man: it uses its training set (the part of the elephant it touched) to draw conclusions (build the model) and then to make predictions.
Translating it into a bit of jargon….
• Random forest is an ensemble classifier of many decision trees.
• Each tree casts a vote at its terminal nodes. (For a binary endpoint, the vote will be "YES" or "NO".)
• The final prediction depends on the majority vote of the trees.
• The motivation for generating multiple trees is to increase predictive accuracy.
Need to get some ensemble rules….
• A blind man who touches only the elephant's hair might announce that the elephant is like a carpet. To avoid this, there must be some rules so that the votes make as much sense as possible in aggregation.
Bootstrap (randomness from the samples)
Bootstrap sampling: create new training sets by randomly sampling from the original data WITH replacement.
From the original dataset we draw Bootstrap Dataset 1, Bootstrap Dataset 2, Bootstrap Dataset 3, and so on; each bootstrap draw leaves out-of-bag (OOB) samples (around 1/3 of the data).
The bootstrap data (about 2/3 of the training data) is used to grow the tree, and the OOB samples are used for self-testing – to evaluate the performance of each tree and to get an unbiased estimate of the classification error.
Bootstrapping is the mainstream approach in random forests; people sometimes use sampling without replacement instead.
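A minimal sketch of bootstrap sampling with out-of-bag bookkeeping in Python; the sample size is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150                                   # number of training samples (illustrative)

# One bootstrap sample: n indices drawn WITH replacement
boot_idx = rng.integers(0, n, size=n)

# Out-of-bag samples: indices never drawn into this bootstrap sample
oob_idx = np.setdiff1d(np.arange(n), boot_idx)

# On average roughly 2/3 of the samples land in the bootstrap set
print(len(np.unique(boot_idx)) / n)       # about 0.63
print(len(oob_idx) / n)                   # about 0.37
```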
Random subspace (randomness from the features)
• For a bootstrap sample with M predictors, at each node m (m < M) variables are selected at random, and only those m features are considered for splitting. This lets the trees grow using different features, like letting each blind man see the data from a different perspective.
• Find the best split on the selected m variables.
• The value of m is fixed while the forest is grown.
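A minimal sketch of the per-node feature subsampling; M and m are illustrative, and m ≈ √M is a common choice for classification:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 16                        # total number of predictors (illustrative)
m = int(np.sqrt(M))           # number of predictors tried at each node

# At each node, pick m distinct features at random; only these are
# considered when searching for the best split at that node.
candidate_features = rng.choice(M, size=m, replace=False)
print(candidate_features)     # e.g. 4 feature indices out of 16
```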
How to classify new objects using a random forest?
Put the input vector down each of the trees in the forest. Each tree gives a classification (a vote), and the forest chooses the classification having the majority of votes (over all the trees in the forest).
The new sample is run through Tree 1, Tree 2, Tree 3, Tree 4, …, Tree n.
Final decision – majority vote.
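A minimal sketch of the vote aggregation; the per-tree votes are made up for illustration:

```python
from collections import Counter

# Hypothetical votes cast by the individual trees for one new sample
tree_votes = ["Yes", "No", "No", "Yes", "No"]

# The forest's final decision is the majority vote over all trees
final_decision = Counter(tree_votes).most_common(1)[0][0]
print(final_decision)  # -> "No"
```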
Review in stats language
• Definition: A random forest is a learning ensemble consisting of bagging (or another type of re-sampling) of un-pruned decision tree learners with a randomized selection of features at each split.
• Random forest algorithm
  • Let Ntrees be the number of trees to build.
  • For each of the Ntrees iterations:
    1. Select a new bootstrap (or other re-sampling) sample from the training set.
    2. Grow an un-pruned tree on this bootstrap sample.
    3. At each internal node, randomly select mtry predictors and determine the best split using only these predictors.
    4. Do not perform cost-complexity pruning. Save the tree as is, alongside those built thus far.
  • Output the overall prediction as the average response (regression) or the majority vote (classification) from all individually trained trees.
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
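As a usage sketch, the same algorithm is available in scikit-learn (assuming it is installed); the dataset and parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data: 200 samples, 10 predictors, binary endpoint
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # Ntrees: number of un-pruned trees to grow
    max_features="sqrt",   # mtry: predictors considered at each split
    bootstrap=True,        # grow each tree on a bootstrap sample
    oob_score=True,        # use out-of-bag samples to estimate accuracy
    random_state=0,
)
forest.fit(X, y)
print(forest.oob_score_)      # OOB estimate of classification accuracy
print(forest.predict(X[:1]))  # majority vote over the 100 trees
```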
Pattern recognition is fun
Lunar mining robot
"Give me a place to stand on, and I will move the Earth with a lever .” – Archimedes
Give the machine enough data and algorithm, he/she will behave similar like you.
Mars Rover
