# Random Forest and KNN is fun

Fun slides to understand random forest and KNN.

Published in: Technology, News & Politics

### Random Forest and KNN is fun

1. Random Forest and K Nearest Neighbor
2. K Nearest Neighbor (KNN)
3. Logic of KNN: Find the historical records that look as similar as possible to the new record. Which group will I be classified into?
4. KNN instances and distance measure: Each instance (sample) is represented as a vector of numbers, so all instances correspond to points in an n-dimensional Euclidean space. North Carolina state bird: p = (p1, p2, ..., pn). Dinosaur: q = (q1, q2, ..., qn). How do we measure the distance between instances? Euclidean distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
5. K nearest neighbor: Look at the k nearest neighbors of the new record; you need to pick k to get the classification. People often pick 1, 3, or 5. Question: Why is the number of nearest neighbors often odd? Answer: Because the classification is decided by majority vote, and an odd k cannot produce a tie between two classes.
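The KNN slides above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production classifier: the function names, the toy records, and the two-feature vectors are all made up for the example.

```python
import math
from collections import Counter


def euclidean_distance(p, q):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_classify(train, new_point, k=3):
    """Label new_point by majority vote among its k nearest training records.

    train is a list of (vector, label) pairs.
    """
    neighbors = sorted(
        train, key=lambda rec: euclidean_distance(rec[0], new_point)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric features per record.
train = [
    ((1.0, 1.0), "bird"),
    ((1.2, 0.8), "bird"),
    ((0.9, 1.1), "bird"),
    ((5.0, 5.0), "dinosaur"),
    ((5.2, 4.9), "dinosaur"),
]

print(knn_classify(train, (1.1, 1.0), k=3))
print(knn_classify(train, (5.1, 5.0), k=3))
```

With k = 3 and two classes, every vote ends 3-0 or 2-1, which is the reason the slides give for choosing an odd k.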
6. Random Forest: A random forest is an ensemble of many decision trees.
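7. Example of a Decision Tree

   Training data:

   | Tid | Refund | Marital Status | Taxable Income | Cheat |
   |-----|--------|----------------|----------------|-------|
   | 1   | Yes    | Single         | 125K           | No    |
   | 2   | No     | Married        | 100K           | No    |
   | 3   | No     | Single         | 70K            | No    |
   | 4   | Yes    | Married        | 120K           | No    |
   | 5   | No     | Divorced       | 95K            | Yes   |
   | 6   | No     | Married        | 60K            | No    |
   | 7   | Yes    | Divorced       | 220K           | No    |
   | 8   | No     | Single         | 85K            | Yes   |
   | 9   | No     | Married        | 75K            | No    |
   | 10  | No     | Single         | 90K            | Yes   |

   Decision tree (splitting attributes: Refund, MarSt, TaxInc):

   - Refund = Yes → NO
   - Refund = No → test MarSt
     - MarSt = Married → NO
     - MarSt = Single, Divorced → test TaxInc
       - TaxInc < 80K → NO
       - TaxInc > 80K → YES

   Source: http://www.scribd.com/doc/56167859/7/Decision-Tree-Classification-Task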
8. Apply Model to Test Data: Start from the root of the tree. Decision tree: Refund (Yes → NO; No → MarSt); MarSt (Married → NO; Single, Divorced → TaxInc); TaxInc (< 80K → NO; > 80K → YES). Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ? Source: http://www.scribd.com/doc/56167859/7/Decision-Tree-Classification-Task
13. Apply Model to Test Data: For the test record (Refund = No, Marital Status = Married, Taxable Income = 80K), Refund = No leads to the MarSt node, and MarSt = Married leads to the leaf NO. Assign Cheat to “No”. Source: http://www.scribd.com/doc/56167859/7/Decision-Tree-Classification-Task
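The walk from the root to a leaf can be written as nested conditions. The sketch below hard-codes the example tree from the slides. The dictionary keys and the function name are hypothetical, and because the slides label the income branches only as "< 80K" and "> 80K", sending a record with exactly 80K down the YES branch is an assumption made here.

```python
def predict_cheat(record):
    """Walk the example tree from the root and return the leaf label."""
    # Root: Refund = Yes goes straight to the leaf NO.
    if record["refund"] == "Yes":
        return "No"
    # Refund = No: test marital status next.
    if record["marital_status"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # Assumption: exactly 80K is treated like the "> 80K" branch.
    return "No" if record["taxable_income"] < 80 else "Yes"


# Test record from the slides: Refund = No, Married, 80K.
test_record = {"refund": "No", "marital_status": "Married", "taxable_income": 80}
print(predict_cheat(test_record))
```

The test record never reaches the income node, so the boundary assumption does not affect its result.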
14. Special feature of the decision trees in a random forest: Trees should not be pruned. Each individual tree overfits (it does not generalize well), but that is okay after taking the majority vote (which will be explained later). Persecuting a tree is NOT allowed in the random forest world!
15. Logic of ensemble: A high-dimensional pattern recognition problem is as complicated as an elephant to a blind man: too many perspectives to touch and to know! A single decision tree is like a single blind man. It is subject to overfitting and is unstable. “Unstable” means that small changes in the training set lead to large changes in predictions.
16. The logic of ensemble, continued: A single blind man is limited. Why not send many blind men, let them investigate the elephant from different perspectives, and then aggregate their opinions? The MANY blind men approach is like random forest, an ensemble of many trees! In a random forest, each tree is like a blind man: it uses its training set (the part of the elephant it touched) to draw conclusions (build the trained model) and then to make predictions.
17. Translating it into a little bit of jargon:
    - Random forest is an ensemble classifier of many decision trees.
    - Each tree casts a vote at its terminal nodes. (For a binary endpoint, the vote will be “YES” or “NO”.)
    - The final prediction depends on the majority vote of the trees.
    - The motivation for generating multiple trees is to increase predictive accuracy.
18. Need to get some ensemble rules: To prevent a blind man from announcing that an elephant is like a carpet, there must be some rules so that the votes make as much sense as possible in aggregation. (Figure: elephant hair next to a carpet.)
19. Bootstrap (randomness by the samples): Bootstrap sampling creates new training sets by random sampling from the original data WITH replacement. (Figure: the dataset is resampled into Bootstrap Dataset 1, Bootstrap Dataset 2, Bootstrap Dataset 3, and so on, each leaving around 1/3 of the rows as OOB samples.) The bootstrap data (about 2/3 of the training data) is used to grow the tree, and the out-of-bag (OOB) samples are used for self-testing: to evaluate the performance of each tree and to get an unbiased estimate of the classification error. Bootstrap sampling is the mainstream choice for random forest; people sometimes use sampling without replacement.
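The "around 1/3" figure for OOB samples follows from the sampling itself: when n rows are drawn with replacement from n rows, a given row is never picked with probability (1 - 1/n)^n, which approaches e^(-1), about 0.368, as n grows. The short simulation below checks this; the function name and the choice of 10,000 rows are illustrative assumptions.

```python
import random


def bootstrap_split(n_rows, rng):
    """Draw n_rows indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 10_000
in_bag, oob = bootstrap_split(n_rows, rng)

oob_fraction = len(oob) / n_rows
theoretical = (1 - 1 / n_rows) ** n_rows
print(f"simulated OOB fraction: {oob_fraction:.3f}")
print(f"(1 - 1/n)^n:            {theoretical:.3f}")
```

The in-bag list still has n entries, but only about 63% of them are distinct rows, which is the "about 2/3" quoted on the slide.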
20. Random subspace (randomness by features):
    - For a bootstrap sample with M predictors, at each node m (m < M) variables are selected at random, and only those m features are considered for splitting. This lets trees grow using different features, like letting each blind man see the data from a different perspective.
    - Find the best split on the selected m variables.
    - The value of m is fixed while the forest is grown.
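The per-node feature sampling can be illustrated as follows. This is only a sketch: the function name is hypothetical, and the values M = 9 and m = 3 are picked for the example (a value near the square root of M is a common default for classification, but the slides do not prescribe one).

```python
import random


def sample_split_candidates(n_predictors, m, rng):
    """Return m distinct predictor indices to consider at one tree node."""
    if not 0 < m < n_predictors:
        raise ValueError("m must satisfy 0 < m < n_predictors")
    return sorted(rng.sample(range(n_predictors), m))


rng = random.Random(42)
M, m = 9, 3  # m stays fixed for the whole forest

# Each node draws its own subset, so different nodes see different features.
for node in range(3):
    print(f"node {node}: candidate predictors {sample_split_candidates(M, m, rng)}")
```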
21. How to classify new objects using random forest? Put the input vector down each of the trees in the forest. Each tree gives a classification (a vote), and the forest chooses the classification with the most votes over all the trees in the forest. (Figure: the new sample is passed to Tree 1, Tree 2, Tree 3, Tree 4, ..., Tree n; the final decision is the majority vote.)
22. Review in stats language
    - Definition: A random forest is a learning ensemble consisting of bagging (or another type of re-sampling) of unpruned decision tree learners with a randomized selection of features at each split.
    - Random forest algorithm: Let Ntrees be the number of trees to build. For each of the Ntrees iterations:
        1. Select a new bootstrap (or other type of re-sampling) sample from the training set.
        2. Grow an unpruned tree on this bootstrap sample.
        3. At each internal node, randomly select mtry predictors and determine the best split using only these predictors.
        4. Do not perform cost-complexity pruning. Save the tree as is, alongside those built thus far.
    - Output the overall prediction as the average response (regression) or the majority vote (classification) from all individually trained trees.
    - Source: http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
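The algorithm in the review slide can be sketched end to end in plain Python. Several things here are assumptions rather than anything the slides specify: all function names, the toy data, the use of Gini impurity to pick the "best split", and the simplification that a node becomes a leaf when none of its mtry sampled predictors improves the impurity. Class labels must not be tuples, because tuples are used to represent internal nodes.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) that most reduces Gini, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, mtry, rng):
    """Grow a tree with no pruning step; a leaf is just a class label."""
    if len(set(labels)) == 1:
        return labels[0]
    # Step 3: only mtry randomly chosen predictors compete at this node.
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), mtry))
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (
        f,
        t,
        grow_tree([rows[i] for i in left], [labels[i] for i in left], mtry, rng),
        grow_tree([rows[i] for i in right], [labels[i] for i in right], mtry, rng),
    )


def tree_predict(node, row):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def grow_forest(rows, labels, n_trees, mtry, seed=0):
    """Steps 1 and 2: one bootstrap sample and one unpruned tree per iteration."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(
            grow_tree([rows[i] for i in idx], [labels[i] for i in idx], mtry, rng)
        )
    return forest


def forest_predict(forest, row):
    """Classification output: majority vote over all trees."""
    votes = Counter(tree_predict(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated classes, two predictors.
rows = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (1.1, 1.3), (0.9, 0.7),
        (5.0, 5.2), (5.3, 4.8), (4.7, 5.1), (5.1, 4.9), (4.9, 5.3)]
labels = ["a"] * 5 + ["b"] * 5

forest = grow_forest(rows, labels, n_trees=25, mtry=1, seed=0)
print(forest_predict(forest, (0.0, 0.0)))
print(forest_predict(forest, (10.0, 10.0)))
```

For regression, the vote in `forest_predict` would be replaced by the average of the trees' numeric outputs, as the slide notes.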
23. Pattern recognition is fun: "Give me a place to stand on, and I will move the Earth with a lever." (Archimedes) Give the machine enough data and a good algorithm, and he/she will behave much like you. (Figures: lunar mining robot, Mars rover.)