Random Forest and KNN is fun

Fun slides to understand random forest and KNN.

Transcript

  • 1. Random Forest and K Nearest Neighbor
  • 2. K Nearest Neighbor (KNN)
  • 3. Logic of KNN: find the historical records that look as similar as possible to the new record. Which group will the new record be classified into?
  • 4. KNN instances and distance measure: each instance/sample is represented as a vector of numbers, so all instances correspond to points in an n-dimensional Euclidean space. North Carolina state bird: p = (p1, p2, ..., pn). Dinosaur: q = (q1, q2, ..., qn). How to measure the distance between instances? Euclidean distance: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2). (A code sketch follows the transcript.)
  • 5. K nearest neighbor: you look at the k nearest neighbors, and you need to pick k to get the classification – 1, 3, and 5 are values people often pick. Question: why is the number of nearest neighbors often an odd number? Answer: because the classification is decided by majority vote, and an odd k avoids ties! (Sketched in code after the transcript.)
  • 6. Random Forest: Random Forest is an ensemble of many decision trees.
  • 7. Example of a Decision Tree. Training data and the tree built from it (a code sketch follows the transcript):
      Tid  Refund  Marital Status  Taxable Income  Cheat
      1    Yes     Single          125K            No
      2    No      Married         100K            No
      3    No      Single          70K             No
      4    Yes     Married         120K            No
      5    No      Divorced        95K             Yes
      6    No      Married         60K             No
      7    Yes     Divorced        220K            No
      8    No      Single          85K             Yes
      9    No      Married         75K             No
      10   No      Single          90K             Yes
    Splitting attributes: Refund (Yes -> NO; No -> MarSt), MarSt (Married -> NO; Single, Divorced -> TaxInc), TaxInc (< 80K -> NO; > 80K -> YES). http://www.scribd.com/doc/56167859/7/Decision-Tree-Classification-Task
  • 8. Apply Model to Test Data. Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ? Start from the root of the tree. [decision-tree figure] http://www.scribd.com/doc/56167859/7/Decision-Tree-Classification-Task
  • 9.–12. Apply Model to Test Data. The same tree and test record, highlighting one step per slide: Refund = No sends the record to the MarSt node, and MarSt = Married sends it to the NO leaf. [decision-tree figures] http://www.scribd.com/doc/56167859/7/Decision-Tree-Classification-Task
  • 13. Apply Model to Test Data. Assign Cheat to “No”. (A hand-traced sketch of this walk-through follows the transcript.) [decision-tree figure] http://www.scribd.com/doc/56167859/7/Decision-Tree-Classification-Task
  • 14. Special feature of the decision trees in a random forest: trees should not be pruned. Each individual tree overfits (does not generalize well), but that is okay after taking the majority vote (which will be explained later). Persecuting a tree is NOT allowed in the random forest world!
  • 15. Logic of ensemble: a high-dimensional pattern recognition problem is as complicated as an elephant to a blind man – too many perspectives to touch and to know! A single decision tree is like a single blind man: it is subject to overfitting and instability. “Unstable” means that small changes in the training set lead to large changes in predictions.
  • 16. The logic of ensemble – continued: a single blind man is limited. Why not send many blind men, let them investigate the elephant from different perspectives, and then aggregate their opinions? The many-blind-men approach is like random forest, an ensemble of many trees! In a random forest, each tree is like a blind man: it uses its training set (the part of the elephant it touched) to draw conclusions (build the model) and then to make predictions.
  • 17. Translating it into a bit of jargon: Random forest is an ensemble classifier of many decision trees. Each tree casts a vote at its terminal nodes (for a binary endpoint, the vote will be “YES” or “NO”). The final prediction depends on the majority vote of the trees. The motivation for generating multiple trees is to increase predictive accuracy.
  • 18. Need to get some ensemble rules: to avoid a blind man announcing that an elephant is like a carpet, there must be some rules so that the votes make as much sense as they can in aggregation. [figure: elephant hair vs. a carpet]
  • 19. Bootstrap (randomness from the samples). Bootstrap sampling: create new training sets by randomly sampling from the original data WITH replacement. [figure: the dataset is resampled into Bootstrap Dataset 1, Bootstrap Dataset 2, Bootstrap Dataset 3, ..., each leaving out-of-bag (OOB) samples (around 1/3)] The bootstrap data (about 2/3 of the training data) is used to grow the tree, and the OOB samples are for self-testing – to evaluate the performance of each tree and to get an unbiased estimate of the classification error. Bootstrap sampling with replacement is the mainstream choice for random forest; people sometimes use sampling without replacement instead. (A resampling sketch follows the transcript.)
  • 20. Random subspace (randomness from the features). For a bootstrap sample with M predictors, at each node m (m < M) variables are selected at random, and only those m features are considered for splitting. This lets the trees grow using different features, like letting each blind man see the data from a different perspective. Find the best split over the selected m variables. The value of m is fixed while the forest is grown. (Sketched in code after the transcript.)
  • 21. How to classify new objects using a random forest? Run the input vector down each of the trees in the forest. Each tree gives a classification (a vote), and the forest chooses the classification having the majority of votes over all the trees in the forest. [figure: the new sample goes to Tree 1, Tree 2, Tree 3, Tree 4, ..., Tree n; final decision by majority vote] (See the vote-counting sketch after the transcript.)
  • 22. Review in stats language. Definition: a random forest is a learning ensemble consisting of bagging (or another type of re-sampling) of un-pruned decision tree learners, with a randomized selection of features at each split. Random forest algorithm: let Ntrees be the number of trees to build; for each of the Ntrees iterations: 1. Select a new bootstrap (or other re-sampling) sample from the training set. 2. Grow an un-pruned tree on this bootstrap sample. 3. At each internal node, randomly select mtry predictors and determine the best split using only those predictors. 4. Do not perform cost-complexity pruning; save the tree as is, alongside those built thus far. Output the overall prediction as the average response (regression) or the majority vote (classification) from all individually trained trees. (A runnable sketch follows the transcript.) http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
  • 23. Pattern recognition is fun. “Give me a place to stand on, and I will move the Earth with a lever.” – Archimedes. Give the machine enough data and the right algorithm, and it will behave much like you. [figures: lunar mining robot, Mars Rover]
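
The Euclidean distance on slide 4, as a minimal NumPy sketch (NumPy is assumed to be available; the bird and dinosaur feature values below are made up for illustration):

    import numpy as np

    def euclidean_distance(p, q):
        """Distance between two instances represented as numeric vectors."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return np.sqrt(np.sum((p - q) ** 2))

    # Hypothetical 3-feature vectors (e.g., length, weight, wing span), for illustration only.
    state_bird = [0.2, 0.1, 0.3]
    dinosaur = [12.0, 7000.0, 0.0]
    print(euclidean_distance(state_bird, dinosaur))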
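
A from-scratch sketch of the majority vote on slide 5 (not code from the deck); the toy 2-D points and labels are invented, and k = 3 keeps the vote odd:

    from collections import Counter
    import numpy as np

    def knn_predict(X_train, y_train, x_new, k=3):
        """Classify x_new by majority vote among its k nearest training points."""
        X_train = np.asarray(X_train, dtype=float)
        dists = np.sqrt(((X_train - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]                  # indices of the k closest neighbors
        votes = Counter(np.asarray(y_train)[nearest])    # count labels among those neighbors
        return votes.most_common(1)[0][0]                # majority label wins

    # Toy 2-D data: two classes, one new point to classify.
    X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
    y = ["A", "A", "A", "B", "B", "B"]
    print(knn_predict(X, y, [2, 2], k=3))   # -> "A"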
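
The slide-7 tree written out as nested if/else, together with the training table from that slide; this exact encoding is mine, not the deck's, and the final check confirms the tree reproduces every training label:

    # Training data from slide 7 (taxable income in thousands).
    training_data = [
        # (Tid, Refund, Marital Status, Taxable Income, Cheat)
        (1, "Yes", "Single",   125, "No"),
        (2, "No",  "Married",  100, "No"),
        (3, "No",  "Single",    70, "No"),
        (4, "Yes", "Married",  120, "No"),
        (5, "No",  "Divorced",  95, "Yes"),
        (6, "No",  "Married",   60, "No"),
        (7, "Yes", "Divorced", 220, "No"),
        (8, "No",  "Single",    85, "Yes"),
        (9, "No",  "Married",   75, "No"),
        (10, "No", "Single",    90, "Yes"),
    ]

    def slide7_tree(refund, marital_status, taxable_income):
        """The decision tree drawn on slide 7, written as nested if/else."""
        if refund == "Yes":
            return "No"
        if marital_status == "Married":
            return "No"
        return "No" if taxable_income < 80 else "Yes"

    # The tree reproduces every label in the training table.
    for tid, refund, marst, income, cheat in training_data:
        assert slide7_tree(refund, marst, income) == cheat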
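
The walk-through on slides 8–13, hand-traced in the same style; the test record and the final “No” come from the slides, while the print statements are only narration:

    # Test record from slides 8-13 (income in thousands).
    refund, marital_status, taxable_income = "No", "Married", 80

    # Walk the slide-7 tree one node at a time, as the slides do.
    print("Refund node:", refund)                      # No -> go to the MarSt node
    if refund == "Yes":
        cheat = "No"
    else:
        print("MarSt node:", marital_status)           # Married -> NO leaf
        if marital_status == "Married":
            cheat = "No"
        else:
            print("TaxInc node:", taxable_income)      # only reached for Single/Divorced
            cheat = "No" if taxable_income < 80 else "Yes"

    print("Assign Cheat to:", cheat)                   # -> No, as on slide 13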
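
A sketch of the bootstrap/OOB split described on slide 19, using NumPy; the training-set size of 150 is an arbitrary stand-in:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 150                                   # pretend training-set size
    indices = np.arange(n)

    # One bootstrap sample: draw n row indices WITH replacement.
    boot = rng.choice(indices, size=n, replace=True)

    # Out-of-bag (OOB) samples: rows never drawn into this bootstrap sample.
    oob = np.setdiff1d(indices, boot)

    print(f"distinct rows in bootstrap: {len(np.unique(boot))} of {n}")
    print(f"OOB fraction: {len(oob) / n:.2f}")   # roughly 1/3, as slide 19 says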
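
The random-subspace step from slide 20, sketched for a single node; choosing m = sqrt(M) is a common convention, not something the slide specifies:

    import numpy as np

    rng = np.random.default_rng(0)

    M = 20                     # total number of predictors
    m = int(np.sqrt(M))        # value of m is fixed while the forest is grown

    # At each node, draw m of the M features at random and search for the best
    # split among those m only.
    candidate_features = rng.choice(M, size=m, replace=False)
    print("features considered at this node:", sorted(candidate_features))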
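
The majority vote over trees from slide 21, with made-up votes for a single new sample:

    from collections import Counter

    # Hypothetical votes cast by the individual trees for one new sample.
    tree_votes = ["YES", "NO", "YES", "YES", "NO", "YES", "YES"]

    # The forest's decision is the class receiving the most votes.
    final_decision = Counter(tree_votes).most_common(1)[0][0]
    print(final_decision)   # -> YES (5 votes to 2)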
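
One way to run the slide-22 algorithm in practice is scikit-learn's RandomForestClassifier (the deck does not reference scikit-learn); its parameters map onto the steps above, and the iris dataset is a stand-in since the deck uses no dataset:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(
        n_estimators=500,      # Ntrees: number of bootstrap/tree iterations
        max_features="sqrt",   # mtry: predictors tried at each internal node
        bootstrap=True,        # step 1: resample the training set with replacement
        oob_score=True,        # use the OOB samples for a built-in error estimate
        random_state=0,
    )                          # trees are grown un-pruned by default (steps 2 and 4)
    forest.fit(X_tr, y_tr)

    print("OOB accuracy:", round(forest.oob_score_, 3))
    print("test accuracy:", round(forest.score(X_te, y_te), 3))

With oob_score=True the forest reports the out-of-bag estimate described on slide 19 without needing a separate validation set.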
