Random Forests R vs Python by Linda Uruchurtu
    Presentation Transcript

    • RANDOM FORESTS: R & PYTHON. Having fun when starting out in data analysis
    • WHO: LINDA URUCHURTU (@lindauruchurtu). Consultant at DBi, a web analytics & data consultancy. Physicist by training.
    • OUTLINE OF THIS TALK • Motivation • Random Forests: R & Python • Example: EMI music set • Concluding remarks
    • MOTIVATION
    • STARTING OUT IN DATA ANALYSIS • Online: blogs, GitHub, MOOCs, Kaggle, Data Tau, Cross Validated, Stackoverflow... • Books • School work TOO MANY RESOURCES
    • WHICH LANGUAGE SHOULD I USE? POPULAR QUESTION
    • LET’S ASK GOOGLE
    • MY EXPERIENCE: START WITH WHAT YOU KNOW & ASK YOUR FRIENDS • Programmed in C • Used MATLAB at uni • Spent a long time playing with symbolic languages: Mathematica & Maple. P.S. I had not met the IPython notebook.
    • MY EXPERIENCE (cont.): BIG REVEAL, I AM AN AVID R USER • Don’t have a web dev background • Surrounded by people doing stats • Pick the right tool for the task at hand. P.S. I had not met the IPython notebook.
    • LANGUAGE WARS. TL;DR: it can be confusing for a newbie. Too many articles about: • “Python Displacing R As The Programming Language For Data Analysis” • “Is Python really supplanting R for data work?” • “10 Reasons Python Rocks for Research” • “Why Python is steadily eating other languages' lunch” • “Why I’m betting on Julia” • “What are the advantages of using Python over R?” • “Why Python with Coffee is better than R with Ice Cream”
    • [FAVE LANG] is BETTER BECAUSE I SAY SO
    • LANGUAGE WARS However, it is good to have a general understanding of the pros and cons of the various data analysis tools, in order to pick the right tool for the job. • R has EVERYTHING you need for performing statistical analysis. • R / MATLAB / Python are great for prototyping. • Python is a full-featured programming language. • It is easier to incorporate Python outcomes into a full data product workflow.
    • DEFINE THE PROBLEM Time is better spent defining the problem and determining the best way to solve it. GOOD TO HAVE A BIG BAG OF TRICKS. WILL IT PYTHON? Re-do an R analysis using the Python data analysis stack (CREDIT: SLENDER MEANS).
    • PYTHON SCIKIT LEARN IT IS PRETTY AWESOME • Library of Machine Learning Algorithms • Open source • API • Python, Numpy & Co • Accessible, many models, documentation & examples
    • EXAMPLE
    • CHOOSING A PROBLEM: SCIENTIFIC METHOD FTW. Always a good idea to look for a data set that is interesting to you. 1. Choose a data set 2. Formulate a question 3. Formulate a hypothesis 4. Build a model to answer the question and test it
    • CHOOSING A DATA SET STEP 1
    • EMI MUSIC “ONE MILLION INTERVIEW SET” • One of the largest preference data sets in the world. • Extract used in the Data Science London hackathon and available on KAGGLE as four separate data sets.
    • FOUR DATA SETS • TRAIN / TEST - artist, track, userID, time & ratings • WORDS - userID, heard_of, own_artist_music , like_artist, 82 adjectives • USERS - userID, gender, age, working status, region, music, list_own (hours per day), list_back (hours per day), 19 user habits questions (0-100)
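    The analysis below works on a single wide dataframe built from these files. A minimal sketch of how the Kaggle CSVs might be combined with pandas; the file and column names are assumptions based on the descriptions above, not taken from the talk.

      import pandas as pd

      # Assumed file names from the Kaggle EMI extract described above.
      train = pd.read_csv("train.csv")   # Artist, Track, User, Time, Rating
      words = pd.read_csv("words.csv")   # Artist, User, heard_of, own_artist_music, like_artist, 82 adjectives
      users = pd.read_csv("users.csv")   # RESPID, gender, age, working status, region, music, list_own, list_back, Q1-Q19

      # Join ratings to the per-user/artist word associations and to the
      # user-level survey answers, giving one wide dataframe for modelling.
      df = train.merge(words, on=["Artist", "User"], how="left") \
                .merge(users, left_on="User", right_on="RESPID", how="left")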
    • USERS: KEY / STRING
      1: “Music is important to me but not necessarily most important”
      2: “I like music but it does not feature heavily in my life”
      3: “Music means a lot to me and it is a passion of mine”
      4: “Music has no particular interest to me”
      5: “Music is important to me but not necessarily more important than other hobbies”
      6: “Music is no longer as important as it used to be”
    • WORDS DATASET UNINSPIRED, AGGRESSIVE, UNATTRACTIVE, BORING, CHEAP, IRRELEVANT, WAY OUT, ANNOYING, CHEESY, UNORIGINAL, OUTDATED, UNAPPROACHABLE... 82 ADJECTIVES
    • (Word cloud of artist adjectives: WHOLESOME, LEGENDARY, OLD, PIONEER, DARK, WORDLY, NOSTALGIC, PROGRESSIVE, ICONIC)
    • USERS 19 MUSIC HABIT QUESTIONS: Rate (0-100) whether the user agrees with statements such as: “I enjoy actively searching for and discovering music that I have never heard before” “I am not willing to pay for music” “I like to be at the cutting edge of new music” “I love tech”
    • FORMULATE A QUESTION STEP 2
    • MOTIVATION
    • MOTIVATION • PRODUCTION - Cheaper to produce (lower barriers to entry for budding artists). • DISTRIBUTION - Internet has made music more accessible. Artists can decide where and how to sell. • CONSUMPTION - People’s listening habits have changed due to the internet and to the change in devices. TECHNOLOGY HAS BEEN A DISRUPTIVE FORCE IN THE MUSIC INDUSTRY.
    • PROBLEMS • ARTISTS - Easier to produce music, harder to make themselves known or earn a living. • RECORD COMPANIES - People buy per song, easy for listener to consume without paying. Wider competition field. • LISTENERS - Too many choices. Discovery is difficult.
    • QUESTIONS • Can one predict the rating of a song? • What factors are important to determine how much a person likes a song? • What is the minimal set of factors that are needed to determine how much a person likes a song?
    • FORMULATE A HYPOTHESIS STEP 3
    • FIRST ATTEMPT: CAN ONE PREDICT THE RATING OF A SONG? • Regression problem • Turn categorical variables into numeric variables • Consider ALL features and pick a machine learning algorithm to do the job.
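    One way to do the numeric conversion, sketched with pandas on the merged dataframe from the earlier sketch; the column names are assumptions.

      import pandas as pd

      # Hypothetical categorical columns; actual names depend on the Kaggle files.
      categorical = ["GENDER", "WORKING", "REGION", "MUSIC", "HEARD_OF", "OWN_ARTIST_MUSIC"]

      # One-hot encode the categoricals, keep numeric columns as they are,
      # and fill missing survey answers before modelling.
      X = pd.get_dummies(df.drop(["Rating", "Artist", "Track", "User", "Time"], axis=1),
                         columns=categorical)
      X = X.fillna(X.median())
      y = df["Rating"]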
    • FIRST ATTEMPT: CAN ONE PREDICT THE RATING OF A SONG? • Because exploratory analysis revealed that ratings are highly clustered, we can look at five different scores and formulate the problem as a classification one. We split the 0-100 ratings into 5 intervals, so each interval becomes a class and we label these.
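    A minimal sketch of the binning step; the exact interval boundaries are not given in the talk, so equal-width bins over 0-100 are an assumption.

      import pandas as pd

      # Cut the 0-100 ratings into five equal-width intervals labelled 1-5,
      # turning the regression target into a 5-class classification target.
      bins = [0, 20, 40, 60, 80, 100]
      y_class = pd.cut(df["Rating"], bins=bins, labels=[1, 2, 3, 4, 5],
                       include_lowest=True).astype(int)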
    • BUILD A MODEL STEP 4
    • RANDOM FORESTS
    • RANDOM FORESTS • Random Forests are built by aggregating trees. • Can be used for regression & classification problems. • They do not overfit and can handle a large number of features. • They also output a list of features believed to be important in predicting the target variable. A highly versatile ensemble method that combines several models into one. A.K.A. BEST “BLACK-BOX” METHOD EVER (BREIMAN / CUTLER)
    • RANDOM FORESTS: THE LAYMAN’S INTRO (E. CHEN’S BLOG, 2011). 20 questions about movies: WILL JAMIE LIKE X? BRIENNE IS THE DECISION TREE FOR JAMIE’S MOVIE PREFERENCES
    • RANDOM FORESTS: THE LAYMAN’S INTRO (E. CHEN’S BLOG, 2011). Ask Tywin, Cersei, Tyrion... Jamie gives each of them slightly different info: THEY FORM A BAGGED FOREST OF JAMIE’S MOVIE PREFERENCES. Jamie demands that they ask different questions every time: THEY NOW FORM A RANDOM FOREST OF JAMIE’S MOVIE PREFERENCES
    • RANDOM FORESTS • A tree of maximal depth is grown on a bootstrap sample of the training set. There is no pruning. • A number m << p is specified such that at each node, m variables are sampled at random out of the p available. The best split on these variables is used to split the node into two subnodes. • The final classification is given by majority voting over the ensemble of trees in the forest. • Only two “free” parameters: the number of trees and the number of variables in the random subset at each node.
    • RANDOM FORESTS OUT-OF-BAG (OOB) ERROR For each tree, the samples left out of its bootstrap sample act as a test set. The OOB error estimate is the misclassification error (MSE for regression) of these predictions, averaged over all samples. VARIABLE IMPORTANCE Determined by looking at how much the prediction error increases when the (OOB) data for that variable is permuted while all others are left unchanged.
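    In scikit-learn both quantities are exposed on the fitted estimator. A small runnable sketch on toy data; note that scikit-learn's built-in feature_importances_ is impurity-based (mean decrease in impurity), whereas the permutation importance described above is what R's randomForest reports as %IncMSE.

      from sklearn.datasets import make_classification
      from sklearn.ensemble import RandomForestClassifier

      # Toy data standing in for the EMI features (placeholder, not the real set).
      X_toy, y_toy = make_classification(n_samples=1000, n_features=20, n_classes=5,
                                         n_informative=8, random_state=0)

      # oob_score=True scores each training sample only with the trees that did
      # not see it in their bootstrap sample, giving the out-of-bag estimate.
      rf = RandomForestClassifier(n_estimators=60, oob_score=True, random_state=0)
      rf.fit(X_toy, y_toy)

      print("OOB score:", rf.oob_score_)              # 1 - OOB misclassification error
      print("Importances:", rf.feature_importances_)  # impurity-based per-feature scores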
    • RANDOM FORESTS IN R & PYTHON randomForest PACKAGE • Various implementations - randomForest, CARET, PARTY, BIGRF • We follow the KISS procedure - KEEP IT SIMPLE S. • One can test various values of mtry and the number of trees. Used randomForest package 4.6-7 with R 2.15. Defaults are n=500 trees & mtry= p/3 for regression & sqrt(p) for classification.
    • RANDOM FORESTS IN R & PYTHON: SCIKIT LEARN Used SCIKIT LEARN 0.14.1 running Python 2.7.5. COMPUTER: MacBook Pro 2.53 GHz Intel Core 2 Duo with 4 GB 1067 MHz DDR3 running OS X 10.6.8. For the comparison we will build “small” forests and focus on the following simple metrics: • Training time • RSQ & RMSE (regression) • Accuracy (classification)
    • RANDOM FORESTS IN R: RESULTS REGRESSION
      Split the data into training and test sets; each dataframe has 82,714 rows and 114 columns. Parameters: 60 trees, sample of 50,000.
      rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE)
      Training time: 39.39 min / RMSE: 14.587 / RSQ: 0.581
    • RANDOM FORESTS IN PYTHON: RESULTS REGRESSION
      Split the data into training and test sets; each dataframe has 82,714 rows and 114 columns. Parameters: 60 trees, sample of 50,000.
      rf = RandomForestRegressor(n_estimators=60, max_features='sqrt')
      Training time: 3 min 7 sec / RMSE: 14.687 / RSQ: 0.575
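    A sketch of the regression run behind numbers like these, assuming the feature matrix X and ratings y from the earlier sketches. The talk used scikit-learn 0.14.1, where train_test_split lived in sklearn.cross_validation; the sketch below uses the current module layout.

      import time
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import mean_squared_error, r2_score

      # A 50/50 split mirrors the "82,714 rows each" figure quoted above.
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

      rf = RandomForestRegressor(n_estimators=60, max_features="sqrt", random_state=0)

      start = time.time()
      rf.fit(X_train, y_train)
      print("Training time: %.1f s" % (time.time() - start))

      pred = rf.predict(X_test)
      print("RMSE: %.3f" % np.sqrt(mean_squared_error(y_test, pred)))
      print("RSQ:  %.3f" % r2_score(y_test, pred))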
    • RANDOM FORESTS IN R & PYTHON (side-by-side plots: R vs PYTHON / SCIKIT LEARN)
    • RANDOM FORESTS IN R: FEATURE IMPORTANCE (regression)
      Top features by % increase in MSE: Beautiful, Boring, Q16, Catchy, Talented, Q9, Q19, None of these, Age, Track
      Top features by increase in node purity: Talented, Like Artist, Catchy, Beautiful, Boring, Track, Distinctive, Cool, Q11, Q12
      Q16: I would be willing to pay for the opp to buy new music pre-release
      Q9: I am out of touch with new music
      Q19: I like to know about music before other people
      Q11: Pop music is fun
      Q12: Pop music helps me escape
      Like artist: To what extent do you like or dislike listening to this artist?
    • RANDOM FORESTS IN R FEATURE IMPORTANCE
    • RANDOM FORESTS IN PYTHON: FEATURE IMPORTANCE (rank of each feature in the R random forest shown in brackets)
      Distinctive (7), Catchy (3), Like Artist (2), Fun (-), Talented (1), Beautiful (4), Original (-), Unoriginal (-), Q11 (9), Own Artist Music (-)
      Own Artist Music: Do you have this artist in your music collection?
      Q11: Pop music is fun
    • RANDOM FORESTS IN R & PYTHON: RESULTS REGRESSION
      Model                                RMSE
      R Random Forest                      14.587
      Python Scikit Learn Random Forest    14.687
      Linear Regression                    16.23
      Multiple Linear Regs                 15.53
    • RANDOM FORESTS IN R: RESULTS CLASSIFICATION
      ratings_train <- as.factor(ratings_train)
      rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE)
      Training time: 8.75 min / OOB error rate: 44.01% / Accuracy: 0.567
      Confusion matrix:
               1      2      3     4     5
      1    16777   4863   1633   139    37
      2     5760  12411   6213   504    89
      3     1485   5559  13144  1880   329
      4      176    888   4094  2592   625
      5       59    204   1008   856  1388
    • RANDOM FORESTS IN PYTHON: RESULTS CLASSIFICATION
      rf = sk.RandomForestClassifier(n_estimators=60, compute_importances=True, oob_score=True)
      Training time: 2.56 min / OOB score: 0.1964 / Accuracy: 0.566
      Precision: 0.564 / Recall: 0.5653 / F1 score: 0.5611
      Confusion matrix:
               1      2      3     4     5
      1    16930   4682   1758   129    53
      2     5517  12369   6475   506   106
      3     1500   5367  13448  1737   275
      4      186    791   4171  2598   561
      5       48    161    999   880  1466
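    The quoted classification metrics can be reproduced with sklearn.metrics. A sketch assuming X and the binned target y_class from the earlier sketches; compute_importances was removed in later scikit-learn releases (importances are always available as feature_importances_), and since the slide does not say how precision/recall/F1 were aggregated over the five classes, weighted averaging is an assumption.

      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                   precision_score, recall_score)

      X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size=0.5, random_state=0)

      rf = RandomForestClassifier(n_estimators=60, oob_score=True, random_state=0)
      rf.fit(X_train, y_train)
      pred = rf.predict(X_test)

      print(confusion_matrix(y_test, pred))
      print("Accuracy:  %.3f" % accuracy_score(y_test, pred))
      # Weighted averages across the five rating classes (assumed aggregation).
      print("Precision: %.3f" % precision_score(y_test, pred, average="weighted"))
      print("Recall:    %.3f" % recall_score(y_test, pred, average="weighted"))
      print("F1 score:  %.3f" % f1_score(y_test, pred, average="weighted"))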
    • RANDOM FORESTS IN R: FEATURE IMPORTANCE (classification)
      Top features by % increase in MSE: Q9, Q7, Q5, Q6, Age, Q10, listBACK, Q19, listOWN, Q16
      Top features by increase in node purity: Track, Q11, Q12, Age, Q6, Q17, Q9, Q16, Q4, Q13
      Q16: I would be willing to pay for the opp to buy new music pre-release
      Q9: I am out of touch with new music
      Q19: I like to know about music before other people
      Q11: Pop music is fun
      Q12: Pop music helps me escape
      Q7: I enjoy music primarily from going out to dance
      Q5: I used to know where to find music
      Q6: I am not willing to pay for music
      Q10: My music collection is a source of pride
      Q4: I would like to buy new music but I don’t know what to buy
      Q17: I find seeing a new artist a useful way of discovering new music
    • RANDOM FORESTS IN PYTHON: FEATURE IMPORTANCE (rank of each feature in the R random forest shown in brackets)
      Q11 (2), Q12 (3), Age (4), Q6 (5), Q17 (6), Q5 (-), Q4 (9), Q10 (-), Q16 (7), Q7 (-)
      Q16: I would be willing to pay for the opp to buy new music pre-release
      Q11: Pop music is fun
      Q12: Pop music helps me escape
      Q5: I used to know where to find music
      Q6: I am not willing to pay for music
      Q10: My music collection is a source of pride
      Q4: I would like to buy new music but I don’t know what to buy
      Q17: I find seeing a new artist a useful way of discovering new music
    • RANDOM FORESTS IN R: CONFUSION MATRIX
               1      2      3     4     5   class error
      1    16777   4863   1633   139    37   28.45%
      2     5760  12411   6213   504    89   50.31%
      3     1485   5559  13144  1880   329   41.31%
      4      176    888   4094  2592   625   69.09%
      5       59    204   1008   856  1388   60.51%
    • RANDOM FORESTS IN PYTHON: CONFUSION MATRIX
               1      2      3     4     5   class error
      1    16930   4682   1758   129    53   28.12%
      2     5517  12369   6475   506   106   50.47%
      3     1500   5367  13448  1737   275   39.77%
      4      186    791   4171  2598   561   68.73%
      5       48    161    999   880  1466   58.75%
    • (RE)FORMULATE A HYPOTHESIS STEP 3
    • FEATURE SELECTION: PRINCIPAL COMPONENT ANALYSIS - WORDS
      Determine which features account for most of the variance.
      FEATURE        PC1     PC2
      Distinctive    0.20   -0.059
      Authentic      0.19   -0.046
      Talented       0.19   -0.083
      Credible       0.19   -0.084
      Stylish        0.18   -0.094
      Annoying      -0.06   -0.065
      Intrusive     -0.06   -0.058
      Irrelevant    -0.059  -0.087
      Uninspired    -0.056  -0.092
      Noisy         -0.053  -0.13
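    A sketch of this PCA step with scikit-learn; the adjective column list and the scaling choice are assumptions, not taken from the talk.

      import pandas as pd
      from sklearn.decomposition import PCA
      from sklearn.preprocessing import StandardScaler

      # Placeholder: list the adjective columns of the WORDS dataframe here
      # (only a few of the 82 are shown).
      word_cols = ["Uninspired", "Aggressive", "Unattractive", "Boring", "Cheap"]

      scaled = StandardScaler().fit_transform(words[word_cols].fillna(0))

      pca = PCA(n_components=2)
      pca.fit(scaled)

      # Loadings of each adjective on the first two principal components.
      loadings = pd.DataFrame(pca.components_.T, index=word_cols, columns=["PC1", "PC2"])
      print(loadings.sort_values("PC1", ascending=False).head(10))
      print("Explained variance ratio:", pca.explained_variance_ratio_)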
    • FEATURE SELECTION: Make a simple model choosing meaningful variables. WORDS: Annoying, Depressing, Boring, Catchy, Talented, Distinctive, Beautiful, Superstar, Soulful and Popular. QUESTIONS: Q4, Q5, Q6, Q9, Q10, Q11 and Q19. • Running time in R ~ 15 min. • RMSE = 14.791 / Public leaderboard 13.076
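    A sketch of the reduced model, keeping only the variables listed above; the column names are assumed to match the WORDS and USERS files, and X_train / y_train come from the earlier regression sketch.

      from sklearn.ensemble import RandomForestRegressor

      # Hand-picked subset of WORDS adjectives and USERS questions from the slide above.
      selected = ["Annoying", "Depressing", "Boring", "Catchy", "Talented",
                  "Distinctive", "Beautiful", "Superstar", "Soulful", "Popular",
                  "Q4", "Q5", "Q6", "Q9", "Q10", "Q11", "Q19"]

      rf_small = RandomForestRegressor(n_estimators=60, max_features="sqrt", random_state=0)
      rf_small.fit(X_train[selected], y_train)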
    • RESULTS (comparison panels: FULL MODEL / REDUCED MODEL)
    • COMMENTS It is well known that Random Forests have been shown to be biased towards highly correlated variables. Using conditional inference trees ameliorates that bias (see the party PACKAGE in R). SCIKIT LEARN’s implementation has an n_jobs parameter to parallelise training; for a similar feature in R, see the bigRF package.
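    In scikit-learn the parallelism is a one-argument change; a minimal sketch:

      from sklearn.ensemble import RandomForestRegressor

      # n_jobs=-1 builds the individual trees in parallel on all available CPU cores.
      rf = RandomForestRegressor(n_estimators=60, max_features="sqrt", n_jobs=-1)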
    • CONCLUDING REMARKS
    • CONCLUDING REMARKS We solved a problem using both R and PYTHON (via SCIKIT LEARN). The constraints of a given problem will differ and will dictate the implementation of choice. PICK THE TOOL THAT IS BEST FOR THE JOB. WORTH LEARNING ABOUT BOTH IMPLEMENTATIONS: both the R and PYTHON (SCIKIT LEARN) implementations provide functions that let the user explore the resulting model and its performance.
    • CONCLUDING REMARKS RANDOM FORESTS ARE GREAT: they give great accuracy, can handle many features, do not require cross-validation and even estimate which variables are important. KEEP AN EYE OUT FOR INTERESTING DATA: having data that you are interested in leads to more interesting questions and reasons to explore new methods and add a new trick to your bag.
    • CONCLUDING REMARKS THE EMI DATASET IS GREAT TO TEST RIDE: the set has a lot of behavioural information on a subject everyone has some intuition about. TO DO’s / WILL IT PYTHON? Prediction using SVMs and matrix factorisation techniques, a full factor analysis, etc.
    • THANKS!