Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions




                                 Machine Learning with R

                                                Joshua Reich

                                                  josh@i2pi.com


                                                 April 2, 2009
Outline    Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions




          Why R?

          How to find out about stuff?

          What is Machine Learning?

          Show me the money

          Learn More

          Questions
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                                          ML Alternatives




           • Matlab
           • Weka
           • Python
           • Stand alone (e.g. Vowpal Wabbit)
Outline    Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions




          ”The best thing about R is that it was developed by statisticians.
          The worst thing about R is that it was developed by statisticians.”
                       –Bo Cowgill, Google (at SF R Meetup)
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                                                   Why R?




           • Working with the CLI - iterative discovery
           • Integrated graphics
           • Community supported packages (CRAN)
           • ODBC Integration
           • You already use it
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                             How to find out about stuff?


           • ?function
           • help.search("search string")
           • RSiteSearch("search string")
           • http://rseek.org/
           • names(object) or attributes(object)
           • > kmeans
              function (x, centers, iter.max = 10, nstart = 1,
              algorithm = c("Hartigan-Wong",
              "Lloyd", "Forgy", "MacQueen"))
              ...
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                               What is Machine Learning?




                                 Statistics                  Machine Learning
                             Probability Model                Learning Model
                               Observations                    Observations
                                Estimation                       Training
                                   MLE                         Optimization
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                                 Semantics v. Pragmatics

           • For most statistical models there are either closed form or
              quick numerical approximations for finding model properties -
              e.g., confidence intervals. Assuming you believe that your
              data generating process is accurately captured by your model,
              then you can make direct statements about unseen events.
           • Machine learning is a close cousin to non-parametric
              techniques and relies on training/testing/validation cycles,
              bootstrapping and cross-validation to determine measures of
              reliability.

            But invariably, simple models and a lot of data trump more
                        elaborate models based on less data
                              –Halevy, Norvig & Pereira.
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                                            Inductive Bias

              In the days when Sussman was a novice, Minsky once
                  came to him as he sat hacking at the PDP-6.

              ”What are you doing?”, asked Minsky.

              ”I am training a randomly wired neural net to play
                  Tic-tac-toe”, Sussman replied.

              ”Why is the net wired randomly?”, asked Minsky.

              ”I do not want it to have any preconceptions of how to
                  play”, Sussman said.

              Minsky then shut his eyes.
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                                            Inductive Bias




              ”Why do you close your eyes?” Sussman asked his
                teacher.

              ”So that the room will be empty.”
Outline   Why R?   How to find out about stuff?    What is Machine Learning?   Show me the money   Learn More   Questions



                               What is Machine Learning?


           • Regression vs. Classification

                                                         Y ∈ Rp
                                                             vs.
                                                Y ∈ {Y1 , Y2 , . . . , YN }
           • Supervised vs. Unsupervised Learning

                                                       Y = f (X )
                                                             vs.
                                                    X1 , X2 , . . . , XN
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                   General Supervised Learning Framework
                          (Ch 7 of ElemStatLearn)




           • Training / Validation / Test
           • Variance - Bias Decomposition: Overfitting
           • Feature Selection / Regularization
           • Bootstrapping / Cross-Validation
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                               What we will walk through



           • K-Means clustering: kmeans()
           • K-Nearest Neighbours: knn()
           • Regression Trees: rpart()
           • Improving trees with PCA: princomp()
           • Linear Discriminant Analysis: lda()
           • Support Vector Machines: svm()
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                                                The Problem


                                     A
                          15
                                       A
                                                A
                                     A A A                         C
                                   AA         AA           C            CC       C C
                                                              CCC CCC C C
                                                   A A CC C C CCC CC C C C
                                                              CC C CC C
                                                    C CCCC C CCC C C C
                                                                         C
                                A           C A C CC CCCCCC CCCCCCCC
                                      AA AAAA CCCCC ACCCCCC CCCCCCCCC C
                                                                      C
                                                                  C C C
                          10



                                                         C CCC CC CC C
                                                               C C C
                                                              CC C
                                                                      C
                                       AA A AA CCCCCCCCCCCCCCCCCCCCCC C CC
                                              A A C CCCCCCCCCCCCCCCC CCC C
                                                      C C CCCCCCCCCCC C CC C
                                        A AC C CCC CCCCCCCCCCCC CCCC C
                                                              CC C CCCC C C C
                                                      CCCCCCCC CCCC C C CC
                                                        CCC CCCC CCCC C
                                                               C
                                       A       C CA A ACCC C CCCCC
                                   A AAC A A ACC CCCCC CCCCCCC C C C C C
                                                   A CC CC CCC C CC C C
                                                   C         C
                                                            C CC
                                        A A A C CCCCACACCCCCCCCCC C
                                                   C CCCCC CC C C C
                                                             C     C
                                       A C A AAACCCACCCCCCCCCCCCCCC C
                                                A A CC CCC CCCCCCCC CC
                                                A A CCCCC CCCCCCCC CC
                                                       A CC C C CC
                                  A A AAA ACCACCCCCCCCCCCCCCCCCCCCCC C C
                                                            C CC C
                                       A A C A A AACCCCCCCCCCCCC CCCCCC
                                       C A A A AACCCC CACCCCCCCC CC C CC
                                        A                                   C         C  C
                                                  ACC C CCCCCCCC CC C
                                                  CA CA CC C C CCCC
                                                            C CC
                                          C AAACAACCCCCCC CCCCCC C
                                                    C CC A C C C
                                   A A A AAAACAACAACCCACCCCCCCCCCCCC C
                                           A A A ACA A CACC C C CC C
                                       A A ACACCCCA CCACCCC CCCCC C C
                                                 A
                                                    C
                                           A AAAACAACCCCCCCC CCCC
                                                  AAA CCCC C CC CC C
                                                          AC CC C C C                 C
                                                  CA A C
                                           C A AAAAC A A C AC
                                                AA ACA AA C
                                    A A AA CAAAA A AACCA AAC
                                              AAAAA C C AC A
                                           A AA A A A                    CC
                          5




                                    A AA C A AAAAAAAA AA C
                                             A AA          AA
                                                           C
                                             A AAAAA AAAA A A
                                                                        C              B
                                                   A A AA
                                              A AAAA AAA
                                    A A AAAAAAAAAAAAAAAAAA A
                                            AAAA AAA AAAA A                      B BBBB
                                                                                BBBBBBB
                                                                              BBBBBBBBBBB
                                                                                 BB
                                             AAAAAAA AAAAA A A
                                                    AA A
                                                   AAAAAAA A
                                                   AAAAAAA A A
                                                     AA                     B BBBBBBB BB B
                                                                                 B BB
                                                                       A B B BBBBBBBBBBBB B
                                                                                BBBB B B B
                                                                                BBBBB B
                                                                                  B
                                                AAAAAAAAAA A A                BBB BBBBB BBB
                                         A A AAAAAAAAAAAAAAA A A B BBBBBBBBBBBBBBBB
                                               AAAAA AAAAAAAA                    B BB B
                     y




                                                     A AAA A A A
                                                      A AAA A A
                                       A A AAA AAAA A A AA A
                                                AAAAAAAAAA AA
                                                     A
                                          A AA AAAAAAAA AAA
                                                    AAA AAAA A
                                                           A AA A               BBBBBB B
                                                                         B BBBBBBBBBBBBB B
                                                                              BBBBBBBBBB
                                                                          B BBBBBBBBBBBBBB
                                                                              BBBBBBB B
                                                                          B B BBBBBBBBBBBBB
                                                                                   BB B
                                                                                     B
                                                                             BBBBBBBBBBBBB
                                                                              BBBBBBBBBB
                                                                               BBBBBBB B
                                         A AAAAAAAAAAAAAA A BBBBBBBBBBBBBBBB B B
                                                                               BBBBBBBBBB
                                                                                 BB BBB B
                                                                                 BBBBBB B B
                                                                                      BB B
                                                    A A AA A A
                                                     A AA A
                                              A A A AAAAAAAAA A           BBBBB BBBBBBBB
                                                                               B BBB
                                                                            BB B B BB
                                                                           BBBBBB BBB
                                                                               BBB B
                                                                            B B BBBB BBBB
                                                                         B BBBBBBBBBBB BB B
                                              A A A AAAAAAAA A A A B BBBBBBBBBBBBB B B
                                                     A A AA AA A            B B BBBBBBBBB
                                                                             B B BB BBB B
                                                                             BB BBBBBBBB
                          0




                                             AAA AAAAAAAAAAAAA                    BBB
                                               A A A AAAAA AA AA A A ABBBB BBBBBB BB B
                                                    A AAA A AA A
                                                          A
                                                    A AA AA A A
                                         A AAAAAA AAAAAAA A AAAA
                                                         AAA A
                                                          A                       BB B
                                                                             BB BB BB B
                                                                                    B
                                                                                    B
                                                  A AAAA AA AAAA AA
                                                          AA A A A
                                                          A A
                                               AA AA AAA A AAAAA A A BB B B B
                                                     AAAAAA AAAA A
                                                           AA A
                                                                                    B
                                                A      A AAAA AA A AA A
                                                        A A A    A
                                                       AAAAAA AAA AAA
                                                                 A
                                                            AA AA A A
                                                      A A A AA A      A
                                                A A AAAAAAA AA A A A
                                                              A
                                                         AAAAA A AAA A A
                                                             AA
                                                    AA AAAAA AAAAA A A A
                                                            A AA AA
                                                             AA A A
                                                                A A
                          −5




                                                           A A AAA
                                                            AA A
                                                        AAAAA AAAA AA A A
                                                         AAA AAAA A
                                                                AAA A
                                                          A A A AAA A A
                                                                    A
                                                            A AA A A AA A AA
                                                                  A A A         A
                                                              AA AA AAAAA
                                                                    A AA
                          −10




                                                        A                    A     A
                                                                   AA
                                                                    A        A
                                                                          A

                                −5                0                 5                10

                                                             x
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions



                                   Machine Learning More


           • MachineLearning CRAN view
              http://cran.r-project.org/web/views/MachineLearning.html
           • The caret package is a good one.
           • Elements of Statistical Learning
              http://www-stat.stanford.edu/ tibs/ElemStatLearn/
           • Machine Learning (Mitchell)
              http://www.cs.cmu.edu/ tom/mlbook.html
           • Video Lectures
              http://videolectures.net/
Outline   Why R?   How to find out about stuff?   What is Machine Learning?   Show me the money   Learn More   Questions




                                                  Questions?

Machine Learning with R

  • 1.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions Machine Learning with R Joshua Reich josh@i2pi.com April 2, 2009
  • 2.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
  • 3.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions ML Alternatives • Matlab • Weka • Python • Stand alone (e.g. Vowpal Wabbit)
  • 4.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions ”The best thing about R is that it was developed by statisticians. The worst thing about R is that it was developed by statisticians.” –Bo Cowgill, Google (at SF R Meetup)
  • 5.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions Why R? • Working with the CLI - iterative discovery • Integrated graphics • Community supported packages (CRAN) • ODBC Integration • You already use it
  • 6.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions How to find out about stuff? • ?function • help.search("search string") • RSiteSearch("search string") • http://rseek.org/ • names(object) or attributes(object) • > kmeans function (x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")) ...
  • 7.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions What is Machine Learning? Statistics Machine Learning Probability Model Learning Model Observations Observations Estimation Training MLE Optimization
  • 8.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions Semantics v. Pragmatics • For most statistical models there are either closed form or quick numerical approximations for finding model properties - e.g., confidence intervals. Assuming you believe that your data generating process is accurately captured by your model, then you can make direct statements about unseen events. • Machine learning is a close cousin to non-parametric techniques and relies on training/testing/validation cycles, bootstrapping and cross-validation to determine measures of reliability. But invariably, simple models and a lot of data trump more elaborate models based on less data –Halevy, Norvig & Pereira.
  • 9.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions Inductive Bias In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. ”What are you doing?”, asked Minsky. ”I am training a randomly wired neural net to play Tic-tac-toe”, Sussman replied. ”Why is the net wired randomly?”, asked Minsky. ”I do not want it to have any preconceptions of how to play”, Sussman said. Minsky then shut his eyes.
  • 10.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions Inductive Bias ”Why do you close your eyes?” Sussman asked his teacher. ”So that the room will be empty.”
  • 11.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions What is Machine Learning? • Regression vs. Classification Y ∈ Rp vs. Y ∈ {Y1 , Y2 , . . . , YN } • Supervised vs. Unsupervised Learning Y = f (X ) vs. X1 , X2 , . . . , XN
  • 12.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions General Supervised Learning Framework (Ch 7 of ElemStatLearn) • Training / Validation / Test • Variance - Bias Decomposition: Overfitting • Feature Selection / Regularization • Bootstrapping / Cross-Validation
  • 13.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions What we will walk through • K-Means clustering: kmeans() • K-Nearest Neighbours: knn() • Regression Trees: rpart() • Improving trees with PCA: princomp() • Linear Discriminant Analysis: lda() • Support Vector Machines: svm()
  • 14.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions The Problem A 15 A A A A A C AA AA C CC C C CCC CCC C C A A CC C C CCC CC C C C CC C CC C C CCCC C CCC C C C C A C A C CC CCCCCC CCCCCCCC AA AAAA CCCCC ACCCCCC CCCCCCCCC C C C C C 10 C CCC CC CC C C C C CC C C AA A AA CCCCCCCCCCCCCCCCCCCCCC C CC A A C CCCCCCCCCCCCCCCC CCC C C C CCCCCCCCCCC C CC C A AC C CCC CCCCCCCCCCCC CCCC C CC C CCCC C C C CCCCCCCC CCCC C C CC CCC CCCC CCCC C C A C CA A ACCC C CCCCC A AAC A A ACC CCCCC CCCCCCC C C C C C A CC CC CCC C CC C C C C C CC A A A C CCCCACACCCCCCCCCC C C CCCCC CC C C C C C A C A AAACCCACCCCCCCCCCCCCCC C A A CC CCC CCCCCCCC CC A A CCCCC CCCCCCCC CC A CC C C CC A A AAA ACCACCCCCCCCCCCCCCCCCCCCCC C C C CC C A A C A A AACCCCCCCCCCCCC CCCCCC C A A A AACCCC CACCCCCCCC CC C CC A C C C ACC C CCCCCCCC CC C CA CA CC C C CCCC C CC C AAACAACCCCCCC CCCCCC C C CC A C C C A A A AAAACAACAACCCACCCCCCCCCCCCC C A A A ACA A CACC C C CC C A A ACACCCCA CCACCCC CCCCC C C A C A AAAACAACCCCCCCC CCCC AAA CCCC C CC CC C AC CC C C C C CA A C C A AAAAC A A C AC AA ACA AA C A A AA CAAAA A AACCA AAC AAAAA C C AC A A AA A A A CC 5 A AA C A AAAAAAAA AA C A AA AA C A AAAAA AAAA A A C B A A AA A AAAA AAA A A AAAAAAAAAAAAAAAAAA A AAAA AAA AAAA A B BBBB BBBBBBB BBBBBBBBBBB BB AAAAAAA AAAAA A A AA A AAAAAAA A AAAAAAA A A AA B BBBBBBB BB B B BB A B B BBBBBBBBBBBB B BBBB B B B BBBBB B B AAAAAAAAAA A A BBB BBBBB BBB A A AAAAAAAAAAAAAAA A A B BBBBBBBBBBBBBBBB AAAAA AAAAAAAA B BB B y A AAA A A A A AAA A A A A AAA AAAA A A AA A AAAAAAAAAA AA A A AA AAAAAAAA AAA AAA AAAA A A AA A BBBBBB B B BBBBBBBBBBBBB B BBBBBBBBBB B BBBBBBBBBBBBBB BBBBBBB B B B BBBBBBBBBBBBB BB B B BBBBBBBBBBBBB BBBBBBBBBB BBBBBBB B A AAAAAAAAAAAAAA A BBBBBBBBBBBBBBBB B B BBBBBBBBBB BB BBB B BBBBBB B B BB B A A AA A A A AA A A A A AAAAAAAAA A BBBBB BBBBBBBB B BBB BB B B BB BBBBBB BBB BBB B B B BBBB BBBB B BBBBBBBBBBB BB B A A A AAAAAAAA A A A B BBBBBBBBBBBBB B B A A AA AA A B B BBBBBBBBB B B BB BBB B BB BBBBBBBB 0 AAA AAAAAAAAAAAAA BBB A A A AAAAA AA AA A A ABBBB BBBBBB BB B A AAA A AA A A A AA AA A A A AAAAAA AAAAAAA A AAAA AAA A A BB B BB BB BB B B B A AAAA AA AAAA AA AA A A A A A AA AA AAA A AAAAA A A BB B B B AAAAAA AAAA A AA A B A A AAAA AA A AA A A A A A AAAAAA AAA AAA A AA AA A A A A A AA A A A A AAAAAAA AA A A A A AAAAA A AAA A A AA AA AAAAA AAAAA A A A A AA AA AA A A A A −5 A A AAA AA A AAAAA AAAA AA A A AAA AAAA A AAA A A A A AAA A A A A AA A A AA A AA A A A A AA AA AAAAA A AA −10 A A A AA A A A −5 0 5 10 x
  • 15.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions Machine Learning More • MachineLearning CRAN view http://cran.r-project.org/web/views/MachineLearning.html • The caret package is a good one. • Elements of Statistical Learning http://www-stat.stanford.edu/ tibs/ElemStatLearn/ • Machine Learning (Mitchell) http://www.cs.cmu.edu/ tom/mlbook.html • Video Lectures http://videolectures.net/
  • 16.
    Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions Questions?