Strategies & Tools for
                       Parallel Machine Learning
                                in Python
                             PyConFR 2012 - Paris




Sunday, September 16, 2012
The Story

                       scikit-learn
                       The Cloud
                       stackoverflow & kaggle


Parts of the Ecosystem
             Single Machine with Multiple Cores ♥♥♥


                 multiprocessing

             Multiple Machines with Multiple Cores ♥♥♥




The Problem
                   Big CPU (Supercomputers - MPI)
                             Simulating stuff from models
                   Big Data (Google scale - MapReduce)
                             Counting stuff in logs / Indexing the Web
                   Machine Learning?
                                     often somewhere in the middle


Parallel ML Use Cases

                   • Model Evaluation with Cross Validation
                   • Model Selection with Grid Search
                   • Bagging Models: Random Forests
                   • Averaged Models

Embarrassingly Parallel
                      ML Use Cases
                   • Model Evaluation with Cross Validation
                   • Model Selection with Grid Search
                   • Bagging Models: Random Forests
                   • Averaged Models
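The last two bullets can be illustrated with a toy sketch (the models below are hypothetical stand-ins): each model is trained and evaluated independently, so the only coordination needed is the final averaging step.

```python
def average_predictions(models, x):
    # each model can be evaluated on its own core / machine
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)

# three hypothetical models that over- and under-shoot the target
models = [lambda x, w=w: w * x for w in (0.9, 1.0, 1.1)]
print(average_predictions(models, 2.0))  # the individual errors average out
```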

Inter-Process Comm.
                              Use Cases
                   • Model Evaluation with Cross Validation
                   • Model Selection with Grid Search
                   • Bagging Models: Random Forests
                   • Averaged Models

Cross Validation


                                   Input Data
                                Labels to Predict




Cross Validation


                             A      B       C
                             A      B       C




Cross Validation


                             A         B            C
                             A         B            C

                  A + B: subset of the data used to train the model
                  C: held-out test set for evaluation
Cross Validation
                             A         B            C
                             A         B            C

                             A         C            B
                             A         C            B

                             B         C            A
                             B         C            A
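The fold rotation above can be sketched in plain Python (a toy splitter for illustration, not scikit-learn's implementation): split the samples into 3 folds, then let each fold serve once as the held-out test set.

```python
def kfold_indices(n_samples, n_folds=3):
    # partition [0, n_samples) into n_folds contiguous folds
    folds = []
    start = 0
    for i in range(n_folds):
        stop = start + n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        folds.append(list(range(start, stop)))
        start = stop
    # rotate: each fold serves once as the held-out test set
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in kfold_indices(6, 3):
    print(train, test)
```

Each (train, test) split is independent of the others, which is exactly why cross validation parallelizes so easily.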

Model Selection
                                               the hyperparameter hell
                       p1 in [1, 10, 100]
                       p2 in [1e3, 1e4, 1e5]


                       Find the best combination of parameters
                       that maximizes the Cross Validated Score


Grid Search
                                               p1

                                  (1, 1e3)   (10, 1e3) (100, 1e3)



                             p2   (1, 1e4)   (10, 1e4) (100, 1e4)



                                  (1, 1e5)   (10, 1e5) (100, 1e5)
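The grid above is just the Cartesian product of the two parameter ranges; enumerating it makes the embarrassing parallelism explicit (each combination can be cross-validated independently):

```python
from itertools import product

p1_values = [1, 10, 100]
p2_values = [1e3, 1e4, 1e5]

# 9 independent (p1, p2) tasks, one per grid cell
grid = list(product(p1_values, p2_values))
print(len(grid))  # 9
```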


                  (1, 1e3)   (10, 1e3)   (100, 1e3)

                  (1, 1e4)   (10, 1e4)   (100, 1e4)

                  (1, 1e5)   (10, 1e5)   (100, 1e5)




Grid Search:
                             Qualitative Results




Grid Search:
                   Cross Validated Scores




Enough ML theory!

                         Let's go shopping^W
                          parallel computing!

Single Machine
                                   with
                             Multiple Cores
                                  ♥ ♥
                                  ♥ ♥




multiprocessing
    >>> from multiprocessing import Pool
    >>> p = Pool(4)                           # pool of 4 worker processes

    >>> p.map(type, [1, 2., '3'])             # blocking, ordered map
    [int, float, str]

    >>> r = p.map_async(type, [1, 2., '3'])   # non-blocking variant
    >>> r.get()                               # block until the results are ready
    [int, float, str]



multiprocessing

                   • Part of the standard lib
                   • Nice API
                   • Cross-Platform support (even Windows!)
                   • Some support for shared memory
                   • Support for synchronization (Lock)
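The last two bullets can be sketched with a shared integer guarded by a Lock (a toy example, not from the talk):

```python
from multiprocessing import Process, Value, Lock

def add_one(counter, lock, n):
    # each worker increments the shared counter n times
    for _ in range(n):
        with lock:                  # the Lock prevents lost updates
            counter.value += 1

def run_demo(n_workers=4, n_increments=1000):
    counter = Value('i', 0)         # an int living in shared memory
    lock = Lock()
    workers = [Process(target=add_one, args=(counter, lock, n_increments))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value

if __name__ == '__main__':
    print(run_demo())               # 4000: no increment is lost
```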

multiprocessing:
                                  limitations

                   • No docstrings in a stdlib module? WTF?
                   • Tricky / impossible to use the shared
                             memory values with NumPy

                   • Bad support for KeyboardInterrupt


joblib

               • transparent disk-caching of the output values
                       and lazy re-evaluation (memoize pattern)
               • easy, simple parallel computing
               • logging and tracing of the execution

joblib.Parallel
   >>> from os.path import join
   >>> from joblib import Parallel, delayed

   >>> Parallel(2)(
   ...    delayed(join)('/etc', s)
   ...    for s in 'abc')
   ['/etc/a', '/etc/b', '/etc/c']




Usage in scikit-learn

               • Cross Validation
                        cross_val_score(model, X, y, n_jobs=4, cv=3)


               • Grid Search
                       GridSearchCV(model, n_jobs=4, cv=3).fit(X, y)


               • Random Forests
                       RandomForestClassifier(n_jobs=4).fit(X, y)




joblib.Parallel:
                             shared memory
   >>> from joblib import Parallel, delayed
   >>> import numpy as np

   >>> Parallel(2, max_nbytes=1e6)(
   ...    delayed(type)(np.zeros(int(i)))
   ...    for i in [1e4, 1e6])
   [<type 'numpy.ndarray'>, <class 'numpy.core.memmap.memmap'>]

   Array arguments larger than max_nbytes are dumped to disk and
   passed to the workers as shared memory-mapped arrays.




Only 3 allocated datasets, shared by all the concurrent workers
                         performing the grid search:

                             (1, 1e3)    (10, 1e3)    (100, 1e3)
                             (1, 1e4)    (10, 1e4)    (100, 1e4)
                             (1, 1e5)    (10, 1e5)    (100, 1e5)




Multiple Machines
                                    with
                              Multiple Cores
                             ♥ ♥   ♥ ♥   ♥ ♥   ♥ ♥
                             ♥ ♥   ♥ ♥   ♥ ♥   ♥ ♥



MapReduce?

      [                            [                           [
       (k1, v1),      mapper        (k3, v3),     reducer       (k5, v5),
       (k2, v2),    ========>       (k4, v4),   ========>       (k6, v6),
        ...                          ...                         ...
      ]                            ]                           ]

      (several mappers and reducers run in parallel)
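The pattern can be sketched in a few lines of sequential Python (a toy word count; real frameworks run the mappers and reducers in parallel and handle the shuffle over the network):

```python
from collections import defaultdict
from itertools import chain

def mapper(doc):
    return [(word, 1) for word in doc.split()]        # emit (k, v) pairs

def shuffle(pairs):
    groups = defaultdict(list)                        # group values by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values)                           # aggregate per key

docs = ["a b a", "b c"]
mapped = chain.from_iterable(mapper(doc) for doc in docs)
counts = dict(reducer(k, vs) for k, vs in shuffle(mapped).items())
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```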




Why MapReduce does
                    not always work

                       Writes a lot of stuff to disk for failover:
                              inefficient for small to medium problems

                       [(k, v)] mapper [(k, v)] reducer [(k, v)]:
                              how to express data and model params as (k, v) pairs?

                       Complex to leverage for iterative algorithms


IPython.parallel

               • Parallel Processing Library
               • Interactive Exploratory Shell
                                 Multi Core & Distributed



Demo!



The AllReduce Pattern

                   • Compute an aggregate (average) of active
                             node data
                   • Do not clog a single node with incoming
                             data transfer
                   • Traditionally implemented in MPI systems
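The pattern can be sketched in plain Python over a six-node spanning tree (node names and tree shape are made up for illustration; the values match the next slides): the upward pass aggregates (sum, count) pairs, the downward pass broadcasts the result.

```python
def upward(tree, values, node):
    # each node combines its own value with its subtree's (sum, count)
    total, count = values[node], 1
    for child in tree.get(node, []):
        t, c = upward(tree, values, child)
        total, count = total + t, count + c
    return total, count

def allreduce_average(tree, values, root):
    total, count = upward(tree, values, root)   # upward averages
    average = total / count
    return {node: average for node in values}   # downward updates

# hypothetical spanning tree over the six nodes of the next slides
tree = {'root': ['mid', 'side'], 'mid': ['a', 'b'], 'side': ['c']}
values = {'root': 1.0, 'mid': 2.0, 'side': 0.5,
          'a': 1.1, 'b': 3.2, 'c': 0.9}
# upward sum 8.7 over 6 nodes -> every node ends with the average 1.45
print(allreduce_average(tree, values, 'root'))
```

No single node ever receives all six raw values at once: each parent only sees the already-aggregated (sum, count) pairs of its children.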

AllReduce 0/3
                              Initial State
           Value: 1.0            Value: 2.0   Value: 0.5




           Value: 1.1            Value: 3.2   Value: 0.9



AllReduce 1/3
                             Spanning Tree
           Value: 1.0            Value: 2.0   Value: 0.5




           Value: 1.1            Value: 3.2   Value: 0.9



AllReduce 2/3
                             Upward Averages
           Value: 1.0             Value: 2.0   Value: 0.5




           Value: 1.1             Value: 3.2   Value: 0.9
            (1.1, 1)               (3.2, 1)     (0.9, 1)



AllReduce 2/3
                             Upward Averages
           Value: 1.0              Value: 2.0   Value: 0.5
                                    (2.1, 3)     (0.7, 2)




           Value: 1.1               Value: 3.2   Value: 0.9
            (1.1, 1)                (3.2, 1)     (0.9, 1)



AllReduce 2/3
                             Upward Averages
           Value: 1.0             Value: 2.0   Value: 0.5
            (1.45, 6)              (2.1, 3)     (0.7, 2)




           Value: 1.1             Value: 3.2   Value: 0.9
            (1.1, 1)               (3.2, 1)     (0.9, 1)



AllReduce 3/3
                             Downward Updates
          Value: 1.45              Value: 2.0   Value: 0.5
                                    (2.1, 3)     (0.7, 2)




           Value: 1.1              Value: 3.2   Value: 0.9
            (1.1, 1)                (3.2, 1)     (0.9, 1)



AllReduce 3/3
                             Downward Updates
          Value: 1.45             Value: 1.45   Value: 1.45




           Value: 1.1              Value: 3.2    Value: 0.9
            (1.1, 1)                (3.2, 1)      (0.9, 1)



AllReduce 3/3
                             Downward Updates
          Value: 1.45             Value: 1.45   Value: 1.45




          Value: 1.45             Value: 1.45   Value: 1.45



AllReduce Final State
          Value: 1.45         Value: 1.45   Value: 1.45




          Value: 1.45         Value: 1.45   Value: 1.45



AllReduce
                             Implementations
                       http://mpi4py.scipy.org


                       IPC directly w/ IPython.parallel
                       https://github.com/ipython/ipython/tree/
                       master/docs/examples/parallel/interengine



Working in the Cloud

               • Launch a cluster of machines in one command:
                        starcluster start mycluster -b 0.07

                        starcluster sshmaster mycluster


               • Supports spot instances!
               • Ships with blas, atlas, numpy and scipy!
               • IPython plugin!
Perspectives



2012 results by
                             Stanford / Google




The YouTube Neuron




Thanks
                   • http://j.mp/ogrisel-pyconfr-2012
                   • http://scikit-learn.org
                   • http://packages.python.org/joblib
                   • http://ipython.org
                   • http://star.mit.edu/cluster/
                                                         @ogrisel


Goodies


                             Super Ctrl-C to killall nosetests




psutil nosetests killer script in Automator

Bind killnose to Shift-Ctrl-C

Or simpler:

                       ps aux | grep IPython.parallel \
                              | grep -v grep \
                              | awk '{print $2}' \
                              | xargs kill




