A Programming Language/Toolkit Tour
Rory Winston




Monday, 27 February 2012
Agenda
                   • A quick overview and tour of:
                    • R
                    • Python
                    • Java/C++
                   • For data analysis/analytics applications
                   • Comparison
Purpose

                   • To give a feeling for the relative advantages
                           and disadvantages of each approach
                   • Understand the tradeoffs involved
                   • See some demos


R
                   •       R is a domain-specific language (DSL) for statistics
                           and data analysis
                   •       Functional-style language
                   •       Based on an earlier language called S
                   •       Core engine written in C
                   •       Open-source
                   •       Popularity has exploded in the last few years
                   •       Some commercial support


Pros
                   •       R is the de facto standard in statistical analysis tooling
                   •       Incredible range of functionality via contributed libraries
                   •       Powerful interactive analysis environment and
                           visualization tools
                   •       Large number of built-in datasets
                   •       Cross-platform
                   •       Broad user community
                   •       Wide range of resources (books, tutorials, papers)
                           available



Cons
                   • Performance limitations
                   • Single-threaded interpreter
                   • Language limitations and quirks
                   • Initial learning curve may be steep
                   • R gives you a lot of power, but assumes
                           you know how to use it!


Language Features
                   •       R is vectorized:
                           •   Loops are not required for many operations
                               (and are actually discouraged)
                   •       R is functional:
                           •   Functions can be passed around like other
                               variables
                   •       R integrates with a BLAS:
                           •   high-performance numerical operations


Demo

                   • Console R
                   • R GUI
                   • RStudio


Tips

                   • Learn how to use ggplot2 (http://had.co.nz/ggplot2/)
                   • Consider using RStudio (http://www.rstudio.org)




Python

                   • Initially developed in the late 1980s
                   • Object-oriented / functional support
                   • Open-source
                   • Initially popular in web applications, now
                           popular across a number of domains



Pros
                   • Very readable, simple and clear syntax
                   • Well-supported (many libraries and
                           extensions)
                   • Easy to integrate with other languages (e.g.
                           C)
                   • Very efficient environment to develop in

Cons
                   • Language syntax is not universally popular
                   • In terms of analytics, many libraries are still
                           slightly immature
                   • Performance can be lacking (although there
                           are many options to tune it)
                   • Interpreter is effectively single-threaded

Python + Analytics
                   •       There are a number of excellent libraries available
                           for analytics applications:
                           •   NumPy + SciPy
                           •   matplotlib
                           •   pandas
                           •   scikits
                   •       Some packages (e.g. pandas) are designed to replicate
                           the ‘feel’ and functionality of analysis operations in R
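
As a rough illustration of that R-like feel, here is a minimal pandas sketch. The data frame is made up for this example and the R functions named in the comments are only loose analogues.

```python
import pandas as pd

# Made-up illustrative data, not taken from the talk.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 5.0],
})

# Descriptive statistics for one column, similar in spirit to R's summary().
summary = df["value"].describe()

# Per-group means, similar in spirit to R's tapply() or aggregate().
means = df.groupby("group")["value"].mean()
```

The `DataFrame` plays roughly the role of R's `data.frame`: named columns of equal length that can be sliced, grouped and summarized.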


NumPy + SciPy

                   • Using NumPy + SciPy + matplotlib provides
                           an experience similar to using an
                           interactive R/Matlab environment
                   • Supports vectorization and BLAS
                           integration
                   • Add ipython for more goodness
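
A small sketch of the BLAS point, assuming a NumPy build that is linked against a BLAS (most binary distributions are). The matrix size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Matrix product of two float arrays; NumPy hands this to the BLAS it was
# built against when one is available.
C = A @ B

# One entry computed by hand in pure Python, as a sanity check only.
c00 = sum(A[0, k] * B[k, 0] for k in range(n))
```

The whole product is a single expression, while the hand-written sum covers only one of the 40,000 entries; that gap is what vectorization plus a BLAS buys.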

Tips

                   • Use ipython!
                   • Check out:
                    • http://pandas.pydata.org/
                    • http://statsmodels.sourceforge.net/
                    • http://scikit-learn.org

Comparisons
                       R                                Python (NumPy)
                       x <- 1:10                        x = arange(1,11)
                       x <- seq(1, 2, .2)               x = arange(1,2,.2)
                       x <- seq(1,2,length.out=15)      x = linspace(1,2,15)
                       M <- matrix(1:100, 10, 10)       M = arange(1,101).reshape(10,10)
                       x[ x < 1.5 ]                     x[x < 1.5]
                       X <- cbind(a,b)                  X = column_stack((a,b))
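
The NumPy side of this comparison can be run as the sketch below. Two differences from R are worth knowing and are marked in the comments: `arange` leaves out the stop value, and `reshape` fills row by row whereas R's `matrix()` fills column by column. NumPy's name for column binding is `column_stack`. The arrays `a` and `b` are made-up inputs.

```python
import numpy as np

x1 = np.arange(1, 11)        # R: x <- 1:10
x2 = np.arange(1, 2, 0.2)    # R: seq(1, 2, .2); arange leaves out the stop
                             # value, and float steps can be off by one
x3 = np.linspace(1, 2, 15)   # R: seq(1, 2, length.out=15); both ends included

# R: matrix(1:100, 10, 10) fills column by column.
M_rows = np.arange(1, 101).reshape(10, 10)             # row-major fill
M_cols = np.arange(1, 101).reshape(10, 10, order="F")  # column-major, as in R

small = x3[x3 < 1.5]         # R: x[x < 1.5], boolean-mask indexing

a = np.array([1, 2, 3])      # made-up inputs for the cbind example
b = np.array([4, 5, 6])
X = np.column_stack((a, b))  # R: X <- cbind(a, b)
```

For non-integer steps, `linspace` is usually the safer choice because the number of points is stated explicitly.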




Java/C++

                   • The ultimate in power/flexibility
                   • Also the ultimate in development time and
                           effort
                   • Let's just look at C++ briefly


C++
                   •       Old but still very popular
                   •       Just had a revamp (C++11, was C++0x)
                   •       Mostly competes with Java on the server side
                   •       Everything else (JVM, R, Python) is written in C/C++
                   •       Both R and Python provide easy ways to interface
                           with C/C++ code
                           •   This is used a lot


Pros

                   • Flexibility
                   • Lots of libraries available
                   • Control of resources for performance-critical
                           apps (e.g. memory)
                   • C++11 adds a lot of nice stuff (finally)

Cons
                   •       Lots of effort
                   •       Lots of hidden traps for the unwary
                   •       Initial experience may be a large productivity
                           hit
                   •       Effort in porting between systems
                   •       There is “modern” C++ (which is actually
                           pretty nice) and everything else (which isn’t so
                           nice)


Examples
                   •       Let's look at a sample library
                   •       This one is called Armadillo
                           (http://arma.sourceforge.net/)
                   •       Developed in Australia (NICTA / Univ.
                           Queensland)
                   •       Contains functions for numerical applications
                           and some statistical functions
                   •       Modern, efficient use of C++


Armadillo

                   • Armadillo supports vectorized operations
                   • Also integrates with a BLAS
                   • Example (see console)


Simple Example

                   • Using the Box-Jenkins airline passenger
                           data
                   • Classic dataset
                   • 12 years of monthly airline passenger
                           observations (144 in all)



Passenger Dataset




Linear Model
                • We will use a simple linear model (explains
                           85% of the variance of this data)


                   $$
                   A x = b, \qquad
                   A = \begin{pmatrix}
                         1 & t_1 \\
                         1 & t_2 \\
                         1 & t_3 \\
                         \vdots & \vdots
                       \end{pmatrix}
                   $$
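
With more observations than unknowns, the system Ax = b is solved in the least-squares sense. A minimal NumPy sketch follows. It uses a synthetic series as a stand-in: the trend, noise level and seed are assumptions made up for this example, not the airline data, so the fit quality it produces says nothing about the 85% figure above.

```python
import numpy as np

# Synthetic stand-in series (NOT the airline data): linear trend plus noise.
rng = np.random.default_rng(42)
t = np.arange(1, 145, dtype=float)  # 144 monthly time indices
b = 100.0 + 2.5 * t + rng.normal(0.0, 20.0, size=t.size)

# Design matrix A: a column of ones (intercept) next to the time index.
A = np.column_stack((np.ones_like(t), t))

# Least-squares solution of A x = b; x holds [intercept, slope].
x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)

# Share of the variance in b explained by the fitted line (R squared).
fitted = A @ x
r_squared = 1.0 - np.sum((b - fitted) ** 2) / np.sum((b - b.mean()) ** 2)
```

In R the same fit is a one-liner with `lm()`, which is much of the argument for prototyping there.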


Conclusion
                   • Use the toolkit that’s most appropriate for
                           you
                           • A common approach is to use e.g. R for
                             prototyping and model selection and, if
                             required, switch to a higher-performance
                             implementation for production
                   • If you have time, learn all of them!
Language Map
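                   [Figure: language map. Horizontal axis runs from dynamic
                   typing (left) to static typing (right). Axis labels:
                   "Interactivity" (top) and "Performance, complexity"
                   (bottom). Languages placed from left to right: R and
                   Octave; Python and Ruby; Java and C/C++.]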




Resources




