An Analytics Toolkit Tour


A quick tour and overview of toolkits in R, Python and C++ for analytics applications.

Published in: Technology

An Analytics Toolkit Tour

  1. A Programming Language/Toolkit Tour
       Rory Winston
       Monday, 27 February 2012
  2. Agenda
       • A quick overview and tour of:
           • R
           • Python
           • Java/C++
       • For data analysis/analytics applications
       • Comparison
  3. Purpose
       • To give a feeling for the relative advantages and disadvantages of each approach
       • Understand the tradeoffs involved
       • See some demos
  4. R
       • R is a domain-specific language (DSL) for statistics and data analysis
       • Functional-based language
       • Based on an earlier language called S
       • Core engine written in C
       • Open-source
       • Popularity has exploded in the last few years
       • Some commercial support
  5. Pros
       • R is the de facto standard in statistical analysis tooling
       • Incredible range of functionality via contributed libraries
       • Powerful interactive analysis environment and visualization tools
       • Large number of built-in datasets
       • Cross-platform
       • Broad user community
       • Wide range of resources (books, tutorials, papers) available
  6. Cons
       • Performance limitations
       • Single-threaded interpreter
       • Language limitations and quirks
       • Initial learning curve may be steep
       • R gives you a lot of power, but assumes you know how to use it!
  7. Language Features
       • R is vectorized:
           • Loops are not required for many operations (and are actually discouraged)
       • R is functional:
           • Functions can be passed around like other variables
       • R integrates with a BLAS:
           • High-performance numerical operations
  8. Demo
       • Console R
       • R GUI
       • RStudio
  9. Tips
       • Learn how to use ggplot2 (http://had.co.nz/ggplot2/)
       • Consider using RStudio (http://www.rstudio.org)
  10. Python
       • Initially developed in the late 1980s
       • Object-oriented / functional support
       • Open-source
       • Initially popular in web applications, now popular across a number of domains
  11. Pros
       • Very readable, simple and clear syntax
       • Well-supported (many libraries and extensions)
       • Easy to integrate with other languages (e.g. C)
       • Very efficient environment to develop in
  12. Cons
       • Language syntax is not universally popular
       • In terms of analytics, many libraries are still slightly immature
       • Performance can be lacking (although there are many options to tune it)
       • Interpreter is effectively single-threaded
  13. Python + Analytics
       • There are a number of excellent libraries available for analytics applications:
           • NumPy + SciPy
           • matplotlib
           • pandas
           • scikits
       • Some packages (e.g. pandas) are designed to replicate the ‘feel’ and functionality of analysis operations in R
  14. NumPy + SciPy
       • Using NumPy + SciPy + matplotlib provides an experience similar to using an interactive R/Matlab environment
       • Supports vectorization and BLAS integration
       • Add ipython for more goodness
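
As a rough illustration of the vectorization point, the sketch below computes the same sum of squares twice: once with an explicit Python loop and once as a whole-array NumPy expression. The function names are made up for this example, and whether the dot product actually reaches a BLAS routine depends on how NumPy was built, so treat that comment as an assumption.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Explicit Python loop: the interpreter handles one element at a time."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorized(values):
    """Whole-array expression: the loop runs inside NumPy's compiled code."""
    arr = np.asarray(values, dtype=float)
    # Assumption: for float arrays, np.dot is usually handed to the linked BLAS.
    return float(np.dot(arr, arr))


x = np.arange(1, 11)  # 1, 2, ..., 10
loop_result = sum_of_squares_loop(x)
vec_result = sum_of_squares_vectorized(x)
print(loop_result, vec_result)
```

Both calls give the same answer; the vectorized form is the one that tends to scale to large arrays.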
  15. Tips
       • Use ipython!
       • Check out:
           • http://pandas.pydata.org/
           • http://statsmodels.sourceforge.net/
           • http://scikit-learn.org
  16. Comparisons

        R                                   Python (NumPy)
        x <- 1:10                           x = arange(1,11)
        x <- seq(1, 2, .2)                  x = arange(1,2,.2)
        x <- seq(1,2, length.out=15)        x = linspace(1,2,15)
        M <- matrix(1:100, 10, 10)          M = arange(1,101).reshape(10,10)
        x[x < 1.5]                          x[ x < 1.5 ]
        X <- cbind(a,b)                     X = column_stack((a,b))
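
The Python side of this comparison can be run as written once NumPy is imported. The sketch below does that, with comments on two places where the R and NumPy results are not identical. The arrays `a` and `b` are hypothetical inputs added for illustration, and `column_stack` is the NumPy function that plays the role of `cbind`.

```python
import numpy as np

x = np.arange(1, 11)                    # R: x <- 1:10
y = np.arange(1, 2, 0.2)                # R: seq(1, 2, .2), but arange stops before 2
z = np.linspace(1, 2, 15)               # R: seq(1, 2, length.out=15)

# R fills matrix(1:100, 10, 10) column by column; NumPy reshapes row by row.
M = np.arange(1, 101).reshape(10, 10)   # row-major (NumPy default)
M_like_r = np.arange(1, 101).reshape(10, 10, order="F")  # column-major, as in R

small = z[z < 1.5]                      # R: x[x < 1.5] (boolean mask indexing)

a = np.array([1, 2, 3])                 # hypothetical example vectors
b = np.array([4, 5, 6])
X = np.column_stack((a, b))             # R: X <- cbind(a, b)

print(X)
```

The two differences worth remembering are the excluded endpoint in `arange` and the row-major default in `reshape`.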
  17. Java/C++
       • The ultimate in power/flexibility
       • Also the ultimate in development time and effort
       • Let's just look at C++ briefly
  18. C++
       • Old but still very popular
       • Just had a revamp (C++11, was C++0x)
       • Mostly competes with Java on the server side
       • Everything else (JVM, R, Python) is written in C/C++
       • Both R and Python provide easy ways to interface with C/C++ code
       • This is used a lot
  19. Pros
       • Flexibility
       • Lots of libraries available
       • Control of resources for performance-critical apps (e.g. memory)
       • C++11 adds a lot of nice stuff (finally)
  20. Cons
       • Lots of effort
       • Lots of hidden traps for the unwary
       • Initial experience may be a large productivity hit
       • Effort in porting between systems
       • There is “modern” C++ (which is actually pretty nice) and everything else (which isn’t so nice)
  21. Examples
       • Let's look at a sample library
       • This one is called Armadillo (http://arma.sourceforge.net/)
       • Developed in Australia (NICTA / Univ. Queensland)
       • Contains functions for numerical applications and some statistical functions
       • Modern, efficient use of C++
  22. Armadillo
       • Armadillo supports vectorized operations
       • Also integrates with a BLAS
       • Example (see console)
  23. Simple Example
       • Using the Box-Jenkins airline passenger data
       • Classic dataset
       • 12 years of monthly airline passenger observations (144 in all)
  24. Passenger Dataset
  25. Linear Model
       • We will use a simple linear model (explains 85% of the variance of this data)

       $$Ax = b, \qquad A = \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ 1 & t_3 \\ \vdots & \vdots \end{pmatrix}$$
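
One way to solve a system of this shape in the least-squares sense is `numpy.linalg.lstsq`. The sketch below is not the demo from the talk: it uses a synthetic monthly series (a made-up trend plus a made-up seasonal term) as a stand-in for the passenger data, so the fitted numbers say nothing about the real dataset. All variable names and constants are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in for 144 monthly observations (NOT the airline data).
t = np.arange(1, 145, dtype=float)
true_intercept, true_slope = 90.0, 2.5             # hypothetical trend
seasonal = 20.0 * np.sin(2.0 * np.pi * t / 12.0)   # hypothetical yearly cycle
b = true_intercept + true_slope * t + seasonal

# Design matrix A with rows (1, t_i), matching the linear model Ax = b.
A = np.column_stack((np.ones_like(t), t))

# Least-squares solution x = (intercept, slope).
x, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)
intercept_hat, slope_hat = x

# Share of variance explained by the straight line (R squared).
fitted = A @ x
ss_res = np.sum((b - fitted) ** 2)
ss_tot = np.sum((b - b.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(intercept_hat, slope_hat, r_squared)
```

The seasonal term is what the straight line cannot capture, which is why the explained variance stays below 100% even on this idealized series.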
  26. Conclusion
       • Use the toolkit that’s most appropriate for you
       • A common approach is to use e.g. R for prototyping and model selection and (if required) switch to a higher-performance implementation for production
       • If you have time, learn all of them!
  27. Language Map
       [Figure: R, Python, Java, Octave, Ruby and C/C++ placed on a map running from "Dynamic Typing" and "Interactivity" to "Static Typing" and "Performance, complexity"]
  28. Resources
