A quick tour and overview of toolkits in R, Python and C++ for analytics applications.

- 1. A Programming Language/Toolkit Tour • Rory Winston • Monday, 27 February 2012
- 2. Agenda • A quick overview and tour of: • R • Python • Java/C++ • For data analysis/analytics applications • Comparison
- 3. Purpose • To give a feeling for the relative advantages and disadvantages of each approach • Understand the tradeoffs involved • See some demos
- 4. R • R is a domain-specific language (DSL) for statistics and data analysis • Functional-based language • Based on an earlier language called S • Core engine written in C • Open-source • Popularity has exploded in the last few years • Some commercial support
- 5. Pros • R is the de facto standard in statistical analysis tooling • Incredible range of functionality via contributed libraries • Powerful interactive analysis environment and visualization tools • Large number of built-in datasets • Cross-platform • Broad user community • Wide range of resources (books, tutorials, papers) available
- 6. Cons • Performance limitations • Single-threaded interpreter • Language limitations and quirks • Initial learning curve may be steep • R gives you a lot of power, but assumes you know how to use it!
- 7. Language Features • R is vectorized: • Loops are not required for many operations (and are actually discouraged) • R is functional: • Functions can be passed around like other variables • R integrates with a BLAS: • High-performance numerical operations
- 8. Demo • Console R • R GUI • RStudio
- 9. Tips • Learn how to use ggplot2 (http://had.co.nz/ggplot2/) • Consider using RStudio (http://www.rstudio.org)
- 10. Python • Initially developed in the late 1980s • Object-oriented / functional support • Open-source • Initially popular in web applications, now popular across a number of domains
- 11. Pros • Very readable, simple and clear syntax • Well-supported (many libraries and extensions) • Easy to integrate with other languages (e.g. C) • Very efficient environment to develop in
- 12. Cons • Language syntax is not universally popular • In terms of analytics, many libraries are still slightly immature • Performance can be lacking (although there are many options to tune it) • Interpreter is effectively single-threaded
- 13. Python + Analytics • There are a number of excellent libraries available for analytics applications: • NumPy + SciPy • matplotlib • pandas • scikits • Some packages (e.g. pandas) are designed to replicate the 'feel' and functionality of analysis operations in R
- 14. NumPy + SciPy • Using NumPy + SciPy + matplotlib provides an experience similar to using an interactive R/Matlab environment • Supports vectorization and BLAS integration • Add ipython for more goodness
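
A minimal sketch of what "vectorization and BLAS integration" looks like in practice, assuming NumPy is installed. The array values are illustrative only and do not come from the talk.

```python
import numpy as np

# Illustrative data only: the numbers 1..10 as floats.
x = np.arange(1, 11, dtype=float)

# Loop version: square each element one at a time.
squares_loop = np.empty_like(x)
for i in range(x.size):
    squares_loop[i] = x[i] * x[i]

# Vectorized version: one expression over the whole array, no explicit loop.
squares_vec = x * x

# Matrix-vector product. NumPy typically hands this off to the BLAS
# library it was built against.
M = np.arange(1, 101, dtype=float).reshape(10, 10)
y = M @ x
```

Both versions produce the same squares; the vectorized form is shorter and usually much faster on large arrays because the loop runs in compiled code.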
- 15. Tips • Use ipython! • Check out: • http://pandas.pydata.org/ • http://statsmodels.sourceforge.net/ • http://scikit-learn.org
- 16. Comparisons

  | R | Python (NumPy) |
  | --- | --- |
  | `x <- 1:10` | `x = arange(1,11)` |
  | `x <- seq(1, 2, .2)` | `x = arange(1,2,.2)` |
  | `x <- seq(1,2, length.out=15)` | `x = linspace(1,2,15)` |
  | `M <- matrix(1:100, 10, 10)` | `M = arange(1,101).reshape(10,10)` |
  | `x[x < 1.5]` | `x[ x < 1.5 ]` |
  | `X <- cbind(a,b)` | `X = column_stack((a,b))` |
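
The NumPy side of the comparison can be run as the sketch below, assuming NumPy is installed. The arrays `a` and `b` are hypothetical placeholders, and the comments flag two places where the NumPy call is not an exact match for its R counterpart.

```python
import numpy as np

x1 = np.arange(1, 11)        # R: x <- 1:10
x2 = np.arange(1, 2, 0.2)    # R: seq(1, 2, .2); note arange excludes the stop value 2
x3 = np.linspace(1, 2, 15)   # R: seq(1, 2, length.out=15); both endpoints included

# reshape fills row by row by default, while R's matrix() fills column by column.
M = np.arange(1, 101).reshape(10, 10)
# order="F" gives the column-by-column layout that matches R.
M_r = np.arange(1, 101).reshape(10, 10, order="F")

small = x3[x3 < 1.5]         # R: x[x < 1.5] (boolean-mask indexing)

# Hypothetical example vectors, not from the slides.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
X = np.column_stack((a, b))  # R: X <- cbind(a, b)
```

The row-major versus column-major difference is the one most likely to cause surprises when porting matrix code between the two environments.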
- 17. Java/C++ • The ultimate in power/flexibility • Also the ultimate in development time and effort • Let's just look at C++ briefly
- 18. C++ • Old but still very popular • Just had a revamp (C++11, was C++0x) • Mostly competes with Java on the server side • Everything else (JVM, R, Python) is written in C/C++ • Both R and Python provide easy ways to interface with C/C++ code • This is used a lot
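
As one illustration of interfacing with C code from Python, the sketch below uses the standard-library `ctypes` module to call the C function `strlen`. It assumes a POSIX system (Linux or macOS), where `ctypes.CDLL(None)` exposes the C library already loaded into the Python process; it is only one of several possible routes and not necessarily the one demonstrated in the talk.

```python
import ctypes

# Assumption: POSIX system. CDLL(None) gives access to symbols already
# loaded into the running process, which includes the C standard library.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Call the C function directly with a bytes object.
n = libc.strlen(b"Armadillo")
```

Declaring `argtypes` and `restype` up front lets `ctypes` check and convert arguments instead of guessing at the C types.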
- 19. Pros • Flexibility • Lots of libraries available • Control of resources for performance-critical apps (e.g. memory) • C++11 adds a lot of nice stuff (finally)
- 20. Cons • Lots of effort • Lots of hidden traps for the unwary • Initial experience may be a large productivity hit • Effort in porting between systems • There is "modern" C++ (which is actually pretty nice) and everything else (which isn't so nice)
- 21. Examples • Let's look at a sample library • This one is called Armadillo (http://arma.sourceforge.net/) • Developed in Australia (NICTA / Univ. Queensland) • Contains functions for numerical applications and some statistical functions • Modern, efficient use of C++
- 22. Armadillo • Armadillo supports vectorized operations • Also integrates with a BLAS • Example (see console)
- 23. Simple Example • Using the Box-Jenkins airline passenger data • Classic dataset • 12 years of monthly airline passenger observations (144 in all)
- 24. Passenger Dataset
- 25. Linear Model • We will use a simple linear model (explains 85% of the variance of this data): $Ax = b$, where $A = \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ 1 & t_3 \\ \vdots & \vdots \end{pmatrix}$
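
A minimal sketch of fitting this kind of model by least squares with NumPy. The data here is a synthetic stand-in (a noisy linear trend over 144 points), not the airline passenger series, and the trend parameters are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Synthetic stand-in data, NOT the airline passenger series:
# 144 time points with an assumed linear trend plus random noise.
rng = np.random.default_rng(0)
t = np.arange(1, 145, dtype=float)
b = 100.0 + 2.5 * t + rng.normal(scale=10.0, size=t.size)

# Design matrix A with a column of ones (intercept) and a column of t values.
A = np.column_stack((np.ones_like(t), t))

# Solve Ax = b in the least-squares sense.
x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
intercept, slope = x

# Share of variance explained by the fitted line (R squared).
fitted = A @ x
r_squared = 1.0 - np.sum((b - fitted) ** 2) / np.sum((b - b.mean()) ** 2)
```

The column of ones in `A` is what lets the fitted line have a non-zero intercept; dropping it would force the line through the origin.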
- 26. Conclusion • Use the toolkit that's most appropriate for you • A common approach is to use e.g. R for prototyping and model selection and (if required) switch to a higher-performance implementation for production • If you have time, learn all of them!
- 27. Language Map • [Figure: R, Python, Java, Octave, Ruby and C/C++ placed on a map with axes labelled "Dynamic Typing" / "Static Typing" and "Interactivity" / "Performance, complexity"]
- 28. Resources
