Los Angeles R users group - Nov 17 2010 - Part 2

Introduction to the Future of R

Avram Aelony
November 2010

Wednesday, November 17, 2010

Talk Outline:
I. Strengths

II. Criticisms

III. Challenges

IV. Remedies and Solutions

V. The Future


Quick disclaimer:
- I don’t consider myself an R expert

- I don’t have a crystal ball informing of the Future

- This talk is about polite observations

- The future is dynamic

YMMD <- your-mileage-may-differ()

?

R’s Strengths
- many good things, too many to mention individually

... but let’s try...


Strengths of R

- A high quality statistical platform, yielding reproducible results

- Open Source, free and available

- Large, active community

- Intuitive language structure

- Data as rows and columns

- Package plugin architecture - there are many packages, top packages in widespread use

- Distributed contributions written/offered/controlled by many/multiple individuals

- Data processing for most individual needs.

- Emerging success and increasing corporate adoption
e.g. some corporate needs (often used for prototyping and ad hoc analytics)


Strengths of R

More succinctly... based on a paraphrasing of a post by Ted Dunning *

I. Library

II. Language

III. Community

* http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html


Criticisms of R
- Small grievances: syntax, elegance, and managing complexity
“Most packages are very good, but I regret to say some are pretty inefficient and others downright
dangerous.”
-Bill Venables, quote from 2007
http://www.mail-archive.com/r-help@r-project.org/msg06853.html

“...R functions used to be lean and mean, and now they’re full of exception-handling and calls to other
packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into
syntax like “GOTO 120...”
- comment taken from Gelman blog on the future of R.
http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html

- Larger grievances: memory and inefficiency
“One of the most vexing issues in R is memory. For anyone who works
with large datasets - even if you have 64-bit R running and lots (e.g.,
18Gb) of RAM, memory can still confound, frustrate, and stymie even
experienced R users.”

http://www.matthewckeller.com/html/memory.html


However, greater challenges for R lie ahead

I. Big Data is coming...

II. Isn’t Big Data already here ?

How can we imagine an ideal environment to address Big Data?


- What is Big Data?

"Every 2 Days We Create As Much Information As We Did Up To 2003"
- Eric Schmidt, Chairman & CEO, Google.
http://techcrunch.com/2010/08/04/schmidt-data/

"Data is abundant, Information is useful, Knowledge is precious."
http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html

- Freshness, this data will self destruct in 5 seconds... !!

"How Much Time Do You Have Before Web‐Generated Leads Go Cold?"
http://www.matrixintegratedmarketing.com/MIT.pdf

Get ready:
“Web Scale Big Data - 100’s of Terabytes”
-John Sichi, Facebook, on intended usage with Hive.
http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010 slide #6.


What is Big Data?

Wikipedia - http://en.wikipedia.org/wiki/Big_data

?

Solving the “Big” Data problem

... as I see it,

there are 5 competing possible solution “avenues”


The “Big” Data problem:

Solution #1

Use R in conjunction with other specialized tools.

Examples:
- R remains a language for small datasets but has “hooks” and “bridges”
that enable use with MapReduce style tools (Hadoop, Streaming, Hive, Pig, Cascading,
others...)
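
To make the "hooks and bridges" idea concrete, here is a minimal sketch of the map and reduce steps that a Hadoop Streaming style job might run through R. The helper names map_lines() and reduce_pairs() and the sample input are assumptions made for illustration; they do not come from the talk or from any package. A real streaming job would read records from standard input rather than from an in-memory vector.

```r
# Sketch of a streaming-style word count in base R.
# map_lines() and reduce_pairs() are hypothetical helper names.

# Mapper: turn each input line into "word<TAB>1" records.
map_lines <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  paste(words, 1L, sep = "\t")
}

# Reducer: sum the counts for each key.
reduce_pairs <- function(records) {
  parts  <- strsplit(records, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  counts <- as.integer(vapply(parts, `[`, character(1), 2))
  totals <- tapply(counts, keys, sum)
  paste(names(totals), totals, sep = "\t")
}

# Local stand-in for the framework: map, sort by key, then reduce.
input  <- c("big data is coming", "big data is here")
mapped <- sort(map_lines(input))
result <- reduce_pairs(mapped)
print(result)
```

The sort between the two steps stands in for the shuffle phase that the framework would normally perform, so the R code itself only ever handles a small slice of the data at a time.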



Solution #2
Packages that enable new functionality for reading
and processing very large data sets

Examples:
- Saptarshi Guha’s RHIPE (R and Hadoop Processing Environment)
- Kane & Emerson’s bigmemory
- Adler et al.’s ff package
- Henrik Bengtsson’s R.huge package (deprecated)
- (many new yet-to-be-developed possibilities here)

So....
enhance functions, but
no enhancements to the core language



Solution #3
Same language but have R “do the right thing”
under the hood.
Examples:
- Out of memory algorithms,
think: “I see you’re trying to analyze a sizable amount of data...”

- Either seamlessly or after user approval to go ahead...
# perhaps, perhaps... (wishful syntax: read.table() has no enormous.data argument today)
d <- read.table(file="s3://mybucket.name", enormous.data=TRUE)

or if possible, enhance core language as well as
functionality!!!
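
As an illustration of the out-of-memory idea in Solution #3, the sketch below computes a mean by streaming a file in fixed-size chunks, so that only one chunk is held in memory at a time. The function name chunked_mean(), the one-number-per-line file layout, and the chunk size are assumptions chosen for the example; this is not how R currently behaves under the hood.

```r
# Sketch: compute a mean without loading the whole file into memory.
# chunked_mean() is a hypothetical helper; it assumes one number per line.
chunked_mean <- function(path, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)  # read the next chunk only
    if (length(lines) == 0) break            # end of file reached
    values <- as.numeric(lines)
    total <- total + sum(values)             # keep running aggregates
    n <- n + length(values)
  }
  total / n
}

# Small local demonstration with a temporary file.
tmp <- tempfile()
writeLines(as.character(1:10000), tmp)
m <- chunked_mean(tmp, chunk_size = 256L)
unlink(tmp)
print(m)
```

Only the running sum and count survive between chunks, which is what would let a "do the right thing" implementation scale past available RAM for statistics that can be accumulated in this way.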



Solution #4 - Completely start over

2008

http://www.stat.auckland.ac.nz/%7Eihaka/downloads/Compstat-2008.pdf



2010

http://www.stat.auckland.ac.nz/%7Eihaka/downloads/JSM-2010.pdf



The Ihaka/Lang “Back to the Future” paper came out in 2008.

The Ihaka “Lessons Learned” 2010 paper mentions:

- the need for an “effective language for handling large-scale computations”

- nostalgia for Lisp

Have there been any Lisp-like advances since then?

What about Clojure ?



Solution #5 - Does Clojure fit the bill ?
H0: Clojure already has many of the things Ross Ihaka would ask for
H1: Really?

-Rich Hickey
http://clojure.org/rationale

Clojure may be seen as a solution, or as an example path for R to
follow, improve upon, or choose to differ...


Clojure
-Rich Hickey
http://clojure.org


The problem with many new languages is that initially there are no libraries...

Clojure already has many, and can use any Java library directly as necessary.

- Core Clojure

- Incanter: "a Clojure-based, R-like platform for statistical computing and graphics"
http://incanter.org/

- Infer: "a (Clojure) library for machine learning and statistical inference,
designed to be used in real production systems."
https://github.com/bradford/infer

- Cascalog: “Data processing on Hadoop without the hassle”
“a Clojure-based query language for Hadoop”


What will the Future really hold for R ?


Thanks for listening...


Appendix:

A few slides on Clojure, and three
powerful Clojure libraries:

Incanter
Infer
Cascalog


Clojure - a quick tour
-Rich Hickey
http://clojure.org


David Edgar Liebke’s Incanter

Please see http://incanter.org/docs/data-sorcery-new.pdf
for an excellent intro to Incanter.


Below are example snippets from Incanter


Bradford Cross’ Infer:
"a (Clojure) library for machine learning and statistical inference, designed
to be used in real production systems."

https://github.com/bradford/infer


Nathan Marz’s Cascalog:
http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html

