Introduction to the Future of R
Avram Aelony
November 2010
Wednesday, November 17, 2010

Talk Outline:
I. Strengths
II. Criticisms
III. Challenges
IV. Remedies and Solutions
V. The Future

Quick disclaimer:
- I don’t consider myself an R expert
- I don’t have a crystal ball informing me of the Future
- This talk is about polite observations
- The future is dynamic
YMMD <- your-mileage-may-differ()

R’s Strengths
- many good things, too many to mention individually ... but let’s try...

Strengths of R
- A high quality statistical platform, yielding reproducible results
- Open Source, free and available
- Large, active community
- Intuitive language structure
- Data as rows and columns
- Package plugin architecture: there are many packages, and the top packages are in widespread use
- Distributed contributions, written, offered, and controlled by many individuals
- Data processing for most individual needs
- Emerging success and increasing corporate adoption, e.g. for some corporate needs (often prototyping and ad hoc analytics)

Strengths of R
More succinctly, paraphrasing a post by Ted Dunning *
I. Library
II. Language
III. Community
* http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html

Criticisms of R
- Small grievances: syntax, elegance, and managing complexity
  “Most packages are very good, but I regret to say some are pretty inefficient and others downright dangerous.”
  -Bill Venables, quote from 2007
  http://firstname.lastname@example.org/msg06853.html
  “...R functions used to be lean and mean, and now they’re full of exception-handling and calls to other packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into syntax like ‘GOTO 120’...”
  -comment taken from Gelman blog on the future of R.
  http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html
- Larger grievances: memory and inefficiency
  “One of the most vexing issues in R is memory. For anyone who works with large datasets - even if you have 64-bit R running and lots (e.g., 18Gb) of RAM, memory can still confound, frustrate, and stymie even experienced R users.”
  http://www.matthewckeller.com/html/memory.html

However, greater challenges for R lie ahead
I. Big Data is coming...
II. Isn’t Big Data already here?
How can we imagine an ideal environment to address Big Data?

- What is Big Data?
  "Every 2 Days We Create As Much Information As We Did Up To 2003"
  -Eric Schmidt, Chairman & CEO, Google.
  http://techcrunch.com/2010/08/04/schmidt-data/
  "Data is abundant, Information is useful, Knowledge is precious."
  http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html
- Freshness: this data will self-destruct in 5 seconds... !!
  "How Much Time Do You Have Before Web-Generated Leads Go Cold?"
  http://www.matrixintegratedmarketing.com/MIT.pdf
Get ready:
  “Web Scale Big Data - 100’s of Terabytes”
  -John Sichi, Facebook, on intended usage with Hive.
  http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010 slide #6.

What is Big Data?
Wikipedia - http://en.wikipedia.org/wiki/Big_data

Solving the “Big” Data problem
... as I see it, there are 5 competing possible solution “avenues”

The “Big” Data problem: Solution #1
Use R in conjunction with other specialized tools.
Examples:
- R remains a language for small datasets but has “hooks” and “bridges” that enable use with MapReduce-style tools (Hadoop, Streaming, Hive, Pig, Cascading, others...)

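A minimal sketch of what such a “hook” might look like, assuming a Hadoop Streaming style contract in which mapper and reducer exchange tab-separated key/value lines. The function names map_words and reduce_counts are hypothetical, and the sort() call is only a local stand-in for the shuffle step a cluster would perform. In a real streaming job each function would sit in its own script, reading stdin and writing stdout.

```r
# Mapper (hypothetical name): emit one "word<TAB>1" line per word.
map_words <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]          # drop empty tokens
  paste(words, 1L, sep = "\t")
}

# Reducer (hypothetical name): sum the counts for each key.
reduce_counts <- function(kv_lines) {
  parts  <- strsplit(kv_lines, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1L)
  counts <- as.integer(vapply(parts, `[`, character(1), 2L))
  vapply(split(counts, keys), sum, integer(1))
}

# Local run: sort() stands in for the shuffle/sort between map and reduce.
input  <- c("Big Data is coming", "Isn't Big Data already here")
counts <- reduce_counts(sort(map_words(input)))
print(counts)
```

R stays the language for the small per-record logic, while the heavy lifting (splitting the input, moving keys between machines) is left to the other tool.
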
The “Big” Data problem: Solution #2
Packages that enable new functionality for reading and processing very large data sets
Examples:
- Saptarshi Guha’s RHIPE (R and Hadoop Processing Environment)
- Kane & Emerson’s bigmemory
- Adler et al.’s ff package
- Henrik Bengtsson’s R.huge package (deprecated)
- (many new yet-to-be-developed possibilities here)
So.... enhance functions, but no enhancements to the core language

The “Big” Data problem: Solution #3
Same language, but have R “do the right thing” under the hood.
Examples:
- Out-of-memory algorithms, think: “I see you’re trying to analyze a sizable amount of data...”
- Either seamlessly or after user approval to go ahead...
  # perhaps, perhaps... (wished-for arguments, not in today's read.table)
  d <- read.table(fn="s3://mybucket.name", enormous.data=TRUE)
or, if possible, enhance the core language as well as functionality!!!

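Until something like that exists, the “right thing” has to be done by hand. A minimal sketch in base R, assuming a comma-separated file with a header row: the helper chunked_mean is hypothetical, and it computes a column mean while only ever holding chunk_rows lines in memory.

```r
# Hypothetical helper (not part of base R): mean of one numeric column,
# computed chunk by chunk instead of loading the whole file.
chunked_mean <- function(path, column, chunk_rows = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # First line holds the column names; scan() strips any quoting.
  header <- scan(text = readLines(con, n = 1L), what = "", sep = ",",
                 quiet = TRUE)
  total <- 0
  n <- 0L

  repeat {
    lines <- readLines(con, n = chunk_rows)   # next chunk, or nothing at EOF
    if (length(lines) == 0L) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

# Small demonstration on a temporary file.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, x = as.numeric(1:10)), path,
          row.names = FALSE)
result <- chunked_mean(path, "x", chunk_rows = 3L)
print(result)
```

The point of Solution #3 is that the user should not have to write this loop: the same bookkeeping would happen under the hood.
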
The “Big” Data problem: Solution #4
- Completely start over
2008 http://www.stat.auckland.ac.nz/%7Eihaka/downloads/Compstat-2008.pdf

The “Big” Data problem:
2010 http://www.stat.auckland.ac.nz/%7Eihaka/downloads/JSM-2010.pdf

The “Big” Data problem:
The Ihaka/Lang “Back to the Future” paper came out in 2008.
The Ihaka “Lessons Learned” 2010 paper mentions:
- the need for an “effective language for handling large-scale computations”
- nostalgia for Lisp
Have there been any Lisp-like advances since then?
What about Clojure?

The “Big” Data problem: Solution #5
- Does Clojure fit the bill?
H0: Clojure already has many of the things Ross Ihaka would ask for
H1: Really?
-Rich Hickey http://clojure.org/rationale
Clojure may be seen as a solution, or as an example path for R to follow, improve upon, or choose to differ from...

Clojure
-Rich Hickey http://clojure.org

The problem with many new languages is that initially there are no libraries...
Clojure already has many, and can use any Java library directly as necessary.
- Core Clojure
- Incanter: "a Clojure-based, R-like platform for statistical computing and graphics" http://incanter.org/
- Infer: "a (Clojure) library for machine learning and statistical inference, designed to be used in real production systems." https://github.com/bradford/infer
- Cascalog: “Data processing on Hadoop without the hassle”, “a Clojure-based query language for Hadoop”

What will the Future really hold for R?

Thanks for listening...

Appendix:
A few slides on Clojure, and three powerful Clojure libraries:
- Incanter
- Infer
- Cascalog

Clojure - a quick tour
-Rich Hickey http://clojure.org

David Edgar Liebke’s Incanter
Please see http://incanter.org/docs/data-sorcery-new.pdf for an excellent intro to Incanter.

Below are example snippets from Incanter

Bradford Cross’ Infer:
"a (Clojure) library for machine learning and statistical inference, designed to be used in real production systems."
https://github.com/bradford/infer

Nathan Marz’s Cascalog:
http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html