Los Angeles R users group - Nov 17 2010 - Part 2
 

    Presentation Transcript

    • Introduction to the Future of R. Avram Aelony, November 2010. Wednesday, November 17, 2010
    • Talk Outline:
      I. Strengths
      II. Criticisms
      III. Challenges
      IV. Remedies and Solutions
      V. The Future
    • Quick disclaimer:
      - I don’t consider myself an R expert
      - I don’t have a crystal ball informing me of the Future
      - This talk is about polite observations
      - The future is dynamic
      YMMD <- your-mileage-may-differ()
    • R’s Strengths - many good things, too many to mention individually ... but let’s try...
    • Strengths of R
      - A high quality statistical platform, yielding reproducible results
      - Open Source, free and available
      - Large, active community
      - Intuitive language structure
      - Data as rows and columns
      - Package plugin architecture: there are many packages, with the top packages in widespread use
      - Distributed contributions, written, offered, and controlled by many individuals
      - Data processing for most individual needs
      - Emerging success and increasing corporate adoption, e.g. for some corporate needs (often prototyping and ad hoc analytics)
    • Strengths of R
      More succinctly... based on a paraphrasing of a post by Ted Dunning *
      I. Library
      II. Language
      III. Community
      * http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html
    • Criticisms of R
      - Small grievances: syntax, elegance, and managing complexity
        “Most packages are very good, but I regret to say some are pretty inefficient and others downright dangerous.” - Bill Venables, quote from 2007
        http://www.mail-archive.com/r-help@r-project.org/msg06853.html
        “...R functions used to be lean and mean, and now they’re full of exception-handling and calls to other packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into syntax like “GOTO 120...”” - comment taken from Gelman blog on the future of R.
        http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html
      - Larger grievances: memory and inefficiency
        “One of the most vexing issues in R is memory. For anyone who works with large datasets - even if you have 64-bit R running and lots (e.g., 18Gb) of RAM, memory can still confound, frustrate, and stymie even experienced R users.”
        http://www.matthewckeller.com/html/memory.html
    • However, greater challenges for R lie ahead
      I. Big Data is coming...
      II. Isn’t Big Data already here?
      How can we imagine an ideal environment to address Big Data?
    • What is Big Data?
      - "Every 2 Days We Create As Much Information As We Did Up To 2003" - Eric Schmidt, Chairman & CEO, Google.
        http://techcrunch.com/2010/08/04/schmidt-data/
      - "Data is abundant, Information is useful, Knowledge is precious."
        http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html
      - Freshness, this data will self destruct in 5 seconds... !!
        "How Much Time Do You Have Before Web-Generated Leads Go Cold?"
        http://www.matrixintegratedmarketing.com/MIT.pdf
      - Get ready: “Web Scale Big Data - 100’s of Terabytes” - John Sichi, Facebook, on intended usage with Hive.
        http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010 slide #6.
    • What is Big Data? Wikipedia - http://en.wikipedia.org/wiki/Big_data
    • Solving the “Big” Data problem ... as I see it, there are 5 competing possible solution “avenues”
    • The “Big” Data problem: Solution #1
      Use R in conjunction with other specialized tools.
      Examples:
      - R remains a language for small datasets but has “hooks” and “bridges” that enable use with MapReduce style tools (Hadoop, Streaming, Hive, Pig, Cascading, others...)
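To make the “MapReduce style” idea in Solution #1 concrete, here is a minimal sketch in plain R of the map and reduce steps on a handful of tab-separated lines. The function names `map_line` and `reduce_pairs` and the sample input are illustrative assumptions, not part of the talk or of any Hadoop API. In a real streaming job the same logic would presumably read its lines from standard input rather than from a vector.

```r
# Mapper (hypothetical helper): turn one "key<TAB>value" line into a (key, value) pair.
map_line <- function(line) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  list(key = fields[1], value = as.numeric(fields[2]))
}

# Reducer (hypothetical helper): sum the values that share a key.
reduce_pairs <- function(pairs) {
  keys <- vapply(pairs, function(p) p$key, character(1))
  values <- vapply(pairs, function(p) p$value, numeric(1))
  tapply(values, keys, sum)
}

# Made-up sample input standing in for a much larger stream.
lines <- c("a\t1", "b\t2", "a\t3")
totals <- reduce_pairs(lapply(lines, map_line))
print(totals)
```

The point of the split is that the mapper looks at one line at a time and the reducer only needs the pairs for the keys it is given, so neither step requires the whole dataset in memory.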
    • The “Big” Data problem: Solution #2
      Packages that enable new functionality for reading and processing very large data sets.
      Examples:
      - Saptarshi Guha’s RHIPE (R and Hadoop Processing Environment)
      - Kane & Emerson’s bigmemory
      - Adler et al.’s ff package
      - Henrik Bengtsson’s R.huge package (deprecated)
      - (many new yet-to-be-developed possibilities here)
      So... enhance functions, but make no enhancements to the core language
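The general idea behind keeping data outside of RAM can be sketched with base R alone: store the values in a binary file and read back only the slice that is needed. This is an illustration of the concept under simple assumptions (8-byte doubles, a hypothetical helper named `read_slice`); it is not the API of bigmemory, ff, or any other package listed above.

```r
# Write 100 doubles to a binary file; each double takes 8 bytes on disk.
path <- tempfile()
writeBin(as.numeric(1:100), path)

# Hypothetical helper: read n doubles starting at 1-based position `start`,
# without loading the rest of the file.
read_slice <- function(path, start, n) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  seek(con, where = (start - 1) * 8)  # jump to the byte offset of element `start`
  readBin(con, what = "numeric", n = n)
}

slice <- read_slice(path, start = 11, n = 5)
print(slice)
```

Only the five requested values are ever held in memory, which is the property the packages above provide in a far more complete form.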
    • The “Big” Data problem: Solution #3
      Same language but have R “do the right thing” under the hood.
      Examples:
      - Out of memory algorithms, think: “I see you’re trying to analyze a sizable amount of data...”
      - Either seamlessly or after user approval to go ahead...
      # perhaps, perhaps... (hypothetical: enormous.data is not an existing read.table argument)
      d <- read.table(file = "s3://mybucket.name", enormous.data = TRUE)
      or if possible, enhance core language as well as functionality!!!
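As one small example of what an “out of memory algorithm” from Solution #3 might look like, the sketch below computes a mean by reading a file in fixed-size chunks and keeping only a running total and count. The function name `chunked_mean`, the chunk size, and the one-number-per-line file layout are assumptions made for illustration; nothing like this is claimed to exist inside R’s `read.table`.

```r
# Hypothetical helper: mean of a file holding one number per line,
# computed without ever holding more than `chunk_size` lines in memory.
chunked_mean <- function(path, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  count <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size)  # next chunk from the open connection
    if (length(chunk) == 0) break            # end of file reached
    values <- as.numeric(chunk)
    total <- total + sum(values)
    count <- count + length(values)
  }
  total / count
}

# Small made-up file standing in for a very large one.
tmp <- tempfile()
writeLines(as.character(1:10), tmp)
result <- chunked_mean(tmp, chunk_size = 3L)
print(result)
```

The same pattern works for any statistic that can be updated incrementally (sums, counts, minima, maxima); statistics such as medians need a different approach.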
    • The “Big” Data problem: Solution #4 - Completely start over
      2008: http://www.stat.auckland.ac.nz/%7Eihaka/downloads/Compstat-2008.pdf
    • The “Big” Data problem:
      2010: http://www.stat.auckland.ac.nz/%7Eihaka/downloads/JSM-2010.pdf
    • The “Big” Data problem:
      The Ihaka/Lang “Back to the Future” paper came out in 2008.
      The Ihaka “Lessons Learned” 2010 paper mentions:
      - the need for an “effective language for handling large-scale computations”
      - nostalgia for Lisp
      Have there been any Lisp-like advances since then? What about Clojure?
    • The “Big” Data problem: Solution #5 - Does Clojure fit the bill?
      H0: Clojure already has many of the things Ross Ihaka would ask for
      H1: Really?
      -Rich Hickey http://clojure.org/rationale
      Clojure may be seen as a solution, or as an example path for R to follow, improve upon, or choose to differ from...
    • Clojure -Rich Hickey http://clojure.org
    • The problem with many new languages is that initially there are no libraries... Clojure already has many, and can use any Java library directly as necessary.
      - Core Clojure
      - Incanter: "a Clojure-based, R-like platform for statistical computing and graphics" http://incanter.org/
      - Infer: "a (Clojure) library for machine learning and statistical inference, designed to be used in real production systems." https://github.com/bradford/infer
      - Cascalog: “Data processing on Hadoop without the hassle”, “a Clojure-based query language for Hadoop”
    • What will the Future really hold for R?
    • Thanks for listening...
    • Appendix: A few slides on Clojure, and three powerful Clojure libraries: Incanter, Infer, Cascalog
    • Clojure - a quick tour -Rich Hickey http://clojure.org
    • David Edgar Liebke’s Incanter: Please see http://incanter.org/docs/data-sorcery-new.pdf for an excellent intro to Incanter.
    • Below are example snippets from Incanter
    • Bradford Cross’ Infer: "a (Clojure) library for machine learning and statistical inference, designed to be used in real production systems." https://github.com/bradford/infer
    • Nathan Marz’s Cascalog: http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html