SlideShare is now on Android. 15 million presentations at your fingertips.  Get the app

×
  • Share
  • Email
  • Embed
  • Like
  • Private Content
 

Scalable Data Analysis in R -- Lee Edlefsen

by on Sep 07, 2011

  • 8,695 views

For the past several decades the rising tide of technology -- especially the increasing speed of single processors -- has allowed the same data analysis code to run faster and on bigger data sets. ...

For the past several decades the rising tide of technology -- especially the increasing speed of single processors -- has allowed the same data analysis code to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM. To deal with this, we need software that can use multiple cores, multiple hard drives, and multiple computers.

That is, we need scalable data analysis software. It needs to scale from small data sets to huge ones, from using one core and one hard drive on one computer to using many cores and many hard drives on many computers, and from using local hardware to using remote clouds.

R is the ideal platform for scalable data analysis software. It is easy to add new functionality in the R environment, and easy to integrate it into existing functionality. R is also powerful, flexible and forgiving.

I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR. A key part of this approach is to efficiently operate on "chunks" of data -- sets of rows of data for selected columns. I will discuss this approach from the point of view of:
- Storing data on disk
- Importing data from other sources
- Reading and writing of chunks of data
- Handling data in memory
- Using multiple cores on single computers
- Using multiple computers
- Automatically parallelizing "external memory" algorithms

Statistics

Views

Total Views
8,695
Views on SlideShare
4,230
Embed Views
4,465

Actions

Likes
3
Downloads
157
Comments
0

22 Embeds 4,465

http://blog.revolutionanalytics.com 2980
http://www.r-bloggers.com 1337
http://paper.li 37
http://ikarama-analytics.blogspot.com 28
http://feeds.feedburner.com 27
http://www.hanrss.com 10
http://twitter.com 7
https://twitter.com 7
http://revolution-computing.typepad.com 7
http://quantboxx.com 5
http://i-karama-analytics.blogspot.com 4
http://www.newsblur.com 3
http://xianguo.com 2
http://translate.googleusercontent.com 2
http://webcache.googleusercontent.com 2
http://web-18.citromail.hu 1
http://smartdatacollective.com 1
http://tweetree.com 1
http://m205.mail.qq.com 1
http://www.onlydoo.com 1
http://youtubecaptions.appspot.com 1
http://ikarama-analytics.blogspot.de 1
More...

Accessibility

Categories

Upload Details

Uploaded via SlideShare as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
Post Comment
Edit your comment

Scalable Data Analysis in R -- Lee Edlefsen Scalable Data Analysis in R -- Lee Edlefsen Presentation Transcript