(not so) Big Data with R - Presentation Transcript

  • (not so) Big Data with R
    GUR Fltaur
    Matthieu Cornec (Matthieu.cornec@cdiscount.com)
    10/09/2013
    Cdiscount.com - Commark
  • Outline
    A. Intro
    B. Problem setup
    C. Three strategies
    D. Packages: RSQLite, ff and biglm, data.sample
    E. Conclusion
  • 1 - Intro
    Problem setup:
    - Your csv file is too big to import into R, say tens of GB.
    - Typically, your first read.table ends up with the error message "Cannot allocate a vector of size XXX".
    How to fix it? It depends on:
    - what you want to do (data management, SQL-like queries, data mining, ...)
    - your environment (corporate, with a data warehouse?)
    - the size of your data
    A quick way to see the problem coming before a failed import is sketched below.
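    A minimal sketch to estimate, in advance, how much RAM the full table would need. The file name is a placeholder and the total row count is an assumption you must replace with your file's:

        ## Read a small sample of rows, measure their in-memory size,
        ## and extrapolate to the whole file.
        head_rows <- read.csv("yourfile.csv", nrows = 10000)
        bytes_per_row <- as.numeric(object.size(head_rows)) / nrow(head_rows)
        n_total <- 1e8                      # assumed total row count
        bytes_per_row * n_total / 2^30      # estimated GB needed in RAM

    If the estimate exceeds your RAM, read.table will fail and one of the strategies below is needed.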
  • Three basic strategies
    1. Buy memory in a cloud environment
       - can handle tens of GB
       - cheap (around 1.5 euros per hour for 60 GB)
       - no need to rewrite all your code
       - but you need to configure it (see for example ...)
       => the preferred strategy in most cases
    2. Try packages for SQL-like needs: ff, RSQLite
       - not limited to RAM (tens of GB)
       - but no advanced data-mining libraries, and you need to rewrite your code...
    3. Sampling: the data.sample package
  • Dataset
    http://stat-computing.org/dataexpo/2009/the-data.html
    More than 100 million observations, 12 GB.
    The data comes originally from RITA, where it is described in detail. You can download the data there, or from the bzipped csv files listed below. These files have derivable variables removed, are packaged in yearly chunks and have been more heavily compressed than the originals.
    Download individual years: 1987 to 2008.
    29 variables, including:
      1  Year        1987-2008
      2  Month       1-12
      3  DayofMonth  1-31
      4  DayOfWeek   1 (Monday) - 7 (Sunday)
      5  DepTime     actual departure time (local, hhmm)
      ...
  • 1. Import the data files and create one single large csv file

        ## Import the data from http://stat-computing.org/dataexpo/2009/the-data.html
        for (year in 1987:2008) {
          file.name <- paste(year, "csv.bz2", sep = ".")
          if (!file.exists(file.name)) {
            url.text <- paste("http://stat-computing.org/dataexpo/2009/",
                              year, ".csv.bz2", sep = "")
            cat("Downloading missing data file ", file.name, "\n", sep = "")
            download.file(url.text, file.name)
          }
        }

        ## Create one large data file named airlines.csv
        first <- TRUE
        csv.file <- "airlines.csv"     # write the combined data to this file
        csv.con <- file(csv.file, open = "w")
        system.time(
          for (year in 1987:2008) {
            file.name <- paste(year, "csv.bz2", sep = ".")
            cat("Processing ", file.name, "\n", sep = "")
            d <- read.csv(file.name)
            write.table(d, file = csv.con, sep = ",",
                        row.names = FALSE, col.names = first)
            first <- FALSE             # write the header only once
          }
        )
        close(csv.con)
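    Once the loop finishes, a quick sanity check is worthwhile (countLines is from the R.utils package, an extra dependency not used elsewhere in this deck):

        file.info("airlines.csv")$size / 2^30   # file size in GB, roughly 12
        R.utils::countLines("airlines.csv")     # total number of lines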
  • bigmemory package

        ## 09/09/2013: does not seem to exist on Windows for R 3.0.0
        install.packages("bigmemory",
                         repos = "http://R-Forge.R-project.org")
        install.packages("biganalytics",
                         repos = "http://R-Forge.R-project.org")
        # library(bigmemory)
        # x <- read.big.matrix("airlines.csv", type = "integer", header = TRUE,
        #                      backingfile = "airline.bin",
        #                      descriptorfile = "airline.desc", extraCols = "Age")
        # library(biganalytics)
        # blm <- biglm.big.matrix(ArrDelay ~ Age + Year, data = x)
  • ff package

        library(ffbase)
        system.time(
          hhp <- read.table.ffdf(file = "airlines.csv", FUN = "read.csv",
                                 na.strings = "NA", nrows = 10000000)
        )
        # takes 1 min 40 sec
        # with no nrows argument you get an error message:
        # ffbase does not support the char type
        class(hhp)
        dim(hhp)
        str(hhp[1:10, ])

        ## Some basic showing off
        result <- list()
        result$UniqueCarrier <- unique(hhp$UniqueCarrier)   # 15 sec
        ## Basic example of the operators is.na.ff, ! and sum.ff
        sum(!is.na(hhp$ArrDelay))
        ## all and any
        any(is.na(hhp$ArrDelay))
        all(!is.na(hhp$ArrDelay))
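    Beyond these one-liners, the general ff idiom is to walk the data chunk by chunk, so only one block sits in RAM at a time. A minimal sketch, assuming the hhp ffdf loaded above:

        library(ff)
        total <- 0; n <- 0
        for (i in chunk(hhp$ArrDelay)) {     # chunk() yields RAM-sized index ranges
          block <- hhp$ArrDelay[i]           # only this block is loaded into memory
          total <- total + sum(block, na.rm = TRUE)
          n <- n + sum(!is.na(block))
        }
        total / n                            # overall mean arrival delay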
  • ff package and biglm

        ## Make a linear model using biglm
        require(biglm)
        mymodel <- bigglm(ArrDelay ~ -1 + DayOfWeek, data = hhp)
        # takes 30 sec for 10M rows
        summary(mymodel)
        predict(mymodel, newdata = hhp)
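    The same model can also be fitted without ff, using biglm's own chunked-update pattern straight from the csv. A hedged sketch; the 1e6 chunk size is an arbitrary choice:

        library(biglm)
        con <- file("airlines.csv", open = "r")
        hdr <- read.csv(con, nrows = 1000)                  # first small batch
        hdr <- hdr[!is.na(hdr$ArrDelay), ]                  # drop NAs before fitting
        fit <- biglm(ArrDelay ~ DayOfWeek, data = hdr)
        repeat {                                            # fold in 1e6 rows at a time
          d <- tryCatch(read.csv(con, nrows = 1e6, header = FALSE,
                                 col.names = names(hdr)),
                        error = function(e) NULL)           # NULL at end of file
          if (is.null(d) || nrow(d) == 0) break
          fit <- update(fit, d[!is.na(d$ArrDelay), ])       # update the coefficients
        }
        close(con)
        summary(fit)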
  • RSQLite

        library(RSQLite)
        library(sqldf)
        library(foreign)
        # create an empty database
        # (you can skip this step if the database already exists)
        sqldf("attach testingdb as new")
        # read the file into a table called baseflux in the testingdb sqlite database
        read.csv.sql("airlines.csv",
                     sql = "create table baseflux as select * from file",
                     dbname = "testingdb", row.names = FALSE, eol = "\n")
        # on Windows, specify eol = "\n"
        # takes 2.5 hours
        # look at the first lines
        sqldf("select * from baseflux limit 10", dbname = "testingdb")
        # takes about 1 minute
        # count the number of flights with distance greater than 500, departing from SFO
        sqldf("select count(*) as nb from baseflux
                where Distance > 500 and Origin = 'SFO'",
              dbname = "testingdb")
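    The same pattern extends to any aggregation: the query runs inside SQLite, so nothing but the result ever enters R. A sketch reusing the baseflux table built above:

        sqldf("select Origin, avg(ArrDelay) as mean_delay
                 from baseflux
                group by Origin
                order by mean_delay desc
                limit 10",
              dbname = "testingdb")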
  • RSQLite

        ## If your intention was to read the file into R immediately after
        ## reading it into the database, and you don't really need the
        ## database after that, then:
        airlines <- read.csv.sql("airlines.csv",
                                 sql = "select * from file", eol = "\n")

        ## NB: the package does not handle missing values.
        ## Translate the empty fields to some number that will
        ## represent NA, and then fix it up on the R end.
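    The fix-up on the R end can be one line. A sketch assuming the empty fields were written as the sentinel -999 (an arbitrary assumption; pick a value that cannot occur in the data):

        airlines$ArrDelay[airlines$ArrDelay == -999] <- NA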
  • Sampling is bad for...
    - Reporting: the boss wants to know the accurate growth rate, not a statistical estimate...
    - Data management: you will not be able to access the record of one particular customer
  • Sampling is good for analysis, because:
    1. what matters is the order of magnitude, not the exact figure
    2. sampling error is very small compared to model error, measurement error, estimation error, model noise, ...
    3. sampling error depends on the size of the sample, not on the size of the whole dataset (see the sketch after this list)
    4. everything is a sample in the end
    5. if sampling works very badly, then your conclusions are not robust anyway
    6. and anyway, how would we deal with non-linear complexity, even in the cloud?
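    Point 3 in one simulation: with n = 10,000 sampled rows, the spread of a sample mean is sd/sqrt(n) whatever the population size. The population sizes below are kept small only so the sketch runs fast:

        set.seed(42)
        for (N in c(1e5, 1e6)) {
          pop <- rnorm(N, mean = 7, sd = 30)      # delay-like population
          means <- replicate(200, mean(sample(pop, 1e4)))
          cat("N =", N, " sd of sample means:", round(sd(means), 3), "\n")
        }
        30 / sqrt(1e4)                            # theoretical value: 0.3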
  • data.sample
    Features of data.sample:
    - it works on your laptop, whatever your RAM is; it just takes time
    - no need to install other Big Data software (RDB, NoSQL) on top of R
    - no need to rewrite all your code, just change one single line: data.sample takes the same arguments as read.table, so there is nothing to learn
    Simulations
    Model: Y = 3X + 1*1{G=A} + 2*1{G=B} + 3*1{G=C} + e,
    with X = 1, ..., N, G a discrete random variable and e some noise.
    Simulate 100 million observations: 2.3 GB.
    Code:

        dataset <- data.sample("simulations.csv", sep = ",", header = TRUE)
        # takes 12 min on my laptop
        t <- lm(y ~ ., data = dataset)
        summary(t)

    Output:

        Call: lm(formula = y ~ -1 + x + g, data = dataset)
        Coefficients:
             x     gA     gB     gC
        3.0000 0.9984 1.9996 2.9963
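    For reference, a file like the one simulated above can be reproduced with this sketch (N is reduced so it runs quickly; the slide uses 100 million rows):

        set.seed(1)
        N <- 1e5
        x <- seq_len(N)
        g <- sample(c("A", "B", "C"), N, replace = TRUE)
        y <- 3 * x + c(A = 1, B = 2, C = 3)[g] + rnorm(N)
        write.csv(data.frame(y, x, g), "simulations.csv", row.names = FALSE)
        coef(lm(y ~ -1 + x + g))   # close to 3, 1, 2, 3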
  • data.sample package

        install.packages("D:/U/Data.sample/data.sample_1.0.zip", repos = NULL)
        library(data.sample)
        system.time(
          resultsample <- data.sample(file = "airlines.csv",
                                      header = TRUE, sep = ",")$df
        )
        # takes 52 minutes on my laptop if you don't know the number of records
        # this step is done only once!
  • data.sample package

        # fit your linear model on the sample
        mymodelsample <- lm(ArrDelay ~ -1 + as.factor(DayOfWeek),
                            data = resultsample)
        summary(mymodelsample)

                               Estimate Std. Error t value Pr(>|t|)
        as.factor(DayOfWeek)1   6.58383    0.08041   81.88   <2e-16 ***
        as.factor(DayOfWeek)2   6.04881    0.08054   75.10   <2e-16 ***
        as.factor(DayOfWeek)3   6.80039    0.08037   84.61   <2e-16 ***
        as.factor(DayOfWeek)4   8.96406    0.08045  111.42   <2e-16 ***
        as.factor(DayOfWeek)5   9.45303    0.08015  117.94   <2e-16 ***
        as.factor(DayOfWeek)6   4.15234    0.08535   48.65   <2e-16 ***
        as.factor(DayOfWeek)7   6.40236    0.08222   77.87   <2e-16 ***
  • data.sample package [slide contains only a figure]
  • Conclusion

        Strategy     SQL-like  Datamining          Beyond RAM  Pros                                    Cons
        cloud        OK        OK                  OK          no rewrite, cheap                       cloud configuration
        ff, biglm    OK        KO (regression OK)  OK          not limited to RAM                      rewrite; very limited for datamining
        rsqlite      OK        KO                  OK          not limited to RAM                      rewrite; no datamining
        data.sample  OK        OK                  OK          no rewrite, fast coding, all libraries  no reporting; lack of theoretical results
        data.table   OK        KO                  KO          fast (indexed)                          limited to RAM; no datamining
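    data.table appears in the table but has no code elsewhere in the deck; a hedged sketch of its indexed style, workable only when the data fit in RAM:

        library(data.table)
        dt <- fread("airlines.csv")                 # fast csv reader
        setkey(dt, Origin)                          # build an index on Origin
        dt["SFO", .N]                               # indexed count of SFO departures
        dt[Distance > 500 & Origin == "SFO", .N]    # same query as the RSQLite slide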