(not so) Big Data with R
GUR FLtauR
Matthieu Cornec
Matthieu.cornec@cdiscount.com
10/09/2013
Outline
• A- Intro
• B- Problem setup
• C- 3 strategies
• D- Packages: RSQLite, ff and biglm, data.sample
• E- Conclusion
1 – Intro
Problem setup
- Your CSV file is too big to import into R: say, several tens of GB.
- Typically, your first read.table ends up with the error message
"cannot allocate vector of size XXX"
How to fix it?
It depends on:
- what you want to do (data management, SQL-like queries, data mining, …)
- your environment (corporate, with a data warehouse?)
- the size of your data
A quick look at a small slice of the file can also guide the choice (see the sketch below).
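Before picking a strategy, it can help to peek at a small slice of the file. A minimal sketch, assuming a large CSV such as the airlines.csv file built later in these slides:
## Read only the first 1000 rows to inspect the structure without exhausting RAM
slice <- read.csv("airlines.csv", nrows = 1000)
str(slice)
## The slice's memory footprint gives a rough per-row cost,
## which you can scale by the total number of rows
object.size(slice)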
Three basic strategies
• Buy memory in a cloud environment
- can handle several tens of GB
- cheap (1.5 euros per hour for 60 GB)
- no need to rewrite any of your code
But you need to configure it (see for example )
→ Preferred strategy in most cases
• Try packages for SQL-like needs: ff, RSQLite
- not limited to RAM (several tens of GB)
But no advanced data-mining libraries,
and you need to rewrite your code…
• Sampling: the data.sample package
Dataset
• http://stat-computing.org/dataexpo/2009/the-data.html
• More than 100 million observations, 12 GB
The data comes originally from RITA where it is described in detail. You can
download the data there, or from the bzipped csv files listed below. These
files have derivable variables removed, are packaged in yearly chunks and
have been more heavily compressed than the originals.
Download individual years:
1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008
29 variables
   Name        Description
1  Year        1987-2008
2  Month       1-12
3  DayofMonth  1-31
4  DayOfWeek   1 (Monday) - 7 (Sunday)
5  DepTime     actual departure time (local, hhmm)
6  …
1. Import the data files and create one large CSV file
## Import the data from http://stat-computing.org/dataexpo/2009/the-data.html
for (year in 1987:2008) {
  file.name <- paste(year, "csv.bz2", sep = ".")
  if (!file.exists(file.name)) {
    url.text <- paste("http://stat-computing.org/dataexpo/2009/",
                      year, ".csv.bz2", sep = "")
    cat("Downloading missing data file ", file.name, "\n", sep = "")
    download.file(url.text, file.name)
  }
}
## Create one large data file named airlines.csv by appending each yearly file
first <- TRUE
csv.file <- "airlines.csv"   # write the combined data to this file
csv.con <- file(csv.file, open = "w")
system.time(
  for (year in 1987:2008) {
    file.name <- paste(year, "csv.bz2", sep = ".")
    cat("Processing ", file.name, "\n", sep = "")
    d <- read.csv(file.name)
    ## Append, writing the header only on the first pass
    write.table(d, file = csv.con, sep = ",",
                row.names = FALSE, col.names = first)
    first <- FALSE
  }
)
close(csv.con)
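A quick sanity check on the combined file (a minimal sketch): since col.names is TRUE only on the first pass, the header should appear exactly once, on the first line.
## The first line should be the single header row,
## the second line the first 1987 record
readLines("airlines.csv", n = 2)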
BigMemory Package
## 09/09/2013: does not seem to be available on Windows for R 3.0.0
install.packages("bigmemory", repos = "http://R-Forge.R-project.org")
install.packages("biganalytics", repos = "http://R-Forge.R-project.org")
# library(bigmemory)
# x <- read.big.matrix("airlines.csv", type = "integer", header = TRUE,
#                      backingfile = "airline.bin",
#                      descriptorfile = "airline.desc", extraCols = "Age")
# library(biganalytics)
# blm <- biglm.big.matrix(ArrDelay ~ Age + Year, data = x)
ff package
library(ffbase)
system.time(hhp <- read.table.ffdf(file = "airlines.csv",
                                   FUN = "read.csv", na.strings = "NA",
                                   nrows = 10000000))
# takes 1 min 40 sec
# with no nrows argument you get an error message:
# ffbase does not support the character type
class(hhp)
dim(hhp)
str(hhp[1:10, ])
result <- list()
## Some basic showing off
result$UniqueCarrier <- unique(hhp$UniqueCarrier)
# 15 sec
## Basic example of the operators is.na.ff, the ! operator and sum.ff
sum(!is.na(hhp$ArrDelay))
## all and any
any(is.na(hhp$ArrDelay))
all(!is.na(hhp$ArrDelay))
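The sum.ff and is.na.ff operators above are already enough to compute a mean without pulling the full column into RAM; a minimal sketch (assuming sum.ff accepts na.rm like base sum does):
## Mean arrival delay computed chunkwise from the operators shown above
sum(hhp$ArrDelay, na.rm = TRUE) / sum(!is.na(hhp$ArrDelay))
## recent ffbase versions also provide mean for ff vectors directly (assumption):
# mean(hhp$ArrDelay, na.rm = TRUE)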
ff package and biglm
##
## Fit a linear model using biglm
##
require(biglm)
mymodel <- bigglm(ArrDelay ~ -1 + DayOfWeek, data = hhp)
# takes 30 sec for 10M rows
summary(mymodel)
predict(mymodel, newdata = hhp)
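bigglm fits the model by streaming over the data in chunks, so only a bounded slice of rows sits in RAM at any time. A minimal sketch; the chunksize argument comes from biglm's bigglm, and it is an assumption that the ffdf method forwards it:
## Extract coefficients from the streamed fit
coef(mymodel)
## Refit with an explicit chunk size (rows processed per pass)
mymodel2 <- bigglm(ArrDelay ~ -1 + DayOfWeek, data = hhp, chunksize = 100000)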
RSQLite
library(RSQLite)
library(sqldf)
library(foreign)
# Create an empty database (you can skip this step if it already exists),
# then read the csv into a table called baseflux in the testingdb SQLite database
sqldf("attach testingdb as new")
read.csv.sql("airlines.csv",
             sql = "create table baseflux as select * from file",
             dbname = "testingdb", row.names = FALSE, eol = "\n")
# on Windows, specify eol = "\n"
# takes 2.5 hours
# Look at the first ten rows
sqldf("select * from baseflux limit 10", dbname = "testingdb")
# takes about 1 minute
# Count the flights with distance greater than 500 departing from SFO
sqldf("select count(*) as nb
       from baseflux
       where Distance > 500
       and Origin = 'SFO'",
      dbname = "testingdb")
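The same pattern extends to SQL aggregation, keeping the heavy lifting inside SQLite. A sketch, with column names assumed from the dataset description (the missing-value caveat of the next slide applies here too):
# Average arrival delay per carrier, computed inside the database
sqldf("select UniqueCarrier, avg(ArrDelay) as meandelay
       from baseflux
       group by UniqueCarrier",
      dbname = "testingdb")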
RSQLite
## If your intention is to read the file into R immediately after
## loading it into the database, and you don't really need the
## database after that, then:
airlines <- read.csv.sql("airlines.csv", sql = "select * from file", eol = "\n")
######
## NB: the package does not handle missing values.
## Translate the empty fields to some number that will
## represent NA, then fix it up on the R end (see the sketch below).
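A minimal fix-up sketch on the R side; the sentinel value -999 is purely hypothetical, pick one that cannot occur in your data:
## Coerce the imported column to numeric; non-numeric leftovers become NA
airlines$ArrDelay <- as.numeric(as.character(airlines$ArrDelay))
## If a sentinel such as -999 (hypothetical) was written at import time:
airlines$ArrDelay[airlines$ArrDelay == -999] <- NA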
Sampling is bad for...
• Reporting
The boss wants the exact growth rate, not a statistical estimate…
• Data management
You will not be able to look up the record of one particular customer
Sampling is good for analysis
Because
1. what matters is the order of magnitude, not the exact result
2. sampling error is very small compared to model error, measurement error, estimation error, model noise, …
3. sampling error depends on the size of the sample, not on the size of the whole dataset (see the sketch after this list)
4. everything is a sample in the end
5. if sampling works very badly, then your conclusions are not robust anyway
6. anyway, how would we deal with non-linear complexity, even in the cloud?
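Point 3 is easy to check by simulation; a minimal sketch:
## The standard error of a sample mean depends on n, not on the population size
set.seed(1)
pop.small <- rnorm(1e5)   # population of 100,000
pop.big   <- rnorm(1e7)   # population of 10 million
n <- 1e4
sd(replicate(200, mean(sample(pop.small, n))))  # close to 1/sqrt(n) = 0.01
sd(replicate(200, mean(sample(pop.big,   n))))  # about the same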
data.sample
Features of data.sample
• it works on your laptop, whatever your RAM is; it just takes time
• no need to install other big-data software (RDBMS, NoSQL) on top of R
• no need to rewrite all your code, just change one single line:
data.sample takes the same arguments as read.table, so there is nothing to learn
Simulations
Model: Y = 3X + 1{G=A} + 2·1{G=B} + 3·1{G=C} + e   (1{·} is the indicator function)
X = 1, …, N; G a discrete random variable; e some noise
Simulate 100 million observations: 2.3 GB
Code
dataset <- data.sample("simulations.csv", sep = ",", header = TRUE)
# takes 12 min on my laptop
t <- lm(y ~ -1 + x + g, data = dataset)
summary(t)
Call: lm(formula = y ~ -1 + x + g, data = dataset)
Coefficients:
     x      gA      gB      gC
3.0000  0.9984  1.9996  2.9963
data.sample package
install.packages("D:/U/Data.sample/data.sample_1.0.zip", repos = NULL)
library(data.sample)
system.time(resultsample <-
  data.sample(file = "airlines.csv", header = TRUE, sep = ",")$df)
# takes 52 minutes on my laptop if you don't know the number of records
# this step is done only once!
data.sample package
# fit your linear model
mymodelsample <- lm(ArrDelay ~ -1 + as.factor(DayOfWeek),
                    data = resultsample)
summary(mymodelsample)
                       Estimate Std. Error t value Pr(>|t|)
as.factor(DayOfWeek)1   6.58383    0.08041   81.88   <2e-16 ***
as.factor(DayOfWeek)2   6.04881    0.08054   75.10   <2e-16 ***
as.factor(DayOfWeek)3   6.80039    0.08037   84.61   <2e-16 ***
as.factor(DayOfWeek)4   8.96406    0.08045  111.42   <2e-16 ***
as.factor(DayOfWeek)5   9.45303    0.08015  117.94   <2e-16 ***
as.factor(DayOfWeek)6   4.15234    0.08535   48.65   <2e-16 ***
as.factor(DayOfWeek)7   6.40236    0.08222   77.87   <2e-16 ***
Conclusion
Strategy      SQL-like  Datamining              Beyond the RAM  Pros                                        Cons
Cloud         OK        OK                      OK              No rewrite, cheap                           Cloud configuration
ff, biglm     OK        KO (but regression OK)  OK              Not limited to RAM                          Rewrite; very limited for datamining
RSQLite       OK        KO                      OK              Not limited to RAM                          Rewrite; no datamining
data.sample   OK        OK                      OK              No rewrite, fast coding, can use all libraries  No reporting; lack of theoretical results
data.table    OK        KO                      KO              Fast (index)                                Limited to RAM; no datamining