(not so) Big Data with R
GUR Fltaur
Matthieu Cornec
Matthieu.cornec@cdiscount.com
10/09/2013
Cdiscount.com - Commark
Outline
• A- Intro
• B- Problem setup
• C- 3 strategies
• D- Packages: RSQLite, ff and biglm, data.sample
• E- Conclusion
1 – Intro
Problem setup
- Your csv file is too big to import into R, say several tens of GB.
- Typically, your first read.table call ends with the error message
"cannot allocate vector of size XXX"
How to fix it?
It depends on:
- What you want to do (data management with SQL-like queries, data mining, ...)
- Your environment (corporate, with a data warehouse?)
- The size of your data
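Before picking a strategy, it helps to estimate how much RAM the full file would need once loaded. A minimal sketch, assuming the first rows are representative of the whole file: read a small sample, measure it with object.size and extrapolate. The helper name estimate_ram_gb and the toy file are illustrative only.

```r
# Hypothetical helper: extrapolate the in-memory size from the first n_sample rows.
estimate_ram_gb <- function(path, total_rows, n_sample = 1000, ...) {
  sample_df <- read.csv(path, nrows = n_sample, ...)
  bytes_per_row <- as.numeric(object.size(sample_df)) / nrow(sample_df)
  bytes_per_row * total_rows / 1024^3
}

# Toy file standing in for a real csv (assumption: column types are stable).
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(Year = rep(2008L, 5000), ArrDelay = rnorm(5000))
write.csv(toy, tmp, row.names = FALSE)

est <- estimate_ram_gb(tmp, total_rows = 1e8)
cat("Estimated size for 1e8 rows:", round(est, 2), "GB\n")
```

If the estimate is well above the available RAM, one of the strategies below is needed.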
Three basic strategies
• Buy memory in a cloud environment
- Can handle several tens of GB
- Cheap (about 1.5 euros per hour for 60 GB)
- No need to rewrite all your code
But you need to configure it (see for example )
→ Preferred strategy in most cases
• Try dedicated packages such as ff or RSQLite, mainly for SQL-like needs
- Not limited to RAM (several tens of GB)
But no advanced data mining libraries
And you need to rewrite your code
• Sampling: the data.sample package
Dataset
• http://stat-computing.org/dataexpo/2009/the-data.html
• More than 100 million observations, 12 GB
The data comes originally from RITA where it is described in detail. You can
download the data there, or from the bzipped csv files listed below. These
files have derivable variables removed, are packaged in yearly chunks and
have been more heavily compressed than the originals.
Download individual years:
1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
1999, 2000, 2001,2002, 2003, 2004, 2005, 2006, 2007, 2008
29 variables
Name Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 ….
1 Import the data files and create a single large csv file
## Download the yearly files from http://stat-computing.org/dataexpo/2009/the-data.html
for (year in 1987:2008) {
  file.name <- paste(year, "csv.bz2", sep = ".")
  if (!file.exists(file.name)) {
    url.text <- paste("http://stat-computing.org/dataexpo/2009/",
                      year, ".csv.bz2", sep = "")
    cat("Downloading missing data file ", file.name, "\n", sep = "")
    # mode = "wb" keeps the compressed file intact on Windows
    download.file(url.text, file.name, mode = "wb")
  }
}
## Create a single large data file named airlines.csv by appending the yearly files
first <- TRUE
csv.file <- "airlines.csv" # Write combined integer-only data to this file
csv.con <- file(csv.file, open = "w")
system.time(
  for (year in 1987:2008) {
    file.name <- paste(year, "csv.bz2", sep = ".")
    cat("Processing ", file.name, "\n", sep = "")
    d <- read.csv(file.name) # .bz2 files are decompressed transparently
    ## Convert the strings to integers (conversion code not shown on the slide)
    write.table(d, file = csv.con, sep = ",",
                row.names = FALSE, col.names = first) # header written only once
    first <- FALSE
  }
)
close(csv.con)
BigMemory Package
## 09/09/2013: bigmemory does not seem to be available on Windows for R 3.0.0
install.packages("bigmemory", repos = "http://R-Forge.R-project.org")
install.packages("biganalytics", repos = "http://R-Forge.R-project.org")
# library(bigmemory)
# x <- read.big.matrix("airlines.csv", type = "integer", header = TRUE,
#                      backingfile = "airline.bin",
#                      descriptorfile = "airline.desc", extraCols = "Age")
# library(biganalytics)
# blm <- biglm.big.matrix(ArrDelay ~ Age + Year, data = x)
ff package
library(ffbase)
system.time(hhp <- read.table.ffdf(file = "airlines.csv",
                                   FUN = "read.csv", na.strings = "NA",
                                   nrows = 10000000))
# takes 1 min 40 s
# without the nrows argument, an error is raised:
# ffbase does not support the character type
class(hhp)
dim(hhp)
str(hhp[1:10, ])
result <- list()
## Some basic showing off
result$UniqueCarrier <- unique(hhp$UniqueCarrier)
# takes 15 s
## Basic example of the operators is.na.ff, ! and sum.ff
sum(!is.na(hhp$ArrDelay))
## all and any
any(is.na(hhp$ArrDelay))
all(!is.na(hhp$ArrDelay))
ff package and Biglm
##
## Fit a linear model using biglm
##
library(biglm)
mymodel <- bigglm(ArrDelay ~ -1 + DayOfWeek, data = hhp)
# takes 30 s for 10M rows
summary(mymodel)
predict(mymodel, newdata = hhp)
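A regression can be fitted without holding all rows in memory because least-squares estimates can be built up chunk by chunk. The sketch below illustrates that idea in base R by accumulating X'X and X'y over chunks. It is a simplified stand-in, not biglm's code (biglm is documented as using an incremental QR decomposition), and the function name chunked_ols and the simulated data are assumptions made for the example.

```r
# Accumulate the cross-products X'X and X'y one chunk at a time,
# then solve the normal equations once at the end.
# Here the chunks are slices of a data frame; in real use they would come from disk.
chunked_ols <- function(formula, data, chunk_size = 1000) {
  xtx <- NULL
  xty <- NULL
  starts <- seq(1, nrow(data), by = chunk_size)
  for (s in starts) {
    rows <- s:min(s + chunk_size - 1, nrow(data))
    mf <- model.frame(formula, data[rows, , drop = FALSE])
    X <- model.matrix(formula, mf)
    y <- model.response(mf)
    if (is.null(xtx)) {
      xtx <- crossprod(X)
      xty <- crossprod(X, y)
    } else {
      xtx <- xtx + crossprod(X)
      xty <- xty + crossprod(X, y)
    }
  }
  drop(solve(xtx, xty))
}

# Simulated delays (assumption: not the airline data).
set.seed(1)
n <- 20000
toy <- data.frame(DayOfWeek = factor(sample(1:7, n, replace = TRUE)))
toy$ArrDelay <- 5 + as.integer(toy$DayOfWeek) + rnorm(n)

beta_chunked <- chunked_ols(ArrDelay ~ -1 + DayOfWeek, toy)
beta_full <- coef(lm(ArrDelay ~ -1 + DayOfWeek, data = toy))
print(cbind(chunked = beta_chunked, full = beta_full))
```

The two columns should agree up to rounding error, since only the order of the computations differs.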
RSQLite
library(RSQLite)
library(sqldf)
library(foreign)
# create an empty database
# (this step can be skipped if the database already exists)
sqldf("attach 'testingdb' as new")
# read the csv file into a table called baseflux in the testingdb SQLite database
read.csv.sql("airlines.csv", sql = "create table baseflux as select * from file",
             dbname = "testingdb", row.names = FALSE, eol = "\n")
# on Windows, specify eol = "\n"
# takes 2.5 hours
# look at the first ten rows
sqldf("select * from baseflux limit 10", dbname = "testingdb")
# takes 1 minute?
# count the flights departing from San Francisco (SFO) whose distance is greater than 500
sqldf("select count(*) as nb
       from baseflux
       where distance > 500
       and Origin = 'SFO'",
      dbname = "testingdb")
RSQLite
## If your intention is to read the file into R immediately after
## loading it into the database, and you do not need the database
## afterwards, use:
airlines <- read.csv.sql("airlines.csv", sql = "select * from file", eol = "\n")
######
# NB: the package does not handle missing values.
# Translate the empty fields to some number that will represent NA,
# then fix it up on the R end.
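The missing-value workaround can be sketched in base R: write a sentinel value in place of the missing fields before loading, then turn the sentinel back into NA once the data are in R. The sentinel -9999 and the toy data are assumptions for illustration; pick a value that cannot occur in the real data. For a file that does not fit in RAM, step 1 would have to be applied chunk by chunk or with an external tool.

```r
sentinel <- -9999L  # assumed never to occur as a real value

# Toy data: read.csv turns the empty fields into NA.
raw <- read.csv(text = "ArrDelay,DepDelay\n12,5\n,3\n7,\n-4,0")

# Step 1 (before loading into SQLite): write the sentinel in place of NA.
encoded <- raw
encoded[is.na(encoded)] <- sentinel
tmp <- tempfile(fileext = ".csv")
write.csv(encoded, tmp, row.names = FALSE)

# Step 2 (on the R end, after querying): map the sentinel back to NA.
decoded <- read.csv(tmp)
decoded[decoded == sentinel] <- NA
print(decoded)
```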
Sampling is bad for...
• Reporting
The boss wants to know the exact growth rate, not a statistical estimate...
• Data management
You will not be able to look up the record of a particular customer
Sampling is good for analysis
Because
1. What matters is the order of magnitude, not the exact result
2. Sampling error is very small compared to model error, measurement error, estimation error, model noise, ...
3. Sampling error depends on the size of the sample, not on the size of the whole dataset
4. Everything is a sample in the end
5. When sampling works very badly, your conclusions are not robust
6. In any case, how would we deal with non-linear complexity, even in the cloud?
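Point 3 can be checked with a small simulation. The sketch below draws samples of the same size from two simulated populations of very different sizes and compares the spread of the sample means. The population sizes, the exponential distribution and the number of replications are arbitrary choices made for illustration.

```r
set.seed(42)

# Spread (standard deviation) of the sample mean over repeated samples of size n.
se_of_mean <- function(population, n, reps = 500) {
  sd(replicate(reps, mean(sample(population, n))))
}

small_pop <- rexp(1e5, rate = 1 / 10)  # 100,000 simulated delays
large_pop <- rexp(1e6, rate = 1 / 10)  # 1,000,000 simulated delays

se_small <- se_of_mean(small_pop, n = 1000)
se_large <- se_of_mean(large_pop, n = 1000)
se_big_n <- se_of_mean(large_pop, n = 10000)

# Same n gives a similar spread whatever the population size;
# a sample 10 times larger shrinks it by roughly sqrt(10).
print(c(se_small = se_small, se_large = se_large, se_big_n = se_big_n))
```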
data.sample
Features of data.sample
• It works on your laptop, whatever your RAM is; it just takes time
• No need to install other Big Data software (relational database, NoSQL) on top of R
• No need to rewrite all your code, just change one single line
data.sample takes the same arguments as read.table: nothing to learn
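How data.sample draws its sample is not described here. As one possible way to sample rows from a file that does not fit in RAM, the sketch below reads the csv through a connection in chunks and keeps a fixed-size uniform sample with reservoir sampling. The function name sample_csv and its arguments are hypothetical and are not the data.sample API.

```r
# Hypothetical helper: uniform sample of k rows, reading chunk_size lines at a time.
sample_csv <- function(path, k, chunk_size = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  reservoir <- character(0)
  seen <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    for (line in lines) {
      seen <- seen + 1
      if (seen <= k) {
        reservoir[seen] <- line        # fill the reservoir first
      } else {
        j <- sample.int(seen, 1)       # new row is kept with probability k / seen
        if (j <= k) reservoir[j] <- line
      }
    }
  }
  # Only the k sampled lines are parsed into a data frame.
  read.csv(text = c(header, reservoir), stringsAsFactors = FALSE)
}

# Toy file standing in for a large csv.
set.seed(123)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:50000, x = rnorm(50000)), tmp, row.names = FALSE)
smp <- sample_csv(tmp, k = 1000)
dim(smp)
```

Only one chunk of lines plus the reservoir is held in memory at any time, which is why the approach is limited by time rather than by RAM. The per-line loop is kept for readability; a version meant for 100 million rows would need to vectorise it.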
Simulations
Model: $Y = 3X + \mathbf{1}_{\{G=A\}} + 2\,\mathbf{1}_{\{G=B\}} + 3\,\mathbf{1}_{\{G=C\}} + e$
with $X = 1, \dots, N$, $G$ a discrete random variable and $e$ some noise
Simulate 100 million observations: 2.3 GB
Code
dataset <- data.sample("simulations.csv", sep = ",", header = TRUE)
# takes 12 min on my laptop
t <- lm(y ~ -1 + x + g, data = dataset)
summary(t)
Call: lm(formula = y ~ -1 + x + g, data = dataset)
Coefficients:
     x     gA     gB     gC
3.0000 0.9984 1.9996 2.9963
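A scaled-down version of this simulation can be run directly in memory to see what the fit should recover. The sketch below assumes standard normal noise, three equally likely groups and N = 10,000 instead of 100 million; these choices are illustrative and not taken from the slides.

```r
set.seed(2013)
N <- 10000
x <- 1:N
g <- factor(sample(c("A", "B", "C"), N, replace = TRUE))
e <- rnorm(N)  # assumed noise distribution

# Group effects 1, 2 and 3 for G = A, B and C.
group_effect <- c(A = 1, B = 2, C = 3)
y <- 3 * x + unname(group_effect[as.character(g)]) + e

sim <- data.frame(y = y, x = x, g = g)
fit <- lm(y ~ -1 + x + g, data = sim)
print(round(coef(fit), 3))
```

The estimates should land close to 3, 1, 2 and 3, in the same order as the coefficients printed above.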
data.sample package
install.packages("D:/U/Data.sample/data.sample_1.0.zip", repos = NULL)
library(data.sample)
system.time(resultsample <- data.sample(file = "airlines.csv", header = TRUE, sep = ",")$df)
# takes 52 minutes on my laptop if you don't know the number of records
# this step is done only once!
data.sample package
# fit your linear model
mymodelsample <- lm(ArrDelay ~ -1 + as.factor(DayOfWeek), data = resultsample)
summary(mymodelsample)
                      Estimate Std. Error t value Pr(>|t|)
as.factor(DayOfWeek)1  6.58383    0.08041   81.88   <2e-16 ***
as.factor(DayOfWeek)2  6.04881    0.08054   75.10   <2e-16 ***
as.factor(DayOfWeek)3  6.80039    0.08037   84.61   <2e-16 ***
as.factor(DayOfWeek)4  8.96406    0.08045  111.42   <2e-16 ***
as.factor(DayOfWeek)5  9.45303    0.08015  117.94   <2e-16 ***
as.factor(DayOfWeek)6  4.15234    0.08535   48.65   <2e-16 ***
as.factor(DayOfWeek)7  6.40236    0.08222   77.87   <2e-16 ***
Conclusion
Strategy    | SQL-like | Data mining           | Beyond the RAM | Pros                                           | Cons
cloud       | OK       | OK                    | OK             | No rewrite, cheap                              | Cloud configuration
ff, biglm   | OK       | KO, except regression | OK             | Not limited to RAM                             | Rewrite, very limited for data mining
RSQLite     | OK       | KO                    | OK             | Not limited to RAM                             | Rewrite, no data mining
data.sample | OK       | OK                    | OK             | No rewrite, fast coding, can use all libraries | No reporting, lack of theoretical results
data.table  | OK       | KO                    | KO             | Fast (index)                                   | Limited to RAM, no data mining

Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
Ortus Solutions, Corp
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
BibashShahi
 
AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)
HarpalGohil4
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
AlexanderRichford
 

Recently uploaded (20)

A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
 
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptxAI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
 
AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
 

Gur1009

  • 1. (not so) Big Data with R GUR Fltaur Matthieu Cornec Matthieu.cornec@cdiscount.com 10/09/2013 Cdiscount.com - Commark
  • 2. Outline
      • A- Intro
      • B- Problem setup
      • C- 3 strategies
      • D- Packages: RSQLite, ff and biglm, data.sample
      • E- Conclusion
  • 3. 1 – Intro
      Problem setup
      - Your CSV file is too big to import into R, say several tens of GB.
      - Typically, your first read.table call ends with the error message "cannot allocate vector of size XXX".
      How to fix it? It depends on:
      - What you want to do (data management with SQL-like queries, data mining, ...)
      - Your environment (corporate, with a data warehouse?)
      - The size of your data
  • 4. Three basic strategies
      • Buy memory in a cloud environment
        - Can handle several tens of GB
        - Cheap (1.5 euros per hour for 60 GB)
        - No need to rewrite all your code
        But you need to configure it (see for example )
        => Preferred strategy in most cases
      • Try packages for SQL-like needs: ff, RSQLite
        - Not limited to RAM (several tens of GB)
        But no advanced data mining libraries, and you need to rewrite your code
      • Sampling: data.sample package
  • 5. Dataset
      • http://stat-computing.org/dataexpo/2009/the-data.html
      • More than 100 million observations, 12 GB
      The data comes originally from RITA where it is described in detail. You can download the data there, or from the bzipped csv files listed below. These files have derivable variables removed, are packaged in yearly chunks and have been more heavily compressed than the originals.
      Download individual years: 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008
      29 variables
        Name        Description
      1 Year        1987-2008
      2 Month       1-12
      3 DayofMonth  1-31
      4 DayOfWeek   1 (Monday) - 7 (Sunday)
      5 DepTime     actual departure time (local, hhmm)
      6 ….
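A back-of-the-envelope estimate helps to see why a dataset of this size does not fit in the RAM of a typical laptop. The sketch below assumes every column is stored as a plain numeric or integer vector (8 or 4 bytes per value) and ignores R's per-object overhead and the temporary copies made while parsing. `estimate_gb` is a hypothetical helper written for this illustration, not part of any package.

```r
# Rough in-memory size of a data frame, assuming fixed-width numeric columns.
# estimate_gb is a hypothetical helper written for this illustration.
estimate_gb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^3
}

n_rows <- 100e6  # "more than 100 million observations"
n_cols <- 29     # number of variables in the airline data

gb_double  <- estimate_gb(n_rows, n_cols, bytes_per_value = 8)  # all doubles
gb_integer <- estimate_gb(n_rows, n_cols, bytes_per_value = 4)  # all integers

cat(sprintf("all doubles : %.1f GB\n", gb_double))
cat(sprintf("all integers: %.1f GB\n", gb_integer))
```

Under these assumptions the table needs roughly 11 to 22 GB once loaded, and read.table typically needs additional working memory on top of that while it parses the file.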
  • 6. 1 Import the data files and create one unique large csv file
      ## import the data from http://stat-computing.org/dataexpo/2009/the-data.html
      for (year in 1987:2008) {
        file.name <- paste(year, "csv.bz2", sep = ".")
        if (!file.exists(file.name)) {
          url.text <- paste("http://stat-computing.org/dataexpo/2009/", year, ".csv.bz2", sep = "")
          cat("Downloading missing data file ", file.name, "\n", sep = "")
          download.file(url.text, file.name)
        }
      }
      ## create a unique large data file named airlines.csv
      first <- TRUE
      csv.file <- "airlines.csv"
      # Write the combined data to this file
      csv.con <- file(csv.file, open = "w")
      system.time(
        for (year in 1987:2008) {
          file.name <- paste(year, "csv.bz2", sep = ".")
          cat("Processing ", file.name, "\n", sep = "")
          d <- read.csv(file.name)
          # Append this year; the header is written only for the first file
          write.table(d, file = csv.con, sep = ",", row.names = FALSE, col.names = first)
          first <- FALSE
        }
      )
      close(csv.con)
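The loop on slide 6 reads each yearly file fully into memory before writing it out. When even a single file is too large, the same idea can be applied chunk by chunk on an open connection. The following is a minimal base R sketch, not code from the talk: a tiny temporary file stands in for airlines.csv, and the chunk size and the running-mean computation are illustrative assumptions.

```r
# Build a small stand-in CSV so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(ArrDelay = c(5, NA, 12, -3, 8, 20, NA, 1, 0, 7)),
          tmp, row.names = FALSE)

chunk_size <- 4                        # rows per chunk (tiny, for illustration)
con <- file(tmp, open = "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]
col_names <- gsub('"', "", col_names)  # header line: strip the quotes

total <- 0
count <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break        # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  ok <- !is.na(chunk$ArrDelay)
  total <- total + sum(chunk$ArrDelay[ok])  # running sum of non-missing delays
  count <- count + sum(ok)                  # running count of non-missing delays
}
close(con)

mean_delay <- total / count
```

Only one chunk is held in memory at a time, so the peak memory use is set by `chunk_size` rather than by the size of the file.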
  • 7. BigMemory Package
      ## 09/09/2013: does not seem to exist on Windows for R 3.0.0
      install.packages("bigmemory", repos = "http://R-Forge.R-project.org")
      install.packages("biganalytics", repos = "http://R-Forge.R-project.org")
      # library(bigmemory)
      # x <- read.big.matrix("airlines.csv", type = "integer", header = TRUE,
      #                      backingfile = "airline.bin",
      #                      descriptorfile = "airline.desc", extraCols = "Age")
      # library(biganalytics)
      # blm <- biglm.big.matrix(ArrDelay ~ Age + Year, data = x)
  • 8. ff package
      library(ffbase)
      system.time(hhp <- read.table.ffdf(file = "airlines.csv", FUN = "read.csv", na.strings = "NA", nrows = 10000000))
      # takes 1 min 40 sec
      # with no nrows argument, there is an error message:
      # ffbase does not support the char type
      class(hhp)
      dim(hhp)
      str(hhp[1:10, ])
      result <- list()
      ## Some basic showoff
      result$UniqueCarrier <- unique(hhp$UniqueCarrier) # 15 sec
      ## Basic example of operators is.na.ff, the ! operator and sum.ff
      sum(!is.na(hhp$ArrDelay))
      ## all and any
      any(is.na(hhp$ArrDelay))
      all(!is.na(hhp$ArrDelay))
  • 9. ff package and Biglm
      ##
      ## Make a linear model using biglm
      ##
      require(biglm)
      mymodel <- bigglm(ArrDelay ~ -1 + DayOfWeek, data = hhp) # takes 30 sec for 10M rows
      summary(mymodel)
      predict(mymodel, newdata = hhp)
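A regression can be fitted without holding all rows in memory because the quantities it needs can be accumulated chunk by chunk. The sketch below illustrates that idea with a deliberately simple, swapped-in technique: summing X'X and X'y over chunks and solving the normal equations. It is not a description of what biglm does internally, and the simulated data, the chunk size and the model with an intercept are assumptions made for the example.

```r
set.seed(1)
n <- 10000
d <- data.frame(DayOfWeek = sample(1:7, n, replace = TRUE))
d$ArrDelay <- 2 + 0.5 * d$DayOfWeek + rnorm(n)   # simulated delays

# Accumulate X'X and X'y one chunk at a time; only one chunk is needed in memory.
xtx <- matrix(0, 2, 2)
xty <- matrix(0, 2, 1)
chunk_ids <- split(seq_len(n), ceiling(seq_len(n) / 1000))  # 10 chunks of 1000 rows
for (idx in chunk_ids) {
  X <- cbind(1, d$DayOfWeek[idx])      # design matrix: intercept + DayOfWeek
  y <- d$ArrDelay[idx]
  xtx <- xtx + crossprod(X)            # adds t(X) %*% X
  xty <- xty + crossprod(X, y)         # adds t(X) %*% y
}
beta_chunked <- drop(solve(xtx, xty))

# Same model fitted on the full data for comparison.
beta_full <- unname(coef(lm(ArrDelay ~ DayOfWeek, data = d)))
```

The two coefficient vectors agree up to numerical precision, which is why a chunk-wise fit gives the same answer as lm on data that would not fit in RAM.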
  • 10. RSQLite
      library(RSQLite)
      library(sqldf)
      library(foreign)
      # create an empty database.
      # can skip this step if database already exists.
      sqldf("attach testingdb as new")
      # read into table called baseflux in the testingdb sqlite database
      read.csv.sql("airlines.csv", sql = "create table baseflux as select * from file", dbname = "testingdb", row.names = F, eol = "\n")
      # on Windows, specify eol = "\n"
      # takes 2.5 hours
      # look at the first ten lines
      sqldf("select * from baseflux limit 10", dbname = "testingdb")
      # takes 1 minute ?
      # count the number of flights whose distance is greater than 500, departing from SFO
      sqldf("select count(*) as nb from baseflux where distance>500 and Origin='SFO'", dbname = "testingdb")
  • 11. RSQLite
      ## If your intention was to read the file into R immediately after
      ## reading it into the database, and you don't really need the database
      ## after that, then see:
      airlines <- read.csv.sql("airlines.csv", sql = "select * from file", eol = "\n")
      ######
      # NB: the package does not handle missing values.
      # Translate the empty fields to some number that will represent NA,
      # and then fix it up on the R end.
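The missing-value workaround mentioned on slide 11 can be sketched in base R. The example assumes the empty fields were replaced by a sentinel number before loading; the value -9999 and the small data frame standing in for a query result are arbitrary choices for illustration.

```r
# Stand-in for the result of a query on a table where missing values
# were encoded as a sentinel number before loading.
na_code <- -9999   # arbitrary sentinel; must not be a legitimate value in the data
flights <- data.frame(ArrDelay = c(12, na_code, -4, na_code, 30),
                      DepDelay = c(5, 0, na_code, 2, 25))

# Restore real NAs on the R side.
flights[flights == na_code] <- NA

mean_arr <- mean(flights$ArrDelay, na.rm = TRUE)
```

The sentinel has to lie outside the range of real values (delays can be negative, so a small negative number such as -1 would be a poor choice here).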
  • 12. Sampling is bad for...
      • Reporting: the boss wants to know the exact growth rate, not a statistical estimate...
      • Data management: you will not be able to look up this particular customer, who may not be in the sample
  • 13. Sampling is good for analysis
      Because:
      1. What matters is the order of magnitude, not the exact result.
      2. Sampling error is very small compared to model error, measurement error, estimation error, model noise, ...
      3. Sampling error depends on the size of the sample, not on the size of the whole dataset.
      4. Everything is a sample in the end.
      5. When sampling works very badly, your conclusions are not robust.
      6. Anyway, how will we deal with non-linear complexity, even in the cloud?
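The claim that sampling error depends on the sample size and not on the size of the whole dataset can be checked with a small simulation. The sketch below is an illustration under stated assumptions: two normally distributed "populations" of different sizes (the mean of 7 and standard deviation of 30 are arbitrary), and a hypothetical helper `sampling_sd` that measures how much the sample mean varies across repeated samples.

```r
set.seed(42)
# Two "populations" of very different sizes drawn from the same distribution.
pop_small <- rnorm(1e5, mean = 7, sd = 30)
pop_large <- rnorm(1e6, mean = 7, sd = 30)

# Standard deviation of the sample mean over repeated samples of size n.
# sampling_sd is a hypothetical helper written for this illustration.
sampling_sd <- function(pop, n, reps = 300) {
  sd(replicate(reps, mean(sample(pop, n))))
}

n <- 1000
se_small  <- sampling_sd(pop_small, n)
se_large  <- sampling_sd(pop_large, n)
se_theory <- 30 / sqrt(n)   # sigma / sqrt(n): the population size does not appear

round(c(small = se_small, large = se_large, theory = se_theory), 2)
```

Both empirical values should land close to sigma / sqrt(n), even though one population is ten times larger than the other.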
  • 14. data.sample
      Features of data.sample
      • It works on your laptop, whatever your RAM is; it just takes time
      • No need to install other Big Data software (RBD, NoSQL) on top of R
      • No need to rewrite all your code, just change one single line
      data.sample takes the same arguments as read.table: nothing to learn
      Simulations
      Model: $Y = 3X + 1\cdot\mathbf{1}_{\{G=A\}} + 2\cdot\mathbf{1}_{\{G=B\}} + 3\cdot\mathbf{1}_{\{G=C\}} + e$, with $X = 1, \ldots, N$, $G$ a discrete random variable, $e$ some noise
      Simulate 100 million observations: 2.3 GB
      Code
      dataset <- data.sample("simulations.csv", sep = ",", header = T) # takes 12 min on my laptop
      t <- lm(y ~ ., data = dataset)
      summary(t)
      Call: lm(formula = y ~ -1 + x + g, data = dataset)
      Coefficients:
           x      gA      gB      gC
      3.0000  0.9984  1.9996  2.9963
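The simulation model on slide 14 can be reproduced at a much smaller scale to see where the coefficients 3, 1, 2, 3 come from. The sketch below is not the author's simulation code: the sample size, the uniform choice of group and the standard normal noise are assumptions, since the slide does not specify them.

```r
set.seed(123)
n <- 30000                                   # far fewer than the 100 million on the slide
x <- seq_len(n)                              # X = 1, ..., N
g <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
e <- rnorm(n)                                # noise; standard normal is an assumption
# Group effect: 1 for A, 2 for B, 3 for C
y <- 3 * x + unname(c(A = 1, B = 2, C = 3)[as.character(g)]) + e

dataset <- data.frame(y = y, x = x, g = g)
fit <- lm(y ~ -1 + x + g, data = dataset)    # no intercept: one coefficient per group
round(coef(fit), 3)
```

Removing the intercept with `-1` is what makes lm report one coefficient per level of `g` (gA, gB, gC) instead of contrasts against a baseline level.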
  • 15. data.sample package
      install.packages("D:/U/Data.sample/data.sample_1.0.zip", repos = NULL)
      library(data.sample)
      system.time(resultsample <- data.sample(file = "airlines.csv", header = T, sep = ",")$df)
      # takes 52 minutes on my laptop if you don't know the number of records
      # this step is done only once!
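Drawing a uniform sample from a file whose number of records is unknown can be done in a single pass with reservoir sampling (Algorithm R). The sketch below shows that standard technique in base R; the slides do not say how data.sample works internally, so this should not be read as its implementation. `reservoir_sample` is a hypothetical helper, and the temporary file stands in for airlines.csv.

```r
# Reservoir sampling (Algorithm R): uniform sample of k lines in one pass,
# without knowing the number of lines in advance.
# reservoir_sample is a hypothetical helper, not part of data.sample.
reservoir_sample <- function(path, k, header = TRUE) {
  con <- file(path, open = "r")
  on.exit(close(con))
  head_line <- if (header) readLines(con, n = 1) else character(0)
  reservoir <- character(0)
  seen <- 0
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break               # end of file
    seen <- seen + 1
    if (seen <= k) {
      reservoir[seen] <- line                  # fill the reservoir first
    } else {
      j <- sample.int(seen, 1)                 # uniform in 1..seen
      if (j <= k) reservoir[j] <- line         # keep the new line with probability k/seen
    }
  }
  read.csv(text = c(head_line, reservoir), header = header)
}

# Small self-contained demonstration on a temporary file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, value = (1:1000) * 2), tmp, row.names = FALSE)
set.seed(7)
s <- reservoir_sample(tmp, k = 50)
```

Reading one line per call is slow in R and is kept here only for clarity; memory use is bounded by `k` lines regardless of the file size.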
  • 16. data.sample package
      # fit your linear model
      mymodelsample <- lm(ArrDelay ~ -1 + as.factor(DayOfWeek), data = resultsample)
      summary(mymodelsample)
                             Estimate  Std. Error  t value  Pr(>|t|)
      as.factor(DayOfWeek)1   6.58383     0.08041    81.88    <2e-16 ***
      as.factor(DayOfWeek)2   6.04881     0.08054    75.10    <2e-16 ***
      as.factor(DayOfWeek)3   6.80039     0.08037    84.61    <2e-16 ***
      as.factor(DayOfWeek)4   8.96406     0.08045   111.42    <2e-16 ***
      as.factor(DayOfWeek)5   9.45303     0.08015   117.94    <2e-16 ***
      as.factor(DayOfWeek)6   4.15234     0.08535    48.65    <2e-16 ***
      as.factor(DayOfWeek)7   6.40236     0.08222    77.87    <2e-16 ***
  • 18. Conclusion
      Strategy    | SQL-like | Data mining       | Beyond the RAM | Pros                                           | Cons
      Cloud       | OK       | OK                | OK             | No rewrite, cheap                              | Cloud configuration
      ff, biglm   | OK       | KO but regression | OK             | Not limited to RAM                             | Rewrite, very limited for data mining
      RSQLite     | OK       | KO                | OK             | Not limited to RAM                             | Rewrite, no data mining
      data.sample | OK       | OK                | OK             | No rewrite, fast coding, can use all libraries | No reporting, lack of theoretical results
      data.table  | OK       | KO                | KO             | Fast (index)                                   | Limited to RAM, no data mining