Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013

Workshop – Hadoop + R

Carlos Gil Bellosta

Big Data Analytics
R & Hadoop

Carlos J. Gil Bellosta
cgb@datanalytics.com

November 2013

Table of Contents

1 Intro to Hadoop & R

All about Hadoop
Hadoop FS
Hadoop & mapreduce

All about R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks

File system: manages all about files

• Examples: diskettes, hard disks, RAIDs,... magnetic tapes!
• Combination of hardware and software to hide boring activities from users:
  • Find space to write the files
  • Read/write files
  • Manage fragmentation
  • Etc.

• How many devices per FS?
• 1-to-1: diskettes, CD-ROMs, HDDs,...
• n-to-1: partitioned HDDs,...
• 1-to-n: RAIDs, Hadoop

Hadoop goodies (as a FS)

• Splits (large) files into chunks spread among machines
• Replicates chunks (default, 3)
• Balances data
• Robust to hardware failures
• It is rack aware

Obviously, it requires some system to keep track of:
• Which servers/racks are up/down
• Where each chunk is located
• ...

How to work with data in Hadoop?

• Provides a shell (ls, cp, etc.)
• You can put/get data between your local FS and the Hadoop FS
• That is:
  • You can dump your data to your local machine
  • You can run your programs on your local machine
  • You can put results back into Hadoop
• But what if the file is too large?

Solution

Rather than bringing the data to the code, why not move the code to the data?

One of the ways to move code to data is known as mapreduce.

Mapreduce

• Two-step process:
  • Map: run your code on the chunks, wherever in the cluster they are stored
  • Reduce: reshape the output into the desired format
• Hadoop manages issues such as:
  • System failures
  • Threads that do not return
  • And all (?) that made life miserable for users of OpenMP, MPI, etc.
• Slotted approach: mapreduce provides slots where you put the mapper/reducer code
• The code is for you to provide!
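A minimal sketch of the two steps, run locally in base R rather than on Hadoop. The data frame `ages`, the four-way split into `chunks` and the helper name `mapper` are assumptions made for illustration; on a cluster the chunks would be file blocks living on different machines.

```r
# Toy data, cut into 4 pieces that stand in for chunks of a large file.
set.seed(1)
ages <- data.frame(sex = sample(c("F", "M"), 1000, replace = TRUE),
                   age = round(runif(1000, 18, 90)))
chunks <- split(ages, rep(1:4, length.out = nrow(ages)))

# Map: the same code runs on every chunk and emits, per sex (the key),
# a partial sum of ages and a partial count.
mapper <- function(chunk) rowsum(cbind(sum = chunk$age, n = 1), chunk$sex)
mapped <- lapply(chunks, mapper)

# Reduce: partial results with the same key are added up and reshaped
# into the desired output, here the mean age by sex.
partial <- do.call(rbind, unname(mapped))
totals  <- rowsum(partial, rownames(partial))
means   <- totals[, "sum"] / totals[, "n"]
print(means)
```

Only the small per-chunk summaries travel to the reducer; the raw rows never leave their chunk.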

What is R?

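• R is a
  • software package?
  • programming language?
  • environment?
  for data analysis and graphics.
• R users are (should be?) used to the mapreduce approach:

  library(plyr)  # provides ddply() and summarize()
  # Split dfx by (group, sex), summarize age within each piece, combine the results
  ddply(dfx, .(group, sex), summarize,
        mean = mean(age),
        sd   = sd(age))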

Table of Contents

1 Intro to Hadoop & R
2 Counting (& Graphics)
Graphics & big data
Let’s count... hexagons
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks

Visualizing a million

Fluctuation plot

Table plot

Let’s count... hexagons

• Non-trivial counting exercise (no, we are not counting words today!)
• Good visualization features for big datasets
• Fits in mapreduce framework:
• Map: Assigns points to hexagons
• Reduce: aggregates counts on hexagons
• The output is small and can be plotted locally
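The map and reduce steps above can be sketched in base R, run locally instead of on Hadoop. Everything here is an illustrative assumption: the helper `hex_center()`, the hexagon width `dx`, the simulated points and the five-way split into chunks. A point is assigned to the hexagon whose centre is nearest, found by comparing the closest node of two interleaved rectangular lattices.

```r
# Centre of the hexagon containing each point (hexagons of width dx).
hex_center <- function(x, y, dx = 1) {
  dy <- sqrt(3) * dx
  ax <- round(x / dx) * dx            # nearest node of lattice A
  ay <- round(y / dy) * dy
  bx <- (floor(x / dx) + 0.5) * dx    # nearest node of lattice B (shifted half a cell)
  by <- (floor(y / dy) + 0.5) * dy
  use_a <- (x - ax)^2 + (y - ay)^2 <= (x - bx)^2 + (y - by)^2
  data.frame(cx = ifelse(use_a, ax, bx), cy = ifelse(use_a, ay, by))
}

set.seed(42)
pts    <- data.frame(x = rnorm(10000), y = rnorm(10000))
chunks <- split(pts, rep(1:5, length.out = nrow(pts)))

# Map: assign the points of one chunk to hexagons, emit (hexagon, partial count).
mapper <- function(chunk) {
  h <- hex_center(chunk$x, chunk$y, dx = 0.25)
  aggregate(n ~ cx + cy, data = cbind(h, n = 1), FUN = sum)
}
mapped <- do.call(rbind, lapply(chunks, mapper))

# Reduce: add up the partial counts of each hexagon.
counts <- aggregate(n ~ cx + cy, data = mapped, FUN = sum)
head(counts[order(-counts$n), ])
```

`counts` has one row per non-empty hexagon, however many points went in, so it can be brought to a local R session and plotted there.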

Table of Contents

1 Intro to Hadoop & R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks

What you see: input/output, map, reduce

• input:
  • Type: text, csv, R object,...
  • Options: separator,...
• output: similar to input
• map & reduce:
  • Functions with (k,v) argument (k, key; v, value)
  • They return a (k,v) list
  • Thus, mapreduces can be chained together (the output of the first one is the input for the second)
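To make the chaining concrete, here is a toy, single-machine imitation in base R. The helpers `kv()` and `local_mapreduce()` are hypothetical stand-ins written for this sketch, not functions of any Hadoop package; a (k,v) list is represented simply as a data frame with columns `k` and `v`.

```r
# A (k,v) list as a two-column data frame.
kv <- function(k, v) data.frame(k = k, v = v, stringsAsFactors = FALSE)

# Hypothetical local stand-in for one mapreduce job: apply `map` to the input,
# group the mapped values by key, apply `reduce` to every group.
local_mapreduce <- function(input, map, reduce) {
  mapped  <- map(input$k, input$v)
  groups  <- split(mapped$v, mapped$k)
  reduced <- Map(reduce, names(groups), groups)
  do.call(rbind, unname(reduced))
}

words <- c("map", "reduce", "map", "hadoop", "reduce", "map", "r")
input <- kv(seq_along(words), words)

# Job 1: how many times does each word occur?
job1 <- local_mapreduce(input,
  map    = function(k, v) kv(v, 1),
  reduce = function(k, vs) kv(k, sum(vs)))

# Job 2 takes the output of job 1 as its input:
# how many words occur once, twice, three times...?
job2 <- local_mapreduce(job1,
  map    = function(k, v) kv(v, 1),
  reduce = function(k, vs) kv(k, length(vs)))

print(job1)
print(job2)
```

Because both the mapper and the reducer consume and return the same (k,v) structure, `job1` can be passed to the second call without any conversion.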

What you don’t see

$HADOOP jar $HADOOP_STREAMING -D stream.map.input=typedbytes
-D stream.map.output=typedbytes
-D stream.reduce.input=typedbytes
-D stream.reduce.output=typedbytes
-D mapred.reduce.tasks=0
-input /tmp/RtmpUUrNMj/file68c0185e60c
-output /tmp/RtmpUUrNMj/file68c04c25d5f0
-mapper "Rscript rmr-streaming-map68c018acf680 "
-file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a
-file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080
-file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680
-inputformat org.apache.hadoop.streaming.AutoInputFormat
-outputformat org.apache.hadoop.mapred.SequenceFileOutputForm

Scoring

• External consultants build a model (using R and small data)
• Models in R should have a predict method
• You can then score your huge database (in batch)
• No need to rewrite the model into your systems!
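A sketch of batch scoring as a map-only job, imitated locally in base R. The simulated data, the `glm` model and the ten-way split are assumptions for illustration; any model object with a `predict` method could take the place of `model`.

```r
# The "consultants' model": fitted once on a small dataset.
set.seed(123)
small   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
small$y <- rbinom(200, 1, plogis(-0.5 + small$x1 - 2 * small$x2))
model   <- glm(y ~ x1 + x2, data = small, family = binomial)

# The "huge database": here just a data frame cut into chunks.
big    <- data.frame(id = 1:10000, x1 = rnorm(10000), x2 = rnorm(10000))
chunks <- split(big, rep(1:10, length.out = nrow(big)))

# Map: every mapper receives the fitted model and scores its own chunk.
# No reduce step is needed: the mapped output is already the result.
score_chunk <- function(chunk) {
  data.frame(id    = chunk$id,
             score = as.vector(predict(model, newdata = chunk, type = "response")))
}
scores <- do.call(rbind, lapply(chunks, score_chunk))
head(scores)
```

The model object is small, so shipping it to the data is cheap compared with moving the data to the model.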

The case for sampling

• Sampling works!
• Sampled datasets can be used to build small data models
• You can use R (& mapreduce) to sample data, but you had better not

Running simulations on Hadoop

• Some (many?) people say it is not the right tool
  • Mapreduce needs input data, but simulations often do not
  • You want to control the number of mappers (which run your simulations)
• Still, mapreduce is nice for simulations...
• ... so let an old dog try its dirty trick!
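The slides do not spell the trick out. One common workaround, shown here purely as an assumption and imitated locally in base R, is to fabricate a tiny input with one record per simulation task: the record carries nothing but a random seed, and the number of records controls how many simulations are run. The Monte Carlo estimate of pi is only a placeholder for a real simulation.

```r
# Dummy input: one record per simulation task; the value is just a seed.
n_tasks <- 8
input   <- data.frame(task = seq_len(n_tasks), seed = 1000 + seq_len(n_tasks))

# Mapper: reads no real data, runs one simulation determined by its seed.
simulate <- function(seed, n = 1e5) {
  set.seed(seed)
  inside <- sum(runif(n)^2 + runif(n)^2 <= 1)  # points falling in the quarter circle
  c(inside = inside, n = n)
}
mapped <- lapply(input$seed, simulate)

# Reducer: pool the partial results of all tasks.
totals <- Reduce(`+`, mapped)
pi_hat <- 4 * totals[["inside"]] / totals[["n"]]
print(pi_hat)
```

Giving each task its own seed keeps the simulations reproducible and avoids every mapper generating the same random stream.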

Table of Contents

1 Intro to Hadoop & R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
Linear Regression
Logistic Regression
Trees & Random Forests
6 Final remarks

Linear regression can be parallelized

Simple linear regression: y ∼ α + βx

\[
\hat{\beta}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{j=1}^{n} y_j}
         {\sum_{i=1}^{n} (x_i^2) - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}
\]

Operations are case by case!
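Every sum in the last expression adds one term per observation, so each chunk can compute its own partial sums and the reducer only has to add them. A local base R sketch follows; the simulated data, the ten-way split and the names `sx`, `sy`, `sxy`, `sxx` are assumptions for illustration.

```r
set.seed(7)
n <- 5000
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)
chunks <- split(data.frame(x, y), rep(1:10, length.out = n))

# Map: each chunk emits five numbers, whatever its size.
stats <- lapply(chunks, function(d)
  c(n = nrow(d), sx = sum(d$x), sy = sum(d$y),
    sxy = sum(d$x * d$y), sxx = sum(d$x^2)))

# Reduce: add the partial sums and plug them into the formula.
s     <- Reduce(`+`, stats)
beta  <- (s[["sxy"]] - s[["sx"]] * s[["sy"]] / s[["n"]]) /
         (s[["sxx"]] - s[["sx"]]^2 / s[["n"]])
alpha <- s[["sy"]] / s[["n"]] - beta * s[["sx"]] / s[["n"]]
print(c(alpha = alpha, beta = beta))
```

The estimates agree with `coef(lm(y ~ x))` on the full data up to floating point error.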

Multiple linear regression

• Based on \(X^\top X\) and \(X^\top y\):

\[
\hat{\beta} = (X^\top X)^{-1} X^\top y
\]

• If \(X = [X_1 | \dots | X_n]\) (by blocks of rows), then

\[
X^\top X = \sum_i X_i^\top X_i .
\]
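The block identity can be checked with a short local sketch in base R. The simulated design matrix, the eight row blocks and the variable names are assumptions for illustration; the same block decomposition is applied to the cross-product of X and y.

```r
set.seed(11)
n <- 2000
X <- cbind(1, matrix(rnorm(n * 3), n, 3))       # intercept plus 3 predictors
y <- drop(X %*% c(1, -2, 0.5, 3) + rnorm(n))
blocks <- split(seq_len(n), rep(1:8, length.out = n))

# Map: each block of rows emits its own small cross-products (4 x 4 and 4 x 1).
parts <- lapply(blocks, function(i) {
  Xi <- X[i, , drop = FALSE]
  list(XtX = crossprod(Xi), Xty = crossprod(Xi, y[i]))
})

# Reduce: add the per-block matrices, then solve the normal equations once.
XtX  <- Reduce(`+`, lapply(parts, `[[`, "XtX"))
Xty  <- Reduce(`+`, lapply(parts, `[[`, "Xty"))
beta <- drop(solve(XtX, Xty))
print(beta)
```

The size of what each mapper emits depends only on the number of columns of X, not on the number of rows in its block.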

Can logistic regression be parallelized? Yes and no.

• Fitting logistic regression models is iterative and iterations are not parallelizable.
• However, each iteration can be parallelized (these are not unlike fitting linear models as before)
• We will explore two big data alternatives:
  • Parallelize iterations using mapreduce (see http://goo.gl/ftx36r)
  • Split your data meaningfully and do standard logistic regression in the nodes
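A sketch of the first alternative in base R, run locally. It uses plain Newton-Raphson steps and is not necessarily the method described at the link above; the simulated data, the eight row blocks and the stopping rule are assumptions for illustration. The loop over iterations is sequential, but inside each iteration the gradient and the Hessian are sums over blocks, just like the cross-products of linear regression.

```r
set.seed(3)
n <- 4000
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-1, 2, 0.5))))
blocks <- split(seq_len(n), rep(1:8, length.out = n))

beta <- rep(0, ncol(X))
for (iter in 1:25) {                      # iterations run one after another...
  parts <- lapply(blocks, function(i) {   # ...but each one is a map over blocks
    Xi <- X[i, , drop = FALSE]
    p  <- plogis(drop(Xi %*% beta))
    list(grad = crossprod(Xi, y[i] - p),              # block share of the gradient
         hess = crossprod(Xi * (p * (1 - p)), Xi))    # block share of X'WX
  })
  # Reduce: add the block contributions and take one Newton step.
  grad <- Reduce(`+`, lapply(parts, `[[`, "grad"))
  hess <- Reduce(`+`, lapply(parts, `[[`, "hess"))
  step <- drop(solve(hess, grad))
  beta <- beta + step
  if (max(abs(step)) < 1e-10) break
}
print(beta)
```

Each iteration would be a full mapreduce job over the data, which is why the number of iterations matters much more here than on a single machine.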

How many bytes make knowledge?
(aka the fractal nature of big data)

Split logistic regression

Viable alternatives to logistic models

• Trees
  • High interpretability
  • But unstable and tend to miss details
• Random forests
  • Black boxes
  • Superb performance
  • These are collections of trees that can be built in parallel
• Both can be parallelized in different ways:
  • Similar to partitioned logistic models above
  • Within training
Forget most of what you learned today, seriously

• People strive to extend small data models to big data (as we did today)...
• ... but is it the way to go?
• Beware of microlocal structure
  • Small data people know microlocal structure as outliers
  • Global models (linear, logistic,...) cannot (easily?) exploit microlocal structure
  • But the promises of big data lie precisely there
  • (Otherwise, just sample and you will be fine)
• Areas to watch for insights on big data modelling:
  • SNA (social network analysis)
  • Text analysis

Thank you very much and...

... questions?
