RHadoop is an effective platform for exploratory data analysis over big data sets. The convenience of an interactive command-line interpreter and the sheer number of statistical and machine learning routines implemented in R libraries make it a highly effective environment for elementary data science.
We'll cover the basics of RHadoop: what it is, how to install it, and the API fundamentals. Next we'll discuss common use cases for RHadoop. Last, we'll run through an interactive example.
2. Would You Like to…
• Predict X?
  – The outcome of a future event
  – Who is likely to do something
  – Genetic factors leading to disease
• Pre-filter things so humans can accomplish more?
• Do all of this faster and better?
This document is company confidential and is intended solely for the use and information of Booz Allen Hamilton
3. Why R and Hadoop?
• R is a fantastic platform for data science
  – Has a peer-reviewed community and journal that vets libraries
  – (Mostly) intuitive language
• Hadoop is the de-facto platform for parallel processing
• Today, we'll be talking about rmr, but there are two more packages: rhbase and rhdfs
4. Nothing Has Changed. Everything Has Changed.
• Some of the most effective techniques for data mining are relatively old
  – Modern SVM dates back to '92
  – Logistic regression dates back to '44
  – Important elements of the algorithms date back to Newton
• Accessibility and relevance have changed
  – Accessibility to data
  – Accessibility of computational power
  – Necessity of methods
5. Some Criticisms of R & RHadoop
• R docs are written in their own language (using data frames, etc.) that is unfamiliar to computer scientists
• R and CRAN documentation are more like old-school GNU than most Apache projects
  – Get used to Googling and using R's help() function
• R's data management facilities are inconsistent
• The streaming API isn't super fast
• (get over it)
6. Comparison to Other R Parallelism Frameworks
• SNOW/SNOWFALL
  – Operates over MPI, sockets, or PVM
  – No tie-in to a DFS (bad for data-intensive computing)
  – Handles matrix multiplication well (perhaps better)
  – Doesn't handle other non-trivial IPC well (basically for parallel linear algebra and simulations)
• Rmpi
  – More code
  – All synchronization constructs are user-built (just like MPI)
7. Comparison to Other R Parallelism Frameworks
• Others…
  – Only other Hadoop libraries integrate with HDFS and are appropriate for data-intensive computing
  – Only RHadoop supports both local and cluster-based backends and has an intuitive interface that duplicates closures in the remote environment
  – Most environments are targeted towards modeling and simulation
8. Installation – Local Workstation
• Install R
  – MacPorts – sudo port install r-framework
  – Ubuntu – sudo apt-get install r-base
  – RHEL – sudo yum install R
• Install R dependencies (inside R)
  – install.packages(c("Rcpp", "RJSONIO", "itertools", "digest"), repos="http://watson.nci.nih.gov/cran_mirror/")
• Install RMR
  – curl http://cloud.github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz > rmr.tar.gz
  – install.packages("rmr.tar.gz", repos=NULL, type="source")  # from inside R, in the same directory
• Configure the local backend each time you run R
  – rmr.options.set(backend="local")
9. Installation – Cluster
• Install R and all packages you plan on using (rmr, e1071, topicmodels, tm, etc.) on each node.
• Use a compatible version of Hadoop 1 (1.0.3+ or CDH3+). Hadoop 2 may or may not work.
• The example on the previous slide installs R packages in your home directory; you probably want to install them to the root install.
• Configure environment variables:
  export HADOOP_CMD=/usr/bin/hadoop
  export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
10. The Curse of Dimensionality
[Chart: volume of the unit ball vs. dimensionality]
• The volume of the unit sphere tends towards 0 as the dimensionality of hyperspace increases
• Intuitively this means that there is more "slop room" for your dividing hyperplane to fall into
• The amount of data we need to train a model rises with the feature space, tending towards infinity, making the problem untenable
• With a small feature space, there is no need for lots of data
• Thus, there is little point in using Hadoop to implement many classic machine learning models
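The vanishing-volume claim is easy to check numerically. A minimal sketch in base R, using the standard closed form V(d) = π^(d/2) / Γ(d/2 + 1) for the volume of the d-dimensional unit ball (the formula is textbook math, not from this deck):

```r
# Volume of the d-dimensional unit ball: V(d) = pi^(d/2) / gamma(d/2 + 1)
unit_ball_volume <- function(d) pi^(d / 2) / gamma(d / 2 + 1)

dims <- c(1, 2, 3, 5, 10, 20, 50)
round(unit_ball_volume(dims), 8)
# V(d) peaks around d = 5 and then collapses towards zero
```

Plotting these values against d reproduces the "volume vs. dimensionality" chart on the slide.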
11. The Hadoop Data Science Flow
• Join
• Sample
• Model
• Repeat
12. Join
• Put two pieces of data together using a common key
• Scenario:
  – Data is in two flat files in HDFS
  – Turn rows into key-value pairs, where the key is the join key and the value is the rest of the row
13. Sample
• Take a sample of your (maybe) joined data
• The most common method is probabilistic
• Numerous other techniques can leverage partitions and the randomness of the key hash
• Scenarios (a precursor for):
  – Supervised learning/classification
  – Unsupervised learning/clustering
  – Regression
  – Distribution modeling
14. Model
• Supervised learning: I want to predict something and I already know (some of) the answers. Also called classification (or, with two classes, binary classification)
• Unsupervised learning: I want to find natural groupings in the data that I might not have known about
• Regression, probability modeling – I want to fit a curve to my data
15. Repeat
• Gain insight about the data
• Change your procedure (select only outliers, etc.)
• Gain more insight
16. RHadoop Impact: Join, Sample
• Work totally in R
• Execute large, complex joins such as cross joins
17. RHadoop Impact: Model
• Most algorithms work perfectly well (or better) over a sample of the data
• Train and cross-validate a large number of models in parallel
• Perform model selection in the reduce phase
18. RHadoop API

mapreduce(input,
          output = NULL,
          map = to.map(identity),
          reduce = NULL,
          combine = NULL,
          reduce.on.data.frame = FALSE,
          input.format = "native",
          output.format = "native",
          vectorized = list(map = FALSE, reduce = FALSE),
          structured = list(map = FALSE, reduce = FALSE),
          backend.parameters = list(),
          verbose = TRUE)
19. RHadoop API

rmr.options.set(backend = c("hadoop", "local"),
                profile.nodes = NULL, vectorized.nrows = NULL)

to.dfs(object, output = dfs.tempfile(), format = "native")

from.dfs(input, format = "native",
         to.data.frame = FALSE, vectorized = FALSE,
         structured = FALSE)
20. Doing Things the R Way
• Objects
  – my_car = list(color="green", model="volt")
• Transforming a vector (list), iterating
  – lapply/sapply/tapply – functional programming constructs
• Loops (not preferred)
  – for (i in 1:100) {…}
  – Note this is the same as lapply(1:100, function(i){…})
• Other control structures – basically as you would expect
21. Vectors in R
• R helps you! O_o
• Every object has a mode and a length, and hence can be interpreted as some sort of vector – even primitives!
• Even primitives such as strings or integers are stored in a vector of length 1, never free-standing
• There are lots of types of vectors
  – Lists (think linked list)
  – Atomic vectors (think array)
  – See http://cran.r-project.org/doc/manuals/R-intro.html#The-intrinsic-attributes-mode-and-length
• Type coercion usually works the way you would expect
  – But… you may find yourself using as.list() or as.vector() or doing manual coercion frequently, depending on what libraries you're using, due to modes not matching
22. Example – Fake Data

fakedata = data.frame(
  x = c(rnorm(100) * .25, rep(.75, 100) + rnorm(100) * .25),
  y = c(rnorm(100), rep(1, 100) + rnorm(100)),
  z = c(rep(0, 100), rep(1, 100)))

plot(fakedata[, "x"], fakedata[, "y"],
     col = sapply(fakedata[, "z"], function(z) ifelse(z > 0, "blue", "green")))
23. Examples – Simple Parallelism

rmr.options.set(backend = "local")

ints = to.dfs(1:100)

squares = mapreduce(ints, map = function(k, v) keyval(NULL, v^2))

from.dfs(squares)

# notice the result will be keyvals
24. Examples – Trying Lots of SVM Kernels

library(e1071)  # provides svm()

kernels = to.dfs(list("linear", "polynomial", "radial", "sigmoid"))

models = from.dfs(mapreduce(kernels,
  map = function(nothing, kern)
    keyval(NULL, svm(factor(z) ~ ., fakedata, kernel = kern))))

plot(models[[1]][["val"]], fakedata)
25. Examples – Different Models

calls = to.dfs(list(
  list("glm", z ~ ., family = binomial("logit"), fakedata),
  list("svm", z ~ ., fakedata)))

models = from.dfs(mapreduce(calls,
  map = function(nothing, callsig)
    keyval(NULL, do.call(callsig[[1]], callsig[2:length(callsig)]))))

models[[1]][["val"]]