RHadoop is an effective platform for doing exploratory data analysis over big data sets. The convenience of an interactive command-line interpreter and the wealth of statistical and machine learning routines implemented in R libraries make it a highly effective environment for elementary data science.
We'll discuss the basics of RHadoop: what it is, how to install it, and the API fundamentals. Next we'll discuss common use cases for RHadoop. Finally, we'll run through an interactive example.
When people speak of big data analysis, what comes to mind is probably HDFS and MapReduce within Hadoop. But to write a MapReduce program, one must learn to write native Java. One might wonder: is it possible to use R, one of the most popular languages among data scientists, to implement MapReduce programs? And through the integration of R and Hadoop, can one truly unleash the power of parallel computing for big data analysis?
These slides introduce how to install RHadoop step by step and how to write a MapReduce program in R. More importantly, they discuss whether RHadoop is really a guiding light for big data analysis, or just another way to write MapReduce programs.
Please email me if you find any problems with these slides. EMAIL: tr.ywchiu@gmail.com
Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
R and Hadoop are changing the way organizations manage and utilize big data. Think Big Analytics and Revolution Analytics are helping clients plan, build, test and implement innovative solutions based on the two technologies that allow clients to analyze data in new ways; exposing new insights for the business. Join us as Jeffrey Breen explains the core technology concepts and illustrates how to utilize R and Revolution Analytics’ RevoR in Hadoop environments.
This presentation demonstrates some of the resources available in a Hadoop cluster, as well as the main components of the ecosystem used at Magazine Luiza. It also includes a comparison with major market players that use this technology.
Hadoop Installation, Configuration, and MapReduce Program - Praveen Kumar Donta
This presentation contains a brief description of big data, along with Hadoop installation, configuration, and a MapReduce word-count program with its explanation.
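The word count walked through in such decks is typically written for Hadoop in Java. As a hedged, language-neutral sketch of the same map/shuffle/reduce steps, here is a self-contained Python simulation (the function names and sample lines are invented for illustration, not taken from the presentation):

```python
# A minimal word count in the MapReduce style the slides describe.
# The same mapper/reducer pair of phases is what Hadoop runs at scale;
# here the "shuffle" is just an in-process sort-and-group.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Sum the 1s collected for a single word.
    return (word, sum(counts))

def run_wordcount(lines):
    # Shuffle step: sort the mapper output by key, then group by word.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (c for _, c in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

counts = run_wordcount(["big data big analysis", "big insight"])
print(counts)  # {'analysis': 1, 'big': 3, 'data': 1, 'insight': 1}
```

In a real Hadoop job the mapper and reducer run as separate distributed tasks and the framework performs the sort/group between them; the control flow above only mirrors that contract.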
Apache Drill is the next generation of SQL query engines. It builds on ANSI SQL 2003 and extends it to handle new formats like JSON, Parquet, and ORC, as well as the usual CSV, TSV, XML, and other Hadoop formats. Most importantly, it melts away the barriers that have caused databases to become silos of data. It does so by handling schema changes on the fly, enabling a whole new world of self-service and data agility never seen before.
There's a big shift at both the architecture and API level from Hadoop 1 to Hadoop 2, particularly YARN, and we held our first meetup to talk about this (http://www.meetup.com/Atlanta-YARN-User-Group/) on 10/13/2013.
Hadoop institutes: Kelly Technologies is a Hadoop training institute in Hyderabad, providing Hadoop training by real-time faculty.
These slides cover the very basics of Hadoop architecture, in particular HDFS. This was my presentation at the first Delhi Hadoop User Group (DHUG) meetup, held in Gurgaon on 10th September 2011. Loved the positive feedback. I'll also upload a more elaborate version covering the Hadoop MapReduce architecture soon. Most of the material in these slides can be found in Tom White's book as well (see the last slide).
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera, Inc.
Attend this session and walk away armed with solutions to the most common customer problems. Learn proactive configuration tweaks and best practices to keep your cluster free of fetch failures, job tracker hangs, and the like.
The Yahoo! Hadoop grid makes use of a managed service to pull data into the clusters. However, when it comes to getting data out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts away file locations and the underlying storage format of data for users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplishes the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about files, partitions, and their locations. We will also demo the data-out capabilities and go through other nice properties of the data-out feature.
Presenter(s):
Sumeet Singh, Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
OBIEE Answers Vs Data Visualization: A Cage Match - Michelle Kolbe
With Oracle's new tool in 12c called Data Visualization, when do you use Answers and when do you use Data Visualization? This presentation included a live demo of the two tools. The slides walk step by step through this demo. You can follow along yourself using the Sample App data.
See my blog post walking through these slides here: https://medium.com/@datacheesehead/the-cage-match-between-obiee-answers-and-data-visualization-73496bbf4dfe#.thiuznp0z
Overview of accessing relational databases from R. Focuses on and demonstrates the DBI family (RMySQL, RPostgreSQL, ROracle, RJDBC, etc.) but also introduces RODBC. Highlights DBI's dbApply() function to combine the strengths of SQL and *apply() on large data sets. Demonstrates the sqldf package, which provides SQL access to standard R data.frames.
Presented at the May 2011 meeting of the Greater Boston useR Group.
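sqldf is an R package; purely as a cross-language analogy, Python's standard library supports the same idea of querying in-memory data with plain SQL via SQLite. A minimal sketch, assuming made-up fare rows (none of this is the talk's own code):

```python
# sqldf-style idea: load ordinary in-memory rows into a throwaway
# SQLite database and let SQL do the grouping and aggregation.
import sqlite3

def avg_fare_by_dest(rows):
    # Load the rows into an in-memory SQLite table...
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE fares (origin TEXT, dest TEXT, fare REAL)")
    con.executemany("INSERT INTO fares VALUES (?, ?, ?)", rows)
    # ...then aggregate in SQL instead of looping in the host language.
    result = con.execute(
        "SELECT dest, AVG(fare) FROM fares GROUP BY dest ORDER BY dest"
    ).fetchall()
    con.close()
    return result

rows = [("BOS", "SFO", 350.0), ("BOS", "ORD", 180.0), ("BOS", "SFO", 410.0)]
print(avg_fare_by_dest(rows))  # [('ORD', 180.0), ('SFO', 380.0)]
```

The appeal in both languages is the same: one declarative query replaces hand-written grouping code, while the data never leaves the host process.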
Adoption of the R language has grown rapidly in the last few years, and is ranked as the number-one data science language in several surveys. This accelerating R adoption curve has been driven by the Big Data revolution, and the fact that so many data scientists — having learned R at university — are actively unlocking the secrets hidden in these new, vast data troves. In more than 6 years of writing for the Revolutions blog, I’ve discovered hundreds of applications of R in business, in government, and in the non-profit sector. Sometimes the use of R is obvious, and sometimes it takes a little bit of detective work to learn how R is operating behind the scenes. In this talk, I'll recount some of my favourite applications of R, and show how R is behind some amazing innovations in today’s world.
HP Distributed R is a high-performance, scalable platform for the R language. It enables R to leverage multiple cores and multiple servers to perform Big Data advanced analytics. It consists of new R language constructs to easily parallelize algorithms across multiple R processes.
HP Distributed R simplifies large-scale analysis by extending R. Because R is a single-threaded environment, it has limited utility for Big Data analytics. HP Distributed R allows you to specify that parts of programs be run in multiple single-threaded R processes. This approach results in significantly reduced execution times for Big Data analysis.
Slides from my lightning talk at the Boston Predictive Analytics Meetup hosted at Predictive Analytics World, Boston, October 1, 2012.
Full code and data are available on github: http://bit.ly/pawdata
Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
Data profiling deserves a fresh look for three reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, current data profiling techniques hardly scale beyond what can only be called small data. Third, more and more data beyond traditional relational databases are being created and beg to be profiled. The talk proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.
Speaker: Felix Naumann studied mathematics, economy, and computer sciences at the University of Technology in Berlin. After receiving his diploma (MA) in 1997 he joined the graduate school "Distributed Information Systems" at Humboldt University of Berlin. He completed his PhD thesis on "Quality-driven Query Answering" in 2000. In 2001 and 2002 he worked at the IBM Almaden Research Center on topics around data integration. From 2003 - 2006 he was assistant professor for information integration at the Humboldt-University of Berlin. Since then he holds the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany.
(Presented by Antonio Piccolboni to Strata 2012 Conference, Feb 29 2012).
RHadoop is an open source project spearheaded by Revolution Analytics to grant data scientists access to Hadoop's scalability from their favorite language, R. RHadoop is comprised of three packages:
- rhdfs provides file-level manipulation for HDFS, the Hadoop file system
- rhbase provides access to HBase, the Hadoop database
- rmr allows writing MapReduce programs in R
rmr allows R developers to program in the MapReduce framework, and offers all developers an alternative way to implement MapReduce programs that strikes a delicate compromise between power and usability. It lets you write general MapReduce programs with the full power and ecosystem of an existing, established programming language. It doesn't force you to replace the R interpreter with a special run-time; it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It is comprised of a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it enables, and we will do that, covering a few from machine learning and statistics. Finally, we will discuss how to get involved.
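rmr's actual API is in R and is not reproduced here. The following Python sketch only mimics the shape of the key-value map/reduce contract the abstract describes: map emits key-value pairs, the framework groups them by key, and reduce folds each group (the mapreduce helper and the fare data are invented for illustration):

```python
# Shape of the rmr-style contract: a map function emits (key, value)
# pairs, the framework groups values by key, and a reduce function
# summarizes each group. This in-process version has none of Hadoop's
# distribution; it only models the programming interface.
from collections import defaultdict

def mapreduce(records, map_fn, reduce_fn):
    # Group the mapper's (key, value) output by key...
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # ...then hand each key and its values to the reducer.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Per-carrier mean fare, in the spirit of common RHadoop demos
# (carrier codes and fares here are made up).
fares = [("AA", 210.0), ("AA", 190.0), ("UA", 305.0)]

result = mapreduce(
    fares,
    map_fn=lambda rec: [(rec[0], rec[1])],
    reduce_fn=lambda key, vals: sum(vals) / len(vals),
)
print(result)  # {'AA': 200.0, 'UA': 305.0}
```

The "just a library" point in the abstract is visible even in this toy: the user supplies two ordinary functions in the host language, and everything else is plumbing.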
Overview of a few ways to group and summarize data in R, using sample airfare data from DOT/BTS's O&D Survey.
Starts with a naive approach using subset() and loops, shows base R's tapply() and aggregate(), and highlights the doBy and plyr packages.
Presented at the March 2011 meeting of the Greater Boston useR Group.
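The talk's naive-versus-grouped progression is R-specific (subset() in a loop versus tapply()/aggregate()). As a rough cross-language analogy only, the same contrast can be sketched in Python (the fare data is invented):

```python
# Naive versus one-pass grouping, mirroring the talk's progression.
from collections import defaultdict

fares = [("SFO", 350.0), ("ORD", 180.0), ("SFO", 410.0)]

# Naive: one full scan of the data per distinct key, which is what
# calling subset() inside a loop over keys does in R.
naive = {}
for key in {k for k, _ in fares}:
    vals = [f for k, f in fares if k == key]
    naive[key] = sum(vals) / len(vals)

# One pass: accumulate values per key, then summarize each group,
# in the spirit of tapply()/aggregate().
groups = defaultdict(list)
for key, fare in fares:
    groups[key].append(fare)
one_pass = {k: sum(v) / len(v) for k, v in groups.items()}

assert naive == one_pass == {"SFO": 380.0, "ORD": 180.0}
```

Both produce the same summary; the one-pass version avoids rescanning the data once per group, which is the practical point behind moving from loops to grouped aggregation.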
Step-1 Tableau Introduction
Step-2 Connecting to Data
Step-3 Building basic views
Step-4 Data manipulations and Calculated fields
Step-5 Tableau Dashboards
Step-6 Advanced Data Options
Step-7 Advanced graph Options
Different application domains, including sensor networks, social networks, science, financial services, and condition monitoring systems, demand the storage of vast amounts of data in the petabyte range. Prominent examples are Google, Facebook, Yahoo!, and Amazon, just to name a few.
This data volume can't be tackled with conventional relational database technologies anymore, whether for technical reasons, licensing reasons, or both. It demands a scale-out environment that allows reliable, scalable, and distributed processing. This trend in big data management is more and more often approached with NoSQL solutions like Apache HBase on top of Apache Hadoop.
This session discusses big data management and their scalability challenges in general with a short introduction into Apache Hadoop/HBase and a case study on the co-existence of Apache Hadoop/HBase with Firebird in a sensor data aquisition system.
Have you ever heard the buzzword "big data"? Briefly described, big data means collecting massive amounts of data, extracting both the small details and the larger trends within it, summarizing the output, and generating important insights about customers and competitors.
Enterprises seem to have sensed that something is in the air and have started to shop for technology. So what does the world have to offer enterprises that have an unknown number of petabytes flowing through their systems daily? There are a few options, but very few that can match the popularity of Hadoop. Hadoop can store and process large amounts of data, it has a large and diverse toolset for integration, operations, and processing, and it is open source!
Hadoop is emerging as the preferred solution for big data analytics across unstructured data. Using real world examples learn how to achieve a competitive advantage by finding effective ways of analyzing new sources of unstructured and machine-generated data.
Comparison between RDBMS, Hadoop and Apache based on parameters like Data Variety, Data Storage, Querying, Cost, Schema, Speed, Data Objects, Hardware profile, and Used cases. It also mentions benefits and limitations.
Apache Hadoop started as batch: simple, powerful, efficient, scalable, and a shared platform. However, Hadoop is more than that. Its true strengths are:
Scalability – it's affordable due to it being open-source and its use of commodity hardware for reliable distribution.
Schema on read – you can afford to save everything in raw form.
Data is better than algorithms – More data and a simple algorithm can be much more meaningful than less data and a complex algorithm.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos - Lester Martin
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Integrating R & Hadoop - Text Mining & Sentiment Analysis - Aravind Babu
This project examines the sentiment expressed in social media (Twitter) toward smartphones: how to perform text mining on Hadoop data and analyze it by integrating R with Hadoop.
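The project's actual pipeline uses R on Hadoop. Purely to illustrate the lexicon-scoring idea that simple sentiment analyses often start from, here is a toy Python sketch (the word lists and example tweets are invented, not taken from the project):

```python
# Toy lexicon-based sentiment scoring: count positive-word hits minus
# negative-word hits per tweet. Real pipelines use far larger lexicons
# and handle negation, but the core scoring step looks like this.
POSITIVE = {"great", "fast", "love"}
NEGATIVE = {"slow", "broken", "hate"}

def sentiment(tweet):
    # Tokenize naively on whitespace, then score the tokens.
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("love this phone great camera"))  # 2
print(sentiment("battery is slow and broken"))    # -2
```

On Hadoop, a scoring function like this would run inside the map phase over the tweet corpus, with aggregation of scores per product or per time window happening in the reduce phase.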
My talk at August's joint meeting of Chicago's R and Hadoop user groups providing an introduction to using R with Hadoop. It starts with a quick introduction to and overview of available options, then focuses on using RHadoop's rmr library to perform an analysis on the publicly-available 'airline' data set.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GraphRAG is All You need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
2. Would You Like to…
• Predict X?
  – The outcome of a future event
  – Who is likely to do something
  – Genetic factors leading to disease
• Pre-filter things so humans can accomplish more?
• Do all of this faster and better?
This document is company confidential and is intended solely for the use and information of Booz Allen Hamilton
3. Why R and Hadoop?
• R is a fantastic platform for data science
  – Has a peer-reviewed community and journal that vets libraries
  – (Mostly) intuitive language
• Hadoop is the de-facto platform for parallel processing
• Today, we'll be talking about rmr, but there are two more packages: rhbase and rhdfs
4. Nothing Has Changed. Everything Has Changed.
• Some of the most effective techniques for data mining are relatively old
  – Modern SVM dates back to '92
  – Logistic regression dates back to '44
  – Important elements of the algorithms date back to Newton
• Accessibility and relevance have changed
  – Accessibility to data
  – Accessibility of computational power
  – Necessity of methods
5. Some Criticisms of R & RHadoop
• R docs are written in their own language (using data frames, etc.) that is unfamiliar to computer scientists
• R and CRAN documentation are more like old-school GNU than most Apache projects
  – Get used to Googling and using R's help() function
• R's data management facilities are inconsistent
• Streaming API isn't super fast
• (get over it)
6. Comparison to Other R Parallelism Frameworks
• SNOW/SNOWFALL
  – Operates over MPI, Sockets, or PVM
  – No tie-in to a DFS (bad for data-intensive computing)
  – Handles matrix multiplication well (perhaps better)
  – Doesn't handle other non-trivial IPC well (basically for parallel linear algebra and simulations)
• Rmpi
  – More code
  – All synchronization constructs are user-built (just like MPI)
7. Comparison to Other R Parallelism Frameworks
• Others…
  – Only other Hadoop libraries have integration with HDFS / are appropriate for data-intensive computing
  – Only RHadoop supports both local and cluster-based backends and has an intuitive interface that duplicates closures in the remote environment
  – Most environments are targeted towards modeling and simulation
8. Installation – Local Workstation
• Install R
  – MacPorts – sudo port install r-framework
  – Ubuntu – sudo apt-get install r-base
  – RHEL – sudo yum install R
• Install R dependencies (inside R)
  – install.packages(c("Rcpp", "RJSONIO", "itertools", "digest"), repos="http://watson.nci.nih.gov/cran_mirror/")
• Install RMR
  – curl http://cloud.github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz > rmr.tar.gz
  – install.packages("rmr.tar.gz") # from inside R, in the same directory
• Configure the local backend each time you run R
  – rmr.options.set(backend="local")
9. Installation – Cluster
• Install R and all packages you plan on using (rmr, e1071, topicmodels, tm, etc.) on each node.
• Use a compatible version of Hadoop 1 (1.0.3+ or CDH3+). Hadoop 2 may or may not work.
• The example on the previous slide installs R packages in your home directory; you probably want to install them to the root install.
• Configure environment variables:
  export HADOOP_CMD=/usr/bin/hadoop
  export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
10. The Curse of Dimensionality
[Chart: Volume of the Unit Ball vs. Dimensionality]
• The volume of the unit sphere tends towards 0 as the dimensionality of hyperspace increases
• Intuitively, this means that there is more "slop room" for your dividing hyperplane to fall into
• The amount of data we need to train a model rises with the feature space, tending towards infinity, making the problem untenable
• With a small feature space, there is no need for lots of data
• Thus, there is little point in using Hadoop to implement many classic machine learning models
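The vanishing-volume claim is easy to verify numerically: the volume of the n-dimensional unit ball has the closed form π^(n/2) / Γ(n/2 + 1). A minimal base-R check (the function name is mine):

```r
# Volume of the n-dimensional unit ball: pi^(n/2) / Gamma(n/2 + 1)
unit_ball_volume <- function(n) pi^(n / 2) / gamma(n / 2 + 1)

# Volume grows up to n = 5, then collapses towards zero
sapply(c(1, 2, 3, 5, 10, 20, 50), unit_ball_volume)
```

By n = 20 the volume is already below 0.03, which is the geometric intuition behind needing exponentially more data as the feature space grows.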
11. The Hadoop Data Science Flow
• Join
• Sample
• Model
• Repeat
12. Join
• Put two pieces of data together using a common key
• Scenario:
  – Data is in two flat files in HDFS
  – Turn rows into rows of key-value pairs, where the key is the join key and the value is the rest of the row
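A minimal sketch of that mapping step, assuming rmr 1.3's keyval() helper and a hypothetical comma-delimited row layout (the field positions are illustrative, not from the deck):

```r
library(rmr)  # rmr 1.3

# Hypothetical layout: each row is "joinkey,field1,field2,..."
# Emit the join key as the map key; the rest of the row is the value.
join.map <- function(k, row) {
  parts <- strsplit(row, ",")[[1]]
  keyval(parts[1], parts[-1])
}

# Running this map over both flat files sends all rows that share a
# join key to the same reducer, where they can be combined into
# joined records.
```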
13. Sample
• Take a sample of your (maybe) joined data
• The most common method is probabilistic
• Numerous other techniques can leverage partitions and the randomness of the key hash
• Scenarios (a precursor for):
  – Supervised learning/classification
  – Unsupervised learning/clustering
  – Regression
  – Distribution modeling
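The probabilistic method can be sketched as a map-only job: keep each record with some probability and drop the rest. This assumes rmr 1.3 semantics where a map function that returns NULL emits nothing; the 10% rate is illustrative:

```r
library(rmr)  # rmr 1.3

# Keep each record with probability 0.1; returning NULL drops it.
sample.map <- function(k, v) {
  if (runif(1) < 0.1) keyval(k, v) else NULL
}

# sampled = mapreduce(input = joined.data, map = sample.map)
```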
14. Model
• Supervised learning: I want to predict something and I already know (some) of the answers. Also called classification and binary classification
• Unsupervised learning: I want to find natural groupings in the data that I might not have known about
• Regression, probability modeling – I want to fit a curve to my data
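Each of the three model families above has a one-line counterpart in base R; for example, on built-in datasets:

```r
# Supervised: logistic regression predicting a known binary label
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial("logit"))

# Unsupervised: find natural groupings with no labels
groups <- kmeans(iris[, 1:4], centers = 3)

# Regression: fit a curve to the data
curve.fit <- lm(dist ~ speed, data = cars)
```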
15. Repeat
• Gain insight about the data
• Change your procedure (select only outliers, etc.)
• Gain more insight
16. RHadoop Impact: Join, Sample
• Work totally in R
• Execute large, complex joins such as cross joins
17. RHadoop Impact: Model
• Most algorithms work perfectly well (or better) over a sample of the data
• Train and cross-validate a large number of models in parallel
• Perform model selection in the reduce phase
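One way this parallel train-and-select pattern could look, sketched against the rmr 1.3 API (the key name, parameter grid, and use of e1071's cross-validated accuracy are my assumptions, not from the deck):

```r
library(rmr)     # rmr 1.3
library(e1071)   # provides svm()

# Map: train one SVM per cost parameter; emit every candidate under a
# single key so they all meet in one reducer.
train.map <- function(k, cost) {
  m <- svm(Species ~ ., iris, cost = cost, cross = 5)
  keyval("best", list(cost = cost, accuracy = m$tot.accuracy, model = m))
}

# Reduce: model selection - keep the candidate with the highest
# cross-validated accuracy.
select.reduce <- function(k, candidates) {
  accs <- sapply(candidates, function(cand) cand$accuracy)
  keyval(k, candidates[[which.max(accs)]])
}

# best = from.dfs(mapreduce(to.dfs(list(0.1, 1, 10, 100)),
#                           map = train.map, reduce = select.reduce))
```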
18. RHadoop API
mapreduce(
  input,
  output = NULL,
  map = to.map(identity),
  reduce = NULL,
  combine = NULL,
  reduce.on.data.frame = FALSE,
  input.format = "native",
  output.format = "native",
  vectorized = list(map = FALSE, reduce = FALSE),
  structured = list(map = FALSE, reduce = FALSE),
  backend.parameters = list(),
  verbose = TRUE)
19. RHadoop API
rmr.options.set(backend = c("hadoop", "local"),
  profile.nodes = NULL, vectorized.nrows = NULL)

to.dfs(object, output = dfs.tempfile(),
  format = "native")

from.dfs(input, format = "native",
  to.data.frame = FALSE, vectorized = FALSE,
  structured = FALSE)
20. Doing Things the R Way
• Objects
  – my_car = list(color="green", model="volt")
• Transforming a vector (list), iterating
  – lapply/sapply/tapply – functional programming constructs
• Loops (not preferred)
  – for (i in 1:100) {…}
  – Note this is the same as lapply(1:100, function(i){…})
• Other control structures – basically as you would expect
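The loop/lapply equivalence noted above, spelled out in runnable base R:

```r
# The loop...
squares <- numeric(100)
for (i in 1:100) squares[i] <- i^2

# ...and its functional equivalent
squares2 <- sapply(1:100, function(i) i^2)

identical(squares, squares2)  # TRUE
```

The functional form matters for rmr, whose map and reduce arguments are exactly this kind of anonymous function.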
21. Vectors in R
• R helps you! O_o
• Every object has a mode and a length, and hence can be interpreted as some sort of vector – even primitives!
• Even primitives such as strings or integers are stored in a vector of length 1, never free-standing
• There are lots of types of vectors
  – Lists (think linked list)
  – Atomic vectors (think array)
  http://cran.r-project.org/doc/manuals/R-intro.html#The-intrinsic-attributes-mode-and-length
• Type coercion usually works the way you would expect
  – But… you may find yourself using as.list() or as.vector(), or doing manual coercion frequently, depending on what libraries you're using, due to modes not matching
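A quick demonstration of the mode/length and coercion points above:

```r
x <- "hello"
mode(x)      # "character"
length(x)    # 1 - even a single string is a vector of length 1

v <- c(1, "two", TRUE)   # mixed types coerce to a common mode
mode(v)                  # "character"

# Manual coercion out of a list, as often needed with some libraries
as.numeric(unlist(list("1", "2", "3")))
```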
22. Example – Fake Data
fakedata = data.frame(
  x = c(rnorm(100)*.25, rep(.75,100) + rnorm(100)*.25),
  y = c(rnorm(100), rep(1,100) + rnorm(100)),
  z = c(rep(0,100), rep(1,100)))

plot(fakedata[,"x"], fakedata[,"y"],
  col = sapply(fakedata[,"z"], function(z) ifelse(z > 0, "blue", "green")))
23. Examples – Simple Parallelism
rmr.options.set(backend="local")

ints = to.dfs(1:100)

squares = mapreduce(ints,
  map = function(k, v) keyval(NULL, v^2))

from.dfs(squares)

# notice the result will be
# keyvals
24. Examples – Trying Lots of SVM Kernels
library(e1071)  # provides svm()

kernels = to.dfs(list("linear", "polynomial", "radial", "sigmoid"))

models = from.dfs(mapreduce(kernels,
  map = function(nothing, kern)
    keyval(NULL, svm(factor(z) ~ ., fakedata, kernel = kern))))

plot(models[[1]][["val"]], fakedata)
25. Examples – Different Models
calls = to.dfs(list(
  list("glm", z ~ ., family = binomial("logit"), fakedata),
  list("svm", z ~ ., fakedata)))

models = from.dfs(mapreduce(calls,
  map = function(nothing, callsig)
    keyval(NULL, do.call(callsig[[1]], callsig[2:length(callsig)]))))

models[[1]][["val"]]