Data Hacking with RHadoop

RHadoop is an effective platform for exploratory data analysis over big data sets. The convenience of an interactive command-line interpreter, combined with the overwhelming number of statistical and machine learning routines implemented in R libraries, makes it a highly effective environment for elementary data science.

We'll discuss the basics of RHadoop: what it is, how to install it, and the API fundamentals. Next we'll discuss common use cases that you might want to use RHadoop for. Last, we'll run through an interactive example.

Data Hacking with RHadoop: Presentation Transcript

  • RHadoop Data Hacking: Using R and Hadoop to do large-scale data science
  • Would You Like to…
      • Predict X?
          – The outcome of a future event
          – Who is likely to do something
          – Genetic factors leading to disease
      • Pre-filter things so humans can accomplish more?
      • Do all of this faster and better?
    This document is company confidential and is intended solely for the use and information of Booz Allen Hamilton
  • Why R and Hadoop?
      • R is a fantastic platform for data science
          – Has a peer-reviewed community and journal that vets libraries
          – (Mostly) intuitive language
      • Hadoop is the de facto platform for parallel processing
      • Today, we'll be talking about rmr, but there are two more packages: rhbase and rhdfs
  • Nothing Has Changed. Everything Has Changed.
      • Some of the most effective techniques for data mining are relatively old
          – Modern SVM dates back to '92
          – Logistic regression dates back to '44
          – Important elements of the algorithms date back to Newton
      • Accessibility and relevance have changed
          – Accessibility to data
          – Accessibility of computational power
          – Necessity of methods
  • Some Criticisms of R & RHadoop
      • R docs are written in their own language (using data frames, etc.) that is unfamiliar to computer scientists
      • R and CRAN documentation are more like old-school GNU than most Apache projects
          – Get used to Googling and using R's help() function
      • R's data management facilities are inconsistent
      • Streaming API isn't super fast
      • (get over it)
  • Comparison to Other R Parallelism Frameworks
      • SNOW/SNOWFALL
          – Operates over MPI, sockets, or PVM
          – No tie-in to a DFS (bad for data-intensive computing)
          – Handles matrix multiplication well (perhaps better)
          – Doesn't handle other non-trivial IPC well (basically for parallel linear algebra and simulations)
      • Rmpi
          – More code
          – All synchronization constructs are user-built (just like MPI)
  • Comparison to Other R Parallelism Frameworks
      • Others…
          – Only other Hadoop libraries have integration with HDFS/are appropriate for data-intensive computing
          – Only RHadoop supports local and cluster-based backends and has an intuitive interface that duplicates closures in the remote environment
          – Most environments are targeted towards modeling and simulation
  • Installation – Local Workstation
      • Install R
          – MacPorts: sudo port install r-framework
          – Ubuntu: sudo apt-get install r-base
          – RHEL: sudo yum install R
      • Install R dependencies (inside R)
          – install.packages(c("Rcpp", "RJSONIO", "itertools", "digest"), repos="http://watson.nci.nih.gov/cran_mirror/")
      • Install RMR
          – curl http://cloud.github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz > rmr.tar.gz
          – install.packages("rmr.tar.gz", repos=NULL, type="source") # from inside R, in the same directory
      • Configure the local backend each time you run R
          – rmr.options.set(backend="local")
  • Installation - Cluster
      • Install R and all packages you plan on using (rmr, e1071, topicmodels, tm, etc.) on each node.
      • Use a compatible version of Hadoop 1 (1.0.3+ or CDH3+). Hadoop 2 may or may not work.
      • The example on the previous slide installs R packages in your home directory; you probably want to install them to the root install.
      • Configure environment variables
            export HADOOP_CMD=/usr/bin/hadoop
            export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
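The same variables can also be set from inside an R session before loading rmr. The following is only a sketch: it reuses the HADOOP_CMD path from the slide (which may not match your installation) and leaves HADOOP_STREAMING alone, because the jar version has to be filled in by hand.

```r
# Path taken from the slide; adjust it to your own installation.
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")

# HADOOP_STREAMING would be set the same way once the jar version is known.

# Report whether the configured Hadoop binary exists on this machine.
hadoop_cmd <- Sys.getenv("HADOOP_CMD")
if (file.exists(hadoop_cmd)) {
  message("Found hadoop at ", hadoop_cmd)
} else {
  message("No hadoop binary at ", hadoop_cmd, "; the local backend can still be used")
}
```

Variables set with Sys.setenv() last only for the current R session, so the shell exports above remain the better choice on cluster nodes.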
  • The Curse of Dimensionality
      [Figure: Volume of the Unit Ball vs. Dimensionality]
      • The volume of the unit sphere tends towards 0 as the dimensionality of hyperspace increases
      • Intuitively this means that there is more "slop room" for your dividing hyperplane to fall into
      • The amount of data we need to train a model rises with the feature space, tending towards infinity, making the problem untenable
      • With a small feature space, there is no need for lots of data
      • Thus, there is little point in using Hadoop to implement many classic machine learning models
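The shrinking volume mentioned on this slide can be checked numerically. The sketch below uses the standard closed form for the volume of the n-dimensional unit ball, pi^(n/2) / gamma(n/2 + 1); the function name unit_ball_volume is made up for the example.

```r
# Volume of the unit ball in n dimensions: pi^(n/2) / gamma(n/2 + 1)
unit_ball_volume <- function(n) pi^(n / 2) / gamma(n / 2 + 1)

dims <- 1:20
volumes <- sapply(dims, unit_ball_volume)

# The volume rises at first, peaks at n = 5, then shrinks towards 0.
print(data.frame(n = dims, volume = round(volumes, 4)))
```

Passing dims and volumes to plot() should give a curve similar in shape to the figure the slide refers to.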
  • The Hadoop Data Science Flow
      • Join
      • Sample
      • Model
      • Repeat
  • Join
      • Put two pieces of data together using a common key
      • Scenario:
          – Data is in two flat files in HDFS
          – Turn rows into rows of key-value pairs, where the key is the join key and the value is the rest of the row
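The join scenario can be sketched in plain R, without Hadoop, to show what the key-value step looks like. This is an illustration only, not the presenters' code: the tables users and visits and the helpers to_pairs and join_pairs are invented for the example.

```r
# Two small tables standing in for the flat files (made-up data)
users  <- data.frame(user_id = c(1, 2, 3), name = c("ann", "bob", "cy"),
                     stringsAsFactors = FALSE)
visits <- data.frame(user_id = c(1, 1, 3), page = c("home", "cart", "home"),
                     stringsAsFactors = FALSE)

# Turn each row into a key-value pair: key = join key, value = rest of the row
to_pairs <- function(df, key) {
  lapply(seq_len(nrow(df)), function(i) {
    list(key = df[i, key], val = df[i, setdiff(names(df), key), drop = FALSE])
  })
}
user_pairs  <- to_pairs(users, "user_id")
visit_pairs <- to_pairs(visits, "user_id")

# Combine every left value with every right value that shares its key
join_pairs <- function(left, right) {
  out <- list()
  for (l in left) for (r in right) {
    if (identical(l$key, r$key)) {
      out[[length(out) + 1]] <- data.frame(key = l$key, l$val, r$val,
                                           row.names = NULL)
    }
  }
  do.call(rbind, out)
}
joined <- join_pairs(user_pairs, visit_pairs)
print(joined)
```

For data that fits in memory, merge(users, visits, by = "user_id") produces the same rows; the key-value form matters once the pairing has to happen in a reduce phase.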
  • Sample
      • Take a sample of your (maybe) joined data
      • The most common method is probabilistic sampling
      • Numerous other techniques can leverage partitions and randomness of the key hash
      • Scenarios (a precursor for):
          – Supervised learning/classification
          – Unsupervised learning/clustering
          – Regression
          – Distribution modeling
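A minimal sketch of probabilistic sampling in plain R follows. It assumes independent (Bernoulli) selection of each record, which is one common reading of "probabilistic"; the data frame records and the rate p are made up for the example.

```r
set.seed(42)  # fixed seed so the sketch is repeatable
records <- data.frame(id = 1:10000, value = rnorm(10000))

# Keep each record independently with probability p.
# The same per-record test could sit inside a map function.
p <- 0.01
keep <- runif(nrow(records)) < p
sampled <- records[keep, ]

# The sample size is close to p * nrow(records), but not exactly equal to it.
print(nrow(sampled))
```

Because each record is tested on its own, this style of sampling needs no coordination between mappers.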
  • Model
      • Supervised learning: I want to predict something and I already know some of the answers. Also called classification (binary classification when there are two outcomes)
      • Unsupervised learning: I want to find natural groupings in the data that I might not have known about
      • Regression, probability modeling: I want to fit a curve to my data
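The three kinds of modeling can each be tried in a few lines of base R. The sketch below is illustrative only: the data frame d is synthetic, and glm, kmeans and lm are stand-ins chosen because they ship with R, not because the slides prescribe them.

```r
set.seed(1)
# Two noisy groups of points with a known label z (synthetic data)
d <- data.frame(x = c(rnorm(100, 0, 0.25), rnorm(100, 0.75, 0.25)),
                y = c(rnorm(100, 0), rnorm(100, 1)),
                z = rep(c(0, 1), each = 100))

# Supervised: predict the known label z from x and y
classifier <- glm(z ~ x + y, data = d, family = binomial("logit"))
predicted  <- as.numeric(predict(classifier, type = "response") > 0.5)
accuracy   <- mean(predicted == d$z)

# Unsupervised: look for two natural groupings without using z
clusters <- kmeans(d[, c("x", "y")], centers = 2)

# Regression: fit a line to y as a function of x
fit <- lm(y ~ x, data = d)

print(accuracy)
print(table(clusters$cluster))
print(coef(fit))
```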
  • Repeat
      • Gain insight about the data
      • Change your procedure (select only outliers, etc.)
      • Gain more insight
  • RHadoop Impact: Join, Sample
      • Work totally in R
      • Execute large, complex joins such as cross joins
  • RHadoop Impact: Model
      • Most algorithms work perfectly well (or better) over a sample of the data
      • Train and cross-validate a large number of models in parallel
      • Perform model selection in the reduce phase
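The pattern of scoring many models independently and then selecting one can be sketched in base R, with lapply standing in for the map phase and which.min for the reduce phase. Everything here is an assumption made for illustration: the synthetic data, the polynomial candidates, and the helper cv_error.

```r
set.seed(7)
d <- data.frame(x = runif(200, -2, 2))
d$y <- d$x^2 + rnorm(200, sd = 0.3)              # quadratic signal plus noise
folds <- sample(rep(1:5, length.out = nrow(d)))  # 5-fold assignment

# "Map" step: each candidate model is scored independently of the others
cv_error <- function(degree) {
  errs <- sapply(1:5, function(k) {
    train <- d[folds != k, ]
    test  <- d[folds == k, ]
    fit   <- lm(y ~ poly(x, degree), data = train)
    mean((predict(fit, newdata = test) - test$y)^2)
  })
  list(key = degree, val = mean(errs))
}
scored <- lapply(1:4, cv_error)

# "Reduce" step: model selection keeps the candidate with the lowest error
errors <- sapply(scored, function(kv) kv$val)
best   <- scored[[which.min(errors)]]$key
print(errors)
print(best)
```

Since no candidate depends on another, the lapply call is the part that a cluster could spread across nodes.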
  • RHadoop API
        mapreduce(
          input,
          output = NULL,
          map = to.map(identity),
          reduce = NULL,
          combine = NULL,
          reduce.on.data.frame = FALSE,
          input.format = "native",
          output.format = "native",
          vectorized = list(map = FALSE, reduce = FALSE),
          structured = list(map = FALSE, reduce = FALSE),
          backend.parameters = list(),
          verbose = TRUE)
  • RHadoop API
        rmr.options.set(backend = c("hadoop", "local"),
                        profile.nodes = NULL, vectorized.nrows = NULL)

        to.dfs(object, output = dfs.tempfile(),
               format = "native")

        from.dfs(input, format = "native",
                 to.data.frame = FALSE, vectorized = FALSE,
                 structured = FALSE)
  • Doing Things the R Way
      • Objects
          – my_car = list(color="green", model="volt")
      • Transforming a vector (list), iterating
          – lapply/sapply/tapply – functional programming constructs
      • Loops (not preferred)
          – for (i in 1:100) {…}
          – Note this is the same as lapply(1:100, function(i){…})
      • Other control structures – basically as you would expect
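The constructs on this slide can be compared side by side in a short base R sketch. The variable names other than my_car are made up for the example.

```r
# A list used as a simple object
my_car <- list(color = "green", model = "volt")

# Loop version: fill a list element by element
squares_loop <- vector("list", 5)
for (i in 1:5) {
  squares_loop[[i]] <- i^2
}

# lapply version: same result, no explicit bookkeeping
squares_lapply <- lapply(1:5, function(i) i^2)

# sapply simplifies the result to an atomic vector when it can
squares_vec <- sapply(1:5, function(i) i^2)

# tapply applies a function within groups
by_parity <- tapply(1:10, rep(c("odd", "even"), 5), sum)

print(identical(squares_loop, squares_lapply))  # TRUE
print(squares_vec)
print(by_parity)
```

One practical difference is that lapply returns its results as a value, while the for loop has to assign into a variable created beforehand.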
  • Vectors in R
      • R helps you! O_o
      • Every object has a mode and length and hence can be interpreted as some sort of vector – even primitives!
      • Even primitives such as strings or integers are stored in a vector of length 1, never free-standing
      • There are lots of types of vectors
          – Lists (think linked list)
          – Atomic vectors (think array)
        http://cran.r-project.org/doc/manuals/R-intro.html#The-intrinsic-attributes-mode-and-length
      • Type coercion usually works the way you would expect
          – But… you may find yourself using as.list() or as.vector() or doing manual coercion frequently, depending on what libraries you're using, due to mode not matching
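A few lines at the R prompt make the points about mode, length and coercion concrete. This is a small base R sketch; the variable names are made up.

```r
x <- 42L
length(x)        # 1: a single integer is still a vector of length 1
mode("hello")    # "character"
is.vector(x)     # TRUE

atomic <- c(1, 2, 3)            # atomic vector: every element has the same type
mixed  <- list(1, "two", TRUE)  # list: elements may differ in type
mode(atomic)                    # "numeric"
mode(mixed)                     # "list"

# Coercion: combining a number and a string silently yields strings
coerced <- c(1, "a")
mode(coerced)                   # "character"

# Manual coercion between lists and atomic vectors
as_list   <- as.list(atomic)
back_flat <- unlist(as_list)
identical(back_flat, atomic)    # TRUE
```

unlist() is used for the return trip because as.vector() applied to a list hands back the list unchanged, which is one way the mode mismatch mentioned above can surprise you.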
  • Example – Fake Data
        fakedata = data.frame(
          x = c(rnorm(100)*.25, rep(.75,100)+rnorm(100)*.25),
          y = c(rnorm(100), rep(1,100)+rnorm(100)),
          z = c(rep(0,100), rep(1,100)))

        plot(fakedata[,"x"], fakedata[,"y"],
             col=sapply(fakedata[,"z"], function(z) ifelse(z>0,"blue","green")))
  • Examples – Simple Parallelism
        library(rmr)
        rmr.options.set(backend="local")

        ints = to.dfs(1:100)

        # the map function receives a key and a value
        squares = mapreduce(ints, map=function(k, v) keyval(NULL, v^2))

        print(from.dfs(squares))

        # notice the result will be
        # keyvals
  • Examples – Trying Lots of SVM Kernels
        library(e1071)  # provides svm()

        kernels = to.dfs(list("linear","polynomial","radial","sigmoid"))

        models = from.dfs(mapreduce(kernels,
          map=function(nothing, kern)
            keyval(NULL, svm(factor(z)~., fakedata, kernel=kern))))

        plot(models[[1]][["val"]], fakedata)
  • Examples – Different Models
        calls = to.dfs(list(
          list("glm", z~., family=binomial("logit"), fakedata),
          list("svm", z~., fakedata)))

        models = from.dfs(mapreduce(calls,
          map=function(nothing, callsig)
            keyval(NULL, do.call(callsig[[1]], callsig[2:length(callsig)]))))

        models[[1]][["val"]]
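The "call signature" trick in this last example rests on do.call, which can be tried on its own in base R. The sketch below is an illustration under assumptions: it uses lm and glm on synthetic data instead of svm so that no extra package is needed, and lapply stands in for the map phase.

```r
set.seed(3)
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50, sd = 0.1)

# Each call signature: function name first, then its arguments
calls <- list(
  list("lm",  y ~ x, data = d),
  list("glm", y ~ x, data = d, family = gaussian())
)

# do.call(name, args) behaves like writing name(args...) by hand
models <- lapply(calls, function(callsig) do.call(callsig[[1]], callsig[-1]))

print(sapply(models, function(m) class(m)[1]))
print(coef(models[[1]]))
```

Storing the function name and its arguments as plain data is what lets a list of different model types travel through to.dfs and be fitted inside a single map function.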