RHadoop Data Hacking

Using R and Hadoop to do large-scale data science

Would You Like to…

    •  Predict X?
        –  The outcome of a future event
        –  Who is likely to do something
        –  Genetic factors leading to disease
    •  Pre-filter things so humans can accomplish more?
    •  Do all of this faster and better?


This document is company confidential and is intended solely for the use and information of Booz Allen Hamilton

Why R and Hadoop?

    •  R is a fantastic platform for data science
        –  Has a peer-reviewed community and journal that vets libraries
        –  (Mostly) intuitive language
    •  Hadoop is the de facto platform for parallel processing
    •  Today, we’ll be talking about rmr, but there are two more packages: rhbase and rhdfs



Nothing Has Changed. Everything Has Changed.

    •  Some of the most effective techniques for data mining are relatively old
        –  Modern SVM dates back to ’92
        –  Logistic regression dates back to ’44
        –  Important elements of the algorithms date back to Newton
    •  Accessibility and relevance have changed
        –  Accessibility to data
        –  Accessibility of computational power
        –  Necessity of methods



Some Criticisms of R & RHadoop

    •  R docs are written in their own language (using data frames, etc.) that is unfamiliar to computer scientists
    •  R and CRAN documentation are more like old-school GNU than most Apache projects
        –  Get used to Googling and using R’s help() function
    •  R’s data management facilities are inconsistent
    •  Streaming API isn’t super fast
    •  (get over it)

Comparison to Other R Parallelism Frameworks

    •  SNOW/SNOWFALL
        –  Operates over MPI, sockets, or PVM
        –  No tie-in to a DFS (bad for data-intensive computing)
        –  Handles matrix multiplication well (perhaps better)
        –  Doesn’t handle other non-trivial IPC well (basically for parallel linear algebra and simulations)
    •  Rmpi
        –  More code
        –  All synchronization constructs are user-built (just like MPI)




Comparison to Other R Parallelism Frameworks

    •  Others…
        –  Only the other Hadoop libraries integrate with HDFS or are appropriate for data-intensive computing
        –  Only RHadoop supports both local and cluster-based backends and has an intuitive interface that duplicates closures in the remote environment
        –  Most environments are targeted towards modeling and simulation


Installation – Local Workstation

    •  Install R
        –  MacPorts – sudo port install r-framework
        –  Ubuntu – sudo apt-get install r-base
        –  RHEL – sudo yum install R
    •  Install R dependencies (inside R)
        –  install.packages(c("Rcpp", "RJSONIO", "itertools", "digest"), repos="http://watson.nci.nih.gov/cran_mirror/")
    •  Install rmr
        –  curl http://cloud.github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz > rmr.tar.gz
        –  install.packages("rmr.tar.gz", repos=NULL, type="source") # from inside R, in the same directory
    •  Configure the local backend each time you run R
        –  rmr.options.set(backend="local")




Installation – Cluster

    •  Install R and all packages you plan on using (rmr, e1071, topicmodels, tm, etc.) on each node.
    •  Use a compatible version of Hadoop 1 (1.0.3+ or CDH3+). Hadoop 2 may or may not work.
    •  The example on the previous slide installs R packages in your home directory; on a cluster you probably want to install them system-wide (as root) instead.
    •  Configure environment variables
           export HADOOP_CMD=/usr/bin/hadoop
           export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
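
The same two variables can also be set from inside an R session before loading rmr. A minimal sketch using base R's Sys.setenv(); the paths are the ones shown on this slide, and the `<version>` placeholder is deliberately left as-is and has to be replaced with whatever jar your cluster actually ships:

```r
# Set the Hadoop locations for this R session only (paths taken from the slide).
Sys.setenv(
  HADOOP_CMD = "/usr/bin/hadoop",
  # "<version>" is a placeholder, not a real file name; substitute your jar's version.
  HADOOP_STREAMING = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar"
)

# Show what R (and processes started from R) will now see.
print(Sys.getenv(c("HADOOP_CMD", "HADOOP_STREAMING")))
```

Setting the variables in the shell profile, as on the slide, is still the better choice when every session on the node should see them.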




The Curse of Dimensionality

    [Figure: Volume of the Unit Ball vs. Dimensionality]

    •  The volume of the unit sphere tends towards 0 as the dimensionality of hyperspace increases
    •  Intuitively this means that there is more “slop room” for your dividing hyperplane to fall into
    •  The amount of data we need to train a model rises with the feature space, tending towards infinity, making the problem untenable
    •  With a small feature space, there is no need for lots of data
    •  Thus, there is little point in using Hadoop to implement many classic machine learning models




The Hadoop Data Science Flow

    •  Join
    •  Sample
    •  Model
    •  Repeat
  




Join

    •  Put two pieces of data together using a common key
    •  Scenario:
        –  Data is in two flat files in HDFS
        –  Turn rows into rows of key-value pairs, where the key is the join key and the value is the rest of the row
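
This scenario can be sketched in base R without a cluster. The two data frames are made-up stand-ins for the flat files, and split() plays the role of Hadoop's shuffle; an actual rmr job would do the keying in the map function and the pairing in the reduce function:

```r
# Made-up stand-ins for the two flat files.
users  <- data.frame(id = c(1, 2, 3), name = c("ann", "bob", "cy"),
                     stringsAsFactors = FALSE)
orders <- data.frame(id = c(1, 1, 3), total = c(9.5, 20, 4.25))

# "Map" + "shuffle": key every row by the join key; the value is the rest of the row.
users_by_key  <- split(users[, "name", drop = FALSE], users$id)
orders_by_key <- split(orders[, "total", drop = FALSE], orders$id)

# "Reduce": for each key present on both sides, pair every left value
# with every right value (merge with by = NULL is a Cartesian product).
common <- intersect(names(users_by_key), names(orders_by_key))
joined <- do.call(rbind, lapply(common, function(k) {
  cbind(id = as.numeric(k),
        merge(users_by_key[[k]], orders_by_key[[k]], by = NULL))
}))
rownames(joined) <- NULL
print(joined)
```

Keys that appear on only one side (id 2 here) drop out, which makes this an inner join.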
  




Sample

    •  Take a sample of your (maybe) joined data
    •  The most common method is probabilistic sampling
    •  Numerous other techniques can leverage partitions and randomness of the key hash
    •  Scenarios (a precursor for):
        –  Supervised learning/classification
        –  Unsupervised learning/clustering
        –  Regression
        –  Distribution modeling
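
Both sampling styles can be sketched in a few lines of base R. This is an illustration on a made-up in-memory data frame, not rmr code; the modulo on the id is only a stand-in for a real key hash, and the 1% rate is arbitrary:

```r
set.seed(42)  # fixed seed so the sketch is repeatable

records <- data.frame(id = 1:10000, value = rnorm(10000))

# Probabilistic (Bernoulli) sampling: keep each record independently with
# probability p, so a mapper never needs to know the total record count.
p <- 0.01
keep <- runif(nrow(records)) < p
bernoulli_sample <- records[keep, ]

# Key-based sampling: keep the records whose key lands in one fixed bucket.
# `id %% 100` stands in for hash(key) %% 100 on a real cluster.
bucket_sample <- records[records$id %% 100 == 0, ]

print(c(bernoulli = nrow(bernoulli_sample), bucket = nrow(bucket_sample)))
```

The Bernoulli sample size varies from run to run around p times the record count, while the bucket sample is deterministic for a given key set.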
  


Model

    •  Supervised learning: I want to predict something and I already know some of the answers. Classification and binary classification fall under this heading
    •  Unsupervised learning: I want to find natural groupings in the data that I might not have known about
    •  Regression, probability modeling: I want to fit a curve to my data



Repeat

    •  Gain insight about the data
    •  Change your procedure (select only outliers, etc.)
    •  Gain more insight




RHadoop Impact: Join, Sample

    •  Work totally in R
    •  Execute large, complex joins such as cross joins




RHadoop Impact: Model

    •  Most algorithms work perfectly well (or better) over a sample of the data
    •  Train and cross-validate a large number of models in parallel
    •  Perform model selection in the reduce phase
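
One way to picture training many models and selecting among them, using only base R: lapply() stands in for the map phase (one task per candidate model) and Reduce() for the reduce phase that picks a winner. The synthetic data, the polynomial-degree candidates, and the 5-fold split are all assumptions made for this sketch; in an rmr job the candidates would be written with to.dfs() and scored inside the map function:

```r
set.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic sample: the true curve is quadratic, plus noise.
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - 1.5 * x^2 + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)
folds <- sample(rep(1:5, length.out = n))  # 5-fold assignment

# "Map": one task per candidate model, returning its cross-validated error.
cv_error <- function(degree) {
  errs <- sapply(1:5, function(k) {
    fit  <- lm(y ~ poly(x, degree), data = dat[folds != k, ])
    pred <- predict(fit, newdata = dat[folds == k, ])
    mean((dat$y[folds == k] - pred)^2)
  })
  list(degree = degree, mse = mean(errs))
}
results <- lapply(1:6, cv_error)

# "Reduce": keep the candidate with the lowest cross-validated error.
best <- Reduce(function(a, b) if (b$mse < a$mse) b else a, results)
print(best)
```

Because each candidate is scored independently, the map tasks need no communication with each other, which is what makes this pattern a good fit for Hadoop.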
  




RHadoop API

    mapreduce(
        input,
        output = NULL,
        map = to.map(identity),
        reduce = NULL,
        combine = NULL,
        reduce.on.data.frame = FALSE,
        input.format = "native",
        output.format = "native",
        vectorized = list(map = FALSE, reduce = FALSE),
        structured = list(map = FALSE, reduce = FALSE),
        backend.parameters = list(),
        verbose = TRUE)




RHadoop API

    rmr.options.set(backend = c("hadoop", "local"),
        profile.nodes = NULL, vectorized.nrows = NULL)

    to.dfs(object, output = dfs.tempfile(),
        format = "native")

    from.dfs(input, format = "native",
        to.data.frame = FALSE, vectorized = FALSE,
        structured = FALSE)




Doing Things the R Way

    •  Objects
        –  my_car = list(color="green", model="volt")
    •  Transforming a vector (list), iterating
        –  lapply/sapply/tapply – functional programming constructs
    •  Loops (not preferred)
        –  for (i in 1:100) {…}
        –  Note this is roughly the same as lapply(1:100, function(i){…}), which also collects the results in a list
    •  Other control structures – basically as you would expect
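
A short base R sketch of the constructs on this slide, on toy values chosen only for illustration:

```r
# A list used as a lightweight object with named fields.
my_car <- list(color = "green", model = "volt")
print(my_car$color)  # access a field by name

# Loop version: pre-allocate the result, then fill in each element.
squares_loop <- numeric(5)
for (i in 1:5) {
  squares_loop[i] <- i^2
}

# Functional versions: lapply returns a list, sapply simplifies to a vector.
squares_list <- lapply(1:5, function(i) i^2)
squares_vec  <- sapply(1:5, function(i) i^2)

# tapply applies a function within the groups named by the second argument.
group_means <- tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), mean)
print(group_means)
```

The functional forms are the ones that carry over to rmr, since a map function is just another function applied to each key-value pair.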




Vectors in R

    •  R helps you! O_o
    •  Every object has a mode and length and hence can be interpreted as some sort of vector – even primitives!
    •  Even primitives such as strings or integers are stored in a vector of length 1, never free-standing
    •  There are lots of types of vectors
        –  Lists (generic vectors; each element can be of any type)
        –  Atomic vectors (think array)
           http://cran.r-project.org/doc/manuals/R-intro.html#The-intrinsic-attributes-mode-and-length
    •  Type coercion usually works the way you would expect
        –  But… you may find yourself using as.list() or as.vector() or doing manual coercion frequently, depending on what libraries you’re using, due to mode not matching




Example – Fake Data

    fakedata = data.frame(
        x = c(rnorm(100)*.25, rep(.75,100)+rnorm(100)*.25),
        y = c(rnorm(100), rep(1,100)+rnorm(100)),
        z = c(rep(0,100), rep(1,100)))

    plot(fakedata[,"x"], fakedata[,"y"],
         col=sapply(fakedata[,"z"], function(z) ifelse(z>0,"blue","green")))




Examples – Simple Parallelism

    library(rmr)
    rmr.options.set(backend="local")

    ints = to.dfs(1:100)

    # the map function receives a key and a value, as in the later examples
    squares = mapreduce(ints, map=function(k, v) keyval(NULL, v^2))

    print(from.dfs(squares))

    # notice the result will be
    # keyvals




Examples – Trying Lots of SVM Kernels

    library(e1071)  # provides svm()

    kernels = to.dfs(list("linear","polynomial","radial","sigmoid"))

    models = from.dfs(mapreduce(kernels, map=function(nothing,kern)
        keyval(NULL, svm(factor(z)~., fakedata, kernel=kern))))

    plot(models[[1]][["val"]], fakedata)



Examples – Different Models

    calls = to.dfs(list(
        list("glm", z~., family=binomial("logit"), fakedata),
        list("svm", z~., fakedata)))

    models = from.dfs(mapreduce(calls, map=function(nothing,callsig)
        keyval(NULL, do.call(callsig[[1]], callsig[2:length(callsig)]))))

    models[[1]][["val"]]




This	
  document	
  is	
  company	
  confiden?al	
  and	
  is	
  intended	
  solely	
  for	
  the	
  use	
  and	
  informa?on	
  of	
  Booz	
  Allen	
  Hamilton	
     25	
  

More Related Content

What's hot

Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programPraveen Kumar Donta
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopMike Pittaro
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basicHafizur Rahman
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterEdureka!
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slidesryancox
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
 

What's hot (20)

HBase with MapR
HBase with MapRHBase with MapR
HBase with MapR
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for Hadoop
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop Cluster
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 

Viewers also liked

OBIEE Answers Vs Data Visualization: A Cage Match
OBIEE Answers Vs Data Visualization: A Cage MatchOBIEE Answers Vs Data Visualization: A Cage Match
OBIEE Answers Vs Data Visualization: A Cage MatchMichelle Kolbe
 
Accessing Databases from R
Accessing Databases from RAccessing Databases from R
Accessing Databases from RJeffrey Breen
 
Hp distributed R User Guide
Hp distributed R User GuideHp distributed R User Guide
Hp distributed R User GuideAndrey Karpov
 
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...Revolution Analytics
 
Tapping the Data Deluge with R
Tapping the Data Deluge with RTapping the Data Deluge with R
Tapping the Data Deluge with RJeffrey Breen
 
Data profiling-best-practices
Data profiling-best-practicesData profiling-best-practices
Data profiling-best-practicesBlaise Cheuteu
 
Data Exploration, Validation and Sanitization
Data Exploration, Validation and SanitizationData Exploration, Validation and Sanitization
Data Exploration, Validation and SanitizationVenkata Reddy Konasani
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in RJeffrey Breen
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced featuresVenkata Reddy Konasani
 

Viewers also liked (20)

OBIEE Answers Vs Data Visualization: A Cage Match
OBIEE Answers Vs Data Visualization: A Cage MatchOBIEE Answers Vs Data Visualization: A Cage Match
OBIEE Answers Vs Data Visualization: A Cage Match
 
Excel/R
Excel/RExcel/R
Excel/R
 
Accessing Databases from R
Accessing Databases from RAccessing Databases from R
Accessing Databases from R
 
Applications of R (DataWeek 2014)
Applications of R (DataWeek 2014)Applications of R (DataWeek 2014)
Applications of R (DataWeek 2014)
 
Hp distributed R User Guide
Hp distributed R User GuideHp distributed R User Guide
Hp distributed R User Guide
 
R crash course
R crash courseR crash course
R crash course
 
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...
 
Tapping the Data Deluge with R
Tapping the Data Deluge with RTapping the Data Deluge with R
Tapping the Data Deluge with R
 
Data profiling-best-practices
Data profiling-best-practicesData profiling-best-practices
Data profiling-best-practices
 
Big Data Profiling
Big Data Profiling Big Data Profiling
Big Data Profiling
 
Testing of hypothesis case study
Testing of hypothesis case study Testing of hypothesis case study
Testing of hypothesis case study
 
RHadoop, R meets Hadoop
RHadoop, R meets HadoopRHadoop, R meets Hadoop
RHadoop, R meets Hadoop
 
Data Exploration, Validation and Sanitization
Data Exploration, Validation and SanitizationData Exploration, Validation and Sanitization
Data Exploration, Validation and Sanitization
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
 
ARIMA
ARIMA ARIMA
ARIMA
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
Decision tree
Decision treeDecision tree
Decision tree
 
SAS basics Step by step learning
SAS basics Step by step learningSAS basics Step by step learning
SAS basics Step by step learning
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
 
Correlation and Simple Regression
Correlation  and Simple RegressionCorrelation  and Simple Regression
Correlation and Simple Regression
 

Data Hacking with RHadoop

  • 1. RHadoop Data Hacking
      Using R and Hadoop to do large-scale data science
  • 2. Would You Like to…
      •  Predict X?
          –  The outcome of a future event
          –  Who is likely to do something
          –  Genetic factors leading to disease
      •  Pre-filter things so humans can accomplish more?
      •  Do all of this faster and better?
      This document is company confidential and is intended solely for the use and information of Booz Allen Hamilton
  • 3. Why R and Hadoop?
      •  R is a fantastic platform for data science
          –  Has a peer-reviewed community and journal that vets libraries
          –  (Mostly) intuitive language
      •  Hadoop is the de facto platform for parallel processing
      •  Today, we’ll be talking about rmr, but there are two more packages: rhbase and rhdfs
  • 4. Nothing Has Changed. Everything Has Changed.
      •  Some of the most effective techniques for data mining are relatively old
          –  Modern SVM dates back to 1992
          –  Logistic regression dates back to 1944
          –  Important elements of the algorithms date back to Newton
      •  Accessibility and relevance have changed
          –  Accessibility to data
          –  Accessibility of computational power
          –  Necessity of methods
  • 5. Some Criticisms of R & RHadoop
      •  R docs are written in their own language (using data frames, etc.) that is unfamiliar to computer scientists
      •  R and CRAN documentation are more like old-school GNU than most Apache projects
          –  Get used to Googling and using R’s help() function
      •  R’s data management facilities are inconsistent
      •  Streaming API isn’t super fast
      •  (get over it)
  • 6. Comparison to Other R Parallelism Frameworks
      •  SNOW/SNOWFALL
          –  Operates over MPI, Sockets, or PVM
          –  No tie-in to a DFS (bad for data-intensive computing)
          –  Handles matrix multiplication well (perhaps better)
          –  Doesn’t handle other non-trivial IPC well (basically for parallel linear algebra and simulations)
      •  Rmpi
          –  More code
          –  All synchronization constructs are user-built (just like MPI)
  • 7. Comparison to Other R Parallelism Frameworks
      •  Others…
          –  Only other Hadoop libraries have integration with HDFS/are appropriate for data-intensive computing
          –  Only RHadoop supports local and cluster-based backends and has an intuitive interface that duplicates closures in the remote environment
          –  Most environments are targeted towards modeling and simulation
  • 8. Installation – Local Workstation
      •  Install R
          –  MacPorts – sudo port install r-framework
          –  Ubuntu – sudo apt-get install r-base
          –  RHEL – sudo yum install R
      •  Install R dependencies (inside R)
          –  install.packages(c("Rcpp", "RJSONIO", "itertools", "digest"), repos="http://watson.nci.nih.gov/cran_mirror/")
      •  Install rmr
          –  curl http://cloud.github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz > rmr.tar.gz
          –  install.packages("rmr.tar.gz") # from inside R, in the same directory
      •  Configure the local backend each time you run R
          –  rmr.options.set(backend="local")
  • 9. Installation - Cluster
      •  Install R and all packages you plan on using (rmr, e1071, topicmodels, tm, etc.) on each node.
      •  Use a compatible version of Hadoop 1 (1.0.3+ or CDH3+). Hadoop 2 may or may not work.
      •  The example on the previous slide installs R packages in your home directory; you probably want to install them to the root install.
      •  Configure environment variables
          export HADOOP_CMD=/usr/bin/hadoop
          export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
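The same two variables can also be set from inside an R session before rmr is loaded, which may be convenient when the shell profile is not under your control. This is a minimal sketch that assumes the paths quoted on the slide; the `<version>` placeholder is left as written and has to be replaced with the jar actually present on the cluster. The name `hadoop_env` is introduced here only for illustration.

```r
# Set the Hadoop locations for the current R session only.
# Paths are the ones quoted on the slide; adjust to your installation.
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING =
  "/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar")

# Read the values back to confirm what the session will see.
hadoop_env <- Sys.getenv(c("HADOOP_CMD", "HADOOP_STREAMING"))
print(hadoop_env)
```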
  • 10. The Curse of Dimensionality
      [Figure: Volume of the Unit Ball vs. Dimensionality]
      •  The volume of the unit sphere tends towards 0 as the dimensionality of hyperspace increases
      •  Intuitively this means that there is more “slop room” for your dividing hyperplane to fall into
      •  The amount of data we need to train a model rises with the size of the feature space, tending towards infinity, making the problem untenable
      •  With a small feature space, there is no need for lots of data
      •  Thus, there is little point in using Hadoop to implement many classic machine learning models
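The first bullet can be checked numerically. The volume of the unit ball in n dimensions has the closed form V_n = pi^(n/2) / Gamma(n/2 + 1). The sketch below uses base R only; the function name `unit_ball_volume` is introduced here for illustration and is not part of the deck.

```r
# Volume of the unit ball in n dimensions: pi^(n/2) / gamma(n/2 + 1).
unit_ball_volume <- function(n) pi^(n / 2) / gamma(n / 2 + 1)

dims <- 1:30
vols <- unit_ball_volume(dims)

# Tabulate the volumes; they rise for the first few dimensions
# and then shrink toward zero.
print(data.frame(n = dims, volume = signif(vols, 4)))

# Dimension at which the volume is largest.
peak_dim <- dims[which.max(vols)]
print(peak_dim)
```

Plotting `vols` against `dims` should give a curve with the same shape as the figure named on the slide.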
  • 11. The Hadoop Data Science Flow
      •  Join
      •  Sample
      •  Model
      •  Repeat
  • 12. Join
      •  Put two pieces of data together using a common key
      •  Scenario:
          –  Data is in two flat files in HDFS
          –  Turn rows into rows of key-value pairs, where the key is the join key and the value is the rest of the row
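The key-value scenario above can be sketched in plain R without a cluster. This is not rmr code: ordinary lists stand in for the map, shuffle, and reduce phases, and the tables `users` and `orders`, along with the helpers `to_pairs` and `join_one`, are made up for illustration.

```r
# Two small tables standing in for the flat files in HDFS (made-up data).
users  <- data.frame(id = c(1, 2, 3), name = c("ann", "bob", "cy"),
                     stringsAsFactors = FALSE)
orders <- data.frame(id = c(1, 1, 3), item = c("disk", "ram", "cpu"),
                     stringsAsFactors = FALSE)

# "Map": turn each row into a (key, value) pair tagged with its source table.
to_pairs <- function(df, key, tag) {
  lapply(seq_len(nrow(df)), function(i) {
    rest <- df[i, setdiff(names(df), key), drop = FALSE]
    list(key = df[[key]][i], val = list(tag = tag, row = rest))
  })
}
pairs <- c(to_pairs(users, "id", "users"), to_pairs(orders, "id", "orders"))

# "Shuffle": group the values by key, as Hadoop does between map and reduce.
keys    <- sapply(pairs, function(p) p$key)
grouped <- split(lapply(pairs, function(p) p$val), keys)

# "Reduce": for one key, pair every users row with every orders row.
join_one <- function(key, vals) {
  left  <- Filter(function(v) v$tag == "users",  vals)
  right <- Filter(function(v) v$tag == "orders", vals)
  if (length(left) == 0 || length(right) == 0) return(NULL)  # inner join
  do.call(rbind, lapply(left, function(l) {
    do.call(rbind, lapply(right, function(r) {
      data.frame(id = as.numeric(key), l$row, r$row,
                 row.names = NULL, stringsAsFactors = FALSE)
    }))
  }))
}
joined <- do.call(rbind, unname(Map(join_one, names(grouped), grouped)))
rownames(joined) <- NULL
print(joined)
```

For data that fits in memory, `merge(users, orders, by = "id")` gives the same rows in one call; the staged version only shows where the join key does its work.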
  • 13. Sample
      •  Take a sample of your (maybe) joined data
      •  The most common method is probabilistic sampling
      •  Numerous other techniques can leverage partitions and randomness of the key hash
      •  Scenarios (a precursor for):
          –  Supervised learning/classification
          –  Unsupervised learning/clustering
          –  Regression
          –  Distribution modeling
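Both sampling styles mentioned above can be illustrated in base R. This is a sketch under stated assumptions: the `records` table is synthetic, and `toy_hash` (a sum of character codes) is a hypothetical stand-in for whatever hash Hadoop applies to keys, not the real partitioner.

```r
set.seed(42)  # fixed seed so the sketch is repeatable
records <- data.frame(key = sprintf("user%04d", 1:10000),
                      value = rnorm(10000),
                      stringsAsFactors = FALSE)

# Probabilistic sample: keep each record independently with probability p.
p <- 0.1
bernoulli_sample <- records[runif(nrow(records)) < p, ]

# Hash-style sample: keep the records whose key lands in one of `buckets`
# partitions. The same key always lands in the same bucket.
toy_hash  <- function(s) sum(utf8ToInt(s))
buckets   <- 10
bucket_of <- sapply(records$key, toy_hash) %% buckets
hash_sample <- records[bucket_of == 0, ]

cat("bernoulli:", nrow(bernoulli_sample), "hash:", nrow(hash_sample), "\n")
```

The hash variant is deterministic per key, so repeated runs or separate datasets sharing the same keys select the same entities, which matters when the sample has to survive a later join.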
  • 14. Model
      •  Supervised learning: I want to predict something and I already know (some of) the answers. Also called classification and binary classification
      •  Unsupervised learning: I want to find natural groupings in the data that I might not have known about
      •  Regression, probability modeling: I want to fit a curve to my data
  • 15. Repeat
      •  Gain insight about the data
      •  Change your procedure (select only outliers, etc.)
      •  Gain more insight
  • 16. RHadoop Impact: Join, Sample
      •  Work totally in R
      •  Execute large, complex joins such as cross joins
  • 17. RHadoop Impact: Model
      •  Most algorithms work perfectly well (or better) over a sample of the data
      •  Train and cross-validate a large number of models in parallel
      •  Perform model selection in the reduce phase
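The train, cross-validate, and select pattern can be tried locally before involving a cluster. The sketch below uses base R only, with `sapply` standing in for the map phase and `which.max` for the reduce phase; the data frame `d`, the candidate formulas, and the helper `cv_accuracy` are assumptions made for illustration.

```r
set.seed(1)
n <- 100
# Two overlapping classes, in the spirit of the deck's fake data.
d <- data.frame(x = c(rnorm(n) * .25, .75 + rnorm(n) * .25),
                y = c(rnorm(n), 1 + rnorm(n)),
                z = c(rep(0, n), rep(1, n)))

# Candidate models; on a cluster each would be handled by its own map task.
candidates <- list(z ~ x, z ~ y, z ~ x + y)
folds <- sample(rep(1:5, length.out = nrow(d)))  # 5-fold assignment

# "Map": cross-validate one candidate and emit its mean held-out accuracy.
cv_accuracy <- function(f) {
  mean(sapply(1:5, function(k) {
    fit  <- glm(f, family = binomial("logit"), data = d[folds != k, ])
    pred <- predict(fit, newdata = d[folds == k, ], type = "response") > 0.5
    mean(pred == (d$z[folds == k] == 1))
  }))
}
scores <- sapply(candidates, cv_accuracy)

# "Reduce": model selection is picking the best-scoring candidate.
best <- candidates[[which.max(scores)]]
print(data.frame(model = sapply(candidates, deparse), accuracy = scores))
```

Because each candidate is scored independently, the map step needs no communication between tasks, which is what makes this workload a natural fit for MapReduce.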
  • 18. RHadoop API
      mapreduce(
        input,
        output = NULL,
        map = to.map(identity),
        reduce = NULL,
        combine = NULL,
        reduce.on.data.frame = FALSE,
        input.format = "native",
        output.format = "native",
        vectorized = list(map = FALSE, reduce = FALSE),
        structured = list(map = FALSE, reduce = FALSE),
        backend.parameters = list(),
        verbose = TRUE)
  • 19. RHadoop API
      rmr.options.set(backend = c("hadoop", "local"),
        profile.nodes = NULL, vectorized.nrows = NULL)

      to.dfs(object, output = dfs.tempfile(),
        format = "native")

      from.dfs(input, format = "native",
        to.data.frame = FALSE, vectorized = FALSE,
        structured = FALSE)
  • 20. Doing Things the R Way
      •  Objects
          –  my_car = list(color="green", model="volt")
      •  Transforming a vector (list), iterating
          –  lapply/sapply/tapply – functional programming constructs
      •  Loops (not preferred)
          –  for (i in 1:100) {…}
          –  Note this does the same work as lapply(1:100, function(i){…}), except that lapply also returns the results as a list
      •  Other control structures – basically as you would expect
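A short base R sketch of the constructs named above, showing the loop and the functional versions side by side. The variable names are illustrative only.

```r
# A list is R's general-purpose record type.
my_car <- list(color = "green", model = "volt")

# Loop version: fill a preallocated vector.
squares_loop <- numeric(10)
for (i in 1:10) squares_loop[i] <- i^2

# Functional versions: lapply returns a list, sapply simplifies to a vector.
squares_list <- lapply(1:10, function(i) i^2)
squares_vec  <- sapply(1:10, function(i) i^2)

# tapply applies a function within groups defined by a second vector.
group_means <- tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), mean)

print(squares_vec)
print(group_means)
```

The functional forms are also the shape rmr expects: a map function is just the body of an `lapply` that happens to run on another machine.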
  • 21. Vectors in R
      •  R helps you! O_o
      •  Every object has a mode and length and hence can be interpreted as some sort of vector – even primitives!
      •  Even primitives such as strings or integers are stored in a vector of length 1, never free-standing
      •  There are lots of types of vectors
          –  Lists (think linked list)
          –  Atomic vectors (think array)
          http://cran.r-project.org/doc/manuals/R-intro.html#The-intrinsic-attributes-mode-and-length
      •  Type coercion usually works the way you would expect
          –  But… you may find yourself using as.list() or as.vector() or doing manual coercion frequently depending on what libraries you’re using due to mode not matching
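The points about mode, length, and coercion can be seen directly at the R prompt. A minimal base R sketch; all variable names are illustrative.

```r
# Scalars are vectors of length 1; there is no free-standing primitive.
n <- 42L
s <- "hello"
print(c(length(n), length(s)))

# Atomic vectors hold one type; lists can mix types.
atomic <- c(1, 2, 3)
mixed  <- list(1, "two", TRUE)
print(c(mode(atomic), mode(mixed)))  # "numeric" "list"

# Combining types in an atomic vector coerces to the most general one,
# so every element here becomes a character string.
coerced <- c(1, "two", TRUE)
print(coerced)

# Explicit coercion between the two kinds of vector.
as_list    <- as.list(atomic)
back_again <- unlist(as_list)
```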
  • 22. Example – Fake Data
      # two overlapping classes: z = 0 centered at (0, 0), z = 1 centered at (.75, 1)
      fakedata = data.frame(
        x = c(rnorm(100)*.25, rep(.75,100)+rnorm(100)*.25),
        y = c(rnorm(100), rep(1,100)+rnorm(100)),
        z = c(rep(0,100), rep(1,100)))

      # color the points by class label
      plot(fakedata[,"x"], fakedata[,"y"], col=sapply(fakedata[,"z"], function(z) ifelse(z>0,"blue","green")))
  • 23. Examples – Simple Parallelism
      library(rmr)
      rmr.options.set(backend="local")

      ints = to.dfs(1:100)

      # map receives (key, value), as in the later examples; the key is unused here
      squares = mapreduce(ints, map=function(nothing, x) keyval(NULL, x^2))

      print(from.dfs(squares))

      # notice the result will be
      # keyvals
  • 24. Examples – Trying Lots of SVM Kernels
      library(e1071)  # provides svm()
      kernels = to.dfs(list("linear","polynomial","radial","sigmoid"))

      # one map call per kernel; each call fits an SVM on fakedata
      models = from.dfs(mapreduce(kernels, map=function(nothing,kern) keyval(NULL, svm(factor(z)~., fakedata, kernel=kern))))

      plot(models[[1]][["val"]], fakedata)
  • 25. Examples – Different Models
      calls = to.dfs(list(list("glm", z~., family=binomial("logit"), fakedata), list("svm", z~., fakedata)))

      models = from.dfs(mapreduce(calls, map=function(nothing,callsig) keyval(NULL, do.call(callsig[[1]], callsig[2:length(callsig)]))))

      models[[1]][["val"]]