Introduction to Hadoop


1. Hadoop – Taming Big Data
   Jax ArcSig, June 2012
   Ovidiu Dimulescu
2. About @odimulescu
   • Working on the Web since 1997
   • Likes stuff well done
   • Into engineering cultures and all-around automation
   • Speaker at local user groups
   • Organizer for the local Mobile User Group, jaxmug.com
3. Agenda
   • Introduction
   • Use cases
   • Architecture
   • MapReduce examples
   • Q&A
4. What is Hadoop?
   • Apache Hadoop is an open source Java software framework for running data-intensive applications on large clusters of commodity hardware
   • Created by Doug Cutting (Lucene & Nutch creator)
   • Named after Doug's son's toy elephant
5. What does it solve, and how?
   • Processes diverse, large datasets in practical time at low cost
   • Consolidates data in a distributed file system
   • Moves computation to data rather than data to computation
   • Simpler programming model
6. Why does it matter?
   • Volume, Velocity, Variety and Value
   • Datasets do not fit on local HDDs, let alone in RAM
   • Data grows at a tremendous pace
   • Data is heterogeneous
   • Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.)
   • Scaling up has a ceiling (physical, technical, etc.)
7. Why does it matter? Data types
   [Pie chart: ~80% complex data (images, video, logs, documents, call records, sensor data, mail archives) vs. ~20% structured data (user profiles, CRM, HR records). Chart source: IDC White Paper]
8. Why does it matter?
   • Need to process a 10 TB dataset
   • Assume a sustained transfer rate of 75 MB/s
   • On 1 node: scanning the data takes ~2 days
   • On a 10-node cluster: scanning the data takes ~5 hrs
   • Low $/TB for commodity drives
   • Low-end servers are multicore capable
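A quick back-of-the-envelope check of those figures (assuming 10 TB ≈ 10^7 MB and a perfectly parallel scan):

    10,000,000 MB ÷ 75 MB/s ≈ 133,000 s ≈ 37 h   (roughly 1.5–2 days on one node)
    37 h ÷ 10 nodes ≈ 3.7 h                      (the ~5 hrs on the slide presumably includes coordination overhead)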
9. Use Cases
   • ETL – Extract, Transform, Load
   • Recommendation engines
   • Customer churn analysis
   • Ad targeting
   • Data "sandbox"
10. Use Cases – Typical ETL
    [Diagram: Live DB and logs feed ETL 1 and ETL 2 into the Data Warehouse and Reporting DB, which serve BI applications]
11. Use Cases – Hadoop ETL
    [Diagram: Live DB and logs are loaded into Hadoop, which feeds the Data Warehouse and Reporting DB serving BI applications]
12. Use Cases – Analysis methods
    • Pattern recognition
    • Index building
    • Text mining
    • Collaborative filtering
    • Prediction models
    • Sentiment analysis
    • Graph creation and traversal
13. Who uses it?
14. Who supports it?
15. Why use Hadoop?
    • Practical to do things that previously were not:
      ✓ Shorter execution time
      ✓ Costs less
      ✓ Simpler programming model
    • Open system with greater flexibility
    • Large and growing ecosystem
16. Hadoop – Silver bullet?
    • Not a database replacement
    • Not a data warehouse (it complements one)
    • Not for interactive reporting
    • Not a general-purpose storage mechanism
    • Not for problems that cannot be parallelized in a shared-nothing fashion
17. Architecture – Design Axioms
    • System shall manage and heal itself
    • Performance shall scale linearly
    • Compute should move to data
    • Simple core, modular and extensible
18. Architecture – Core Components
    HDFS
      Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
    MapReduce
      Programming model for processing and generating large data sets.
19. Architecture – Official Extensions
    • Management: ZooKeeper, Chukwa
    • Data Access: Pig (Data Flow), Hive (SQL), Avro
    • Data Processing: MapReduce Framework
    • Storage: HDFS, HBase
20. Architecture – CDH Distribution
    1. CDH – Cloudera's Distribution of Hadoop
    2. Image credit: Cloudera presentation @ Microstrategy World 2011
21. HDFS – Design
    • Based on Google's GFS
    • Files are stored as blocks (64 MB default size)
    • Configurable data replication (3x, rack-aware)
    • Fault tolerant, expects HW failures
    • HUGE files; expects streaming access, not low latency
    • Mostly WORM (write once, read many)
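A minimal sketch of touching HDFS from the Java client API of that era; the NameNode URI and paths here are hypothetical, and in practice the address comes from core-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address (normally read from core-site.xml)
            conf.set("fs.default.name", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Write: the client streams data; the NameNode assigns blocks to DataNodes
            FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"));
            out.writeUTF("Hello HDFS");
            out.close();

            // Read: the NameNode returns block locations; data comes from DataNodes
            FSDataInputStream in = fs.open(new Path("/demo/hello.txt"));
            System.out.println(in.readUTF());
            in.close();
        }
    }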
22. HDFS – Architecture
    [Diagram: HDFS read path – the client asks the NameNode (NN) for a file, the NN returns the DataNodes (DN) that host its blocks, and the client reads the data directly from those DataNodes]
    NameNode – Master
    • Filesystem metadata
    • Controls read/write to files
    • Manages block replication
    • Applies transaction log on startup
    DataNode – Slaves
    • Reads/writes blocks to/from clients
    • Replicates blocks at the master's request
23. HDFS – Fault tolerance
    • DataNode
      - Uses CRC to avoid corruption
      - Data is replicated on other nodes (3x)
    • NameNode
      - Checkpoint NameNode
      - Backup NameNode
      - Failover is manual
24. MapReduce – Design
    • Based on Google's MapReduce paper
    • Borrows from functional programming
    • Simpler programming model:
      - map(in_key, in_value) -> (out_key, intermediate_value) list
      - reduce(out_key, intermediate_value list) -> out_value list
    • No user synchronization and coordination
    Input -> Map -> Reduce -> Output
25. MapReduce – Architecture
    [Diagram: the client launches a job (configuration, mapper, reducer, input, output) through the Jobs API to the JobTracker (JT), which farms tasks out to TaskTracker 1..N]
    JobTracker – Master
    • Accepts MR jobs submitted by clients
    • Assigns Map and Reduce tasks to TaskTrackers, data-locality aware
    • Monitors task and TaskTracker status, re-executes tasks upon failure
    • Speculative execution
    TaskTracker – Slaves
    • Run Map and Reduce tasks received from the JobTracker
    • Manage storage and transmission of intermediate output
26. Hadoop – Core Architecture
    [Diagram: the MapReduce layer (JobTracker over TaskTracker 1..N) stacked on the HDFS layer (NameNode over DataNode 1..N), with a TaskTracker and a DataNode co-located on each worker node; in effect a mini OS: a file system plus a scheduler]
27. MapReduce – Head First Style
    http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
28. MapReduce – Mapper Types
    One-to-One
      map(k, v) = emit (k, transform(v))
    Exploder
      map(k, v) = foreach p in v: emit (k, p)
    Filter
      map(k, v) = if cond(v) then emit (k, v)
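For instance, the filter pattern written against the Java MapReduce API might look like the sketch below; the ERROR-line condition and class name are made-up examples:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Filter mapper: emit only the records that satisfy a condition
    public class ErrorLineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains("ERROR")) {  // cond(v)
                context.write(key, value);             // emit (k, v)
            }
        }
    }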
29. MapReduce – Reducer Types
    Sum Reducer
      reduce(k, vals) =
        sum = 0
        foreach v in vals: sum += v
        emit (k, sum)
30. MapReduce – High level pipeline
    [Diagram: mappers emit records keyed K1 and K2 in mixed order; the shuffle groups them so all K1 values reach one reducer and all K2 values another]
31. MapReduce – Detailed pipeline
    Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
32. MapReduce – Combiner Phase
    • Optional
    • Runs on mapper nodes after the map phase
    • A "mini-reduce," run only on local map output
    • Used to save bandwidth before sending data to the full reducer
    • The Reducer can serve as the Combiner if:
      1. Its output key/values are the same types as its input key/values
      2. The operation is commutative and associative (SUM, MAX ok, but AVG is not)
    Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
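In the Java API this is a single job setting; a sketch, assuming the word-count classes used in the examples later in the deck:

    // In the driver: reuse the sum reducer as a combiner (safe because SUM is
    // commutative and associative, so partial sums can be merged downstream)
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);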
33. Installation
    1. Download & configure a single-node cluster: hadoop.apache.org/common/releases.html
    2. Download a demo VM: Cloudera, Hortonworks
    3. Use a hosted environment (Amazon's EMR, Azure)
34. Installation – Platform Notes
    Production
      Linux – official
    Development
      Linux, OS X, Windows via Cygwin, *nix
35. MapReduce – Client Languages
    Java, any JVM language – native
      hadoop jar jar_path main_class input_path output_path
    C++ – Pipes framework – socket IO
      hadoop pipes -input path_in -output path_out -program exec_program
    Any – Streaming – stdin/stdout
      hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
    Pig Latin, Hive HQL, C via JNI
36. MapReduce – Client Anatomy
    • Main Program (aka Driver)
      - Configures the job
      - Initiates the job
    • Input location
    • Mapper
    • Combiner (optional)
    • Reducer
    • Output location
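A minimal sketch of such a driver for the word-count example, assuming the pre-YARN org.apache.hadoop.mapreduce API and the TokenizerMapper/IntSumReducer classes sketched under the Java mapper and reducer slides below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");  // 1.x-era style; later versions use Job.getInstance(conf)
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional combiner
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input location
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location
            System.exit(job.waitForCompletion(true) ? 0 : 1);        // initiate the job
        }
    }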
37. MapReduce – Word Count Example
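The original slide is an image; the gist of the example is this data flow (the input sentence is illustrative):

    input:   "to be or not to be"
    map:     (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
    shuffle: (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
    reduce:  (be,2) (not,1) (or,1) (to,2)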
38. MapReduce – C# Mapper
39. MapReduce – C# Reducer
40. MapReduce – Java Mapper
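The slide's code is an image; a sketch of what a word-count mapper of that era typically looks like (the class name TokenizerMapper is an assumption, mirroring the stock Hadoop example):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // "Exploder" mapper: each input line yields one (word, 1) pair per token
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }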
41. MapReduce – Java Reducer
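Likewise a sketch of the matching sum reducer, following the reduce(k, vals) pseudocode from the Reducer Types slide:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum reducer: add up all the counts emitted for this word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);  // emit (k, sum)
        }
    }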
42. MapReduce – JavaScript Mapper
43. MapReduce – JavaScript Reducer
44. Summary
    Hadoop is an economical, scalable, distributed data processing system that enables:
    ✓ Consolidation (structured data or not)
    ✓ Query flexibility (any language)
    ✓ Agility (evolving schemas)
45. Questions?
46. References
    • Hadoop at Yahoo!, by Yahoo! Developer Network
    • MapReduce in Simple Terms, by Saliya Ekanayake
    • Hadoop Architecture, by Philippe Julio
    • 10 Hadoop-able Problems, by Cloudera
    • Hadoop, An Industry Perspective, by Amr Awadallah
    • Anatomy of a MapReduce Job Run, by Tom White
    • MapReduce Jobs in Hadoop
