Introduction to Hadoop
Presentation Transcript

  • Hadoop – Taming Big Data • Jax ArcSig, June 2012 • Ovidiu Dimulescu
  • About @odimulescu • Working on the Web since 1997 • Likes stuff well done • Into engineering cultures and all-around automation • Speaker at local user groups • Organizer for the local Mobile User Group, jaxmug.com
  • Agenda • Introduction • Use cases • Architecture • MapReduce Examples • Q&A
  • What is Hadoop? • Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware • Created by Doug Cutting (Lucene & Nutch creator) • Named after Doug's son's toy elephant
  • What is it solving, and how? • Processing diverse large datasets in practical time at low cost • Consolidates data in a distributed file system • Moves computation to data rather than data to computation • Simpler programming model
  • Why does it matter? • Volume, Velocity, Variety and Value • Datasets do not fit on local HDDs, let alone in RAM • Data grows at a tremendous pace • Data is heterogeneous • Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.) • Scaling up has a ceiling (physical, technical, etc.)
  • Why does it matter? • Data types (chart): ~80% complex data (images, video, logs, documents, call records, sensor data, mail archives) vs. ~20% structured data (user profiles, CRM, HR records) • Chart source: IDC White Paper
  • Why does it matter? • Need to process a 10TB dataset • Assume sustained transfer of 75MB/s • On 1 node, scanning the data takes ~2 days • On a 10-node cluster, scanning the data takes ~5 hrs • Low $/TB for commodity drives • Low-end servers are multicore capable
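A back-of-the-envelope check of those numbers (sequential scan only, ignoring seek time and coordination overhead):

    10 TB / 75 MB/s ≈ 10,000,000 MB / 75 MB/s ≈ 133,000 s ≈ 37 hrs (on the order of ~2 days on one node)
    Split across 10 nodes scanning in parallel: ≈ 3.7 hrs (~5 hrs once overhead is included)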
  • Use Cases • ETL - Extract Transform Load • Recommendation Engines • Customer Churn Analysis • Ad Targeting • Data "sandbox"
  • Use Cases - Typical ETL (diagram: Live DB and Logs feed ETL 1 into the Data Warehouse; ETL 2 loads a Reporting DB; BI Applications sit on top)
  • Use Cases - Hadoop ETL (diagram: Live DB and Logs are loaded into Hadoop, which feeds the Data Warehouse and Reporting DB behind the BI Applications)
  • Use Cases – Analysis methods • Pattern recognition • Index building • Text mining • Collaborative filtering • Prediction models • Sentiment analysis • Graph creation and traversal
  • Who uses it?
  • Who supports it?
  • Why use Hadoop? • Practical to do things that were previously not: ✓ shorter execution time ✓ lower cost ✓ simpler programming model • Open system with greater flexibility • Large and growing ecosystem
  • Hadoop – Silver bullet? • Not a database replacement • Not a data warehouse (it complements one) • Not for interactive reporting • Not a general-purpose storage mechanism • Not for problems that are not parallelizable in a share-nothing fashion
  • Architecture – Design Axioms • System Shall Manage and Heal Itself • Performance Shall Scale Linearly • Compute Should Move to Data • Simple Core, Modular and Extensible
  • Architecture – Core Components • HDFS: distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster • MapReduce: programming model for processing and generating large data sets
  • Architecture – Official Extensions • Management: ZooKeeper, Chukwa • Data Access: Pig (Data Flow), Hive (SQL), Avro • Data Processing: MapReduce Framework • Storage: HDFS, HBase
  • Architecture – CDH Distribution • CDH – Cloudera's Distribution of Hadoop • Image credit: Cloudera presentation @ MicroStrategy World 2011
  • HDFS - Design • Based on Google's GFS • Files are stored as blocks (64MB default size) • Configurable data replication (3x, rack-aware) • Fault-tolerant, expects HW failures • HUGE files, designed for streaming access rather than low latency • Mostly WORM (write once, read many)
  • HDFS - Architecture (diagram) • Read path: the client asks the NameNode (NN) for a file; the NN returns the DataNodes (DNs) that host its blocks; the client reads the data directly from the DNs • NameNode (master): holds filesystem metadata, controls read/write access to files, manages block replication, applies the transaction log on startup • DataNodes (slaves): read/write blocks to/from clients, replicate blocks at the master's request
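To make the read path concrete, a minimal client sketch using the standard org.apache.hadoop.fs API (the file path is hypothetical); open() asks the NameNode for block locations, and the returned stream reads the bytes directly from DataNodes:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);     // client handle; metadata ops go to the NN
            try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) > 0) {      // block data streams from the DataNodes
                    System.out.write(buf, 0, n);
                }
            }
        }
    }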
  • HDFS – Fault tolerance • DataNode: uses CRC checksums to detect corruption; data is replicated on other nodes (3x) • NameNode: Checkpoint NameNode and Backup NameNode; failover is manual
  • MapReduce - Design • Based on Google's MapReduce paper • Borrows from functional programming • Simpler programming model: map(in_key, in_value) -> list of (out_key, intermediate_value); reduce(out_key, list of intermediate_value) -> list of out_value • No user-level synchronization or coordination • Input -> Map -> Reduce -> Output
  • MapReduce - Architecture (diagram) • A client launches a job consisting of a configuration, a mapper, a reducer, and input and output locations • JobTracker (master): accepts MR jobs submitted by clients; assigns Map and Reduce tasks to TaskTrackers in a data-locality-aware way; monitors task and TaskTracker status and re-executes tasks upon failure; performs speculative execution • TaskTrackers (slaves): run the Map and Reduce tasks received from the JobTracker; manage storage and transmission of intermediate output
  • Hadoop - Core Architecture (diagram): JobTracker and NameNode are the masters; each worker node runs a TaskTracker alongside a DataNode, so compute is co-located with HDFS storage; in effect a mini OS, providing a file system and a scheduler
  • MapReduce – Head First Style • http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
  • MapReduce – Mapper Types • One-to-One: map(k, v) = emit (k, transform(v)) • Exploder: map(k, v) = foreach p in v: emit (k, p) • Filter: map(k, v) = if cond(v) then emit (k, v)
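For illustration, the Filter shape as a minimal sketch against the Java org.apache.hadoop.mapreduce API; the non-empty-line test here is just a placeholder for cond(v):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FilterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.getLength() > 0) {   // if cond(v) then emit (k, v)
                context.write(key, value);
            }
        }
    }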
  • MapReduce – Reducer Types • Sum Reducer: reduce(k, vals) = { sum = 0; foreach v in vals: sum += v; emit (k, sum) }
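The same sum reducer expressed against Hadoop's Java API, a minimal sketch assuming Text keys and IntWritable counts:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                       // foreach v in vals: sum += v
            }
            context.write(key, new IntWritable(sum)); // emit (k, sum)
        }
    }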
  • MapReduce – High-level pipeline (diagram: map output keys K1, K2, ... are partitioned and shuffled so that each reducer receives all values for its keys)
  • MapReduce – Detailed pipeline • Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
  • MapReduce – Combiner Phase • Optional • Runs on mapper nodes after the map phase • A "mini-reduce" that runs only on local map output • Used to save bandwidth before sending data to the full reducer • The Reducer can serve as the Combiner if: 1. its output key/value types match its input key/value types, and 2. the operation is commutative and associative (SUM and MAX are OK, but AVG is not) • Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
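Wiring-wise, reusing a reducer as the combiner is a single call on the job; a fragmentary sketch (this belongs inside a driver, and SumReducer / WordCountMapper are the hypothetical classes sketched nearby):

    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(SumReducer.class); // mini-reduce over local map output
    job.setReducerClass(SumReducer.class);  // legal here because SUM is commutative and associative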
  • Installation • 1. Download & configure a single-node cluster: hadoop.apache.org/common/releases.html • 2. Download a demo VM: Cloudera, Hortonworks • 3. Use a hosted environment (Amazon's EMR, Azure)
  • Installation – Platform Notes • Production: Linux (official) • Development: Linux, OS X, Windows via Cygwin, *nix
  • MapReduce – Client Languages • Java and any JVM language - native: hadoop jar jar_path main_class input_path output_path • C++ - Pipes framework (socket IO): hadoop pipes -input path_in -output path_out -program exec_program • Any language - Streaming (stdin/stdout): hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out • Pig Latin, Hive HQL, C via JNI
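As a concrete streaming invocation, ordinary Unix tools can serve as the mapper and reducer (the input/output paths are placeholders):

    hadoop jar hadoop-streaming.jar -mapper /bin/cat -reducer /usr/bin/wc -input in_dir -output out_dir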
  • MapReduce – Client Anatomy • Main Program (aka Driver): configures the job and initiates it (a minimal sketch follows) • Input location • Mapper • Combiner (optional) • Reducer • Output location
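A minimal driver sketch following that anatomy, assuming the org.apache.hadoop.mapreduce API; the class names are placeholders, with the mapper and reducer sketched around the word-count example below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount"); // configure the job
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // mapper, sketched below
            job.setCombinerClass(SumReducer.class);       // optional combiner
            job.setReducerClass(SumReducer.class);        // reducer, sketched earlier
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input location
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location
            System.exit(job.waitForCompletion(true) ? 0 : 1);       // initiate the job
        }
    }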
  • MapReduce – Word Count Example
  • MapReduce – C# Mapper
  • MapReduce – C# Reducer
  • MapReduce – Java Mapper
  • MapReduce – Java Reducer
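The mapper and reducer on these slides are images in the original deck and do not survive as text; as a stand-in, a minimal Java word-count mapper that pairs with the SumReducer sketched earlier (which also serves as the combiner):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                context.write(word, ONE); // emit (word, 1) for every token in the line
            }
        }
    }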
  • MapReduce – JavaScript Mapper
  • MapReduce – JavaScript Reducer
  • Summary • Hadoop is an economical, scalable, distributed data-processing system which enables: ✓ Consolidation (structured data or not) ✓ Query flexibility (any language) ✓ Agility (evolving schemas)
  • Questions?
  • References • Hadoop at Yahoo!, by Y! Developer Network • MapReduce in Simple Terms, by Saliya Ekanayake • Hadoop Architecture, by Philippe Julio • 10 Hadoop-able Problems, by Cloudera • Hadoop, An Industry Perspective, by Amr Awadallah • Anatomy of a MapReduce Job Run, by Tom White • MapReduce Jobs in Hadoop