Hadoop – Taming Big Data
Jax ArcSig, June 2012

Ovidiu Dimulescu
  
About @odimulescu

•  Working on the Web since 1997
•  Likes stuff well done
•  Into engineering cultures and all-around automation
•  Speaker at local user groups
•  Organizer for the local Mobile User Group jaxmug.com
  
Agenda

•  Introduction
•  Use cases
•  Architecture
•  MapReduce Examples
•  Q&A
  
What is Hadoop?

•  Apache Hadoop is an open source Java software framework for running data-intensive applications on large clusters of commodity hardware
•  Created by Doug Cutting (Lucene & Nutch creator)
•  Named after Doug’s son’s toy elephant
  
What does it solve, and how?

•  Processing diverse large datasets in practical time at low cost
•  Consolidates data in a distributed file system
•  Moves computation to data rather than data to computation
•  Simpler programming model
  
Why does it matter?

•  Volume, Velocity, Variety and Value
•  Datasets do not fit on local HDDs, let alone RAM
•  Data grows at a tremendous pace
•  Data is heterogeneous
•  Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.)
•  Scaling up has a ceiling (physical, technical, etc.)
  
Why does it matter? Data types

[Pie chart: data split 80% / 20% between Complex Data and Structured Data]
•  Complex Data: images, video, logs, documents, call records, sensor data, mail archives
•  Structured Data: user profiles, CRM, HR records

* Chart Source: IDC White Paper
  
Why does it matter?

•  Need to process a 10TB dataset
•  Assume sustained transfer of 75MB/s
•  On 1 node - Scanning data ~ 2 days
•  On 10 node cluster - Scanning data ~ 5 hrs
•  Low $/TB for commodity drives
•  Low-end servers are multicore capable
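
The scan-time estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and a perfectly even split across nodes with no coordination overhead; the function name `scan_hours` is illustrative only. Under those assumptions the ideal figures come out at roughly 37 hours on one node and 3.7 hours on ten, so the slide's rounder numbers sit on the conservative side.

```python
# Back-of-the-envelope time for one full pass over a dataset.
# Assumes decimal units (1 TB = 1,000,000 MB) and ideal parallel speedup.

def scan_hours(dataset_tb: float, mb_per_sec: float, nodes: int = 1) -> float:
    """Hours needed to read the whole dataset once, split evenly across nodes."""
    total_mb = dataset_tb * 1_000_000
    seconds = total_mb / (mb_per_sec * nodes)
    return seconds / 3600


if __name__ == "__main__":
    for nodes in (1, 10):
        print(f"{nodes:>2} node(s): {scan_hours(10, 75, nodes):.1f} h")
```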
  
Use Cases

•  ETL - Extract Transform Load
•  Recommendation Engines
•  Customer Churn Analysis
•  Ad Targeting
•  Data “sandbox”
  
Use Cases - Typical ETL

[Diagram components: Live DB, Logs, ETL 1, ETL 2, Data Warehouse, Reporting DB, BI Applications]
  
Use Cases - Hadoop ETL

[Diagram components: Live DB, Logs, Data Loading, Hadoop, Data Loading, Data Warehouse, Reporting DB, BI Applications]
  
Use Cases – Analysis methods

•  Pattern recognition
•  Index building
•  Text mining
•  Collaborative filtering
•  Prediction models
•  Sentiment analysis
•  Graph creation and traversal
  
Who uses it?

Who supports it?
  
Why use Hadoop?

•  Practical to do things that were previously not
    ✓  Shorter execution time
    ✓  Costs less
    ✓  Simpler programming model
•  Open system with greater flexibility
•  Large and growing ecosystem
  
Hadoop – Silver bullet?

•  Not a database replacement
•  Not a data warehouse (complements it)
•  Not for interactive reporting
•  Not a general purpose storage mechanism
•  Not for problems that are not parallelizable in a shared-nothing fashion
  
Architecture – Design Axioms

•  System Shall Manage and Heal Itself
•  Performance Shall Scale Linearly
•  Compute Should Move to Data
•  Simple Core, Modular and Extensible
  
Architecture – Core Components

HDFS
Distributed filesystem designed for low cost storage and high bandwidth access across the cluster.

MapReduce
Programming model for processing and generating large data sets.
  
Architecture – Official Extensions

•  Management: ZooKeeper, Chukwa
•  Data Access: Pig (Data Flow), Hive (SQL), Avro
•  Data Processing: MapReduce Framework
•  Storage: HDFS, HBase
  
Architecture – CDH Distribution

1.  CDH – Cloudera’s Distribution of Hadoop
2.  Image credit - Cloudera presentation @ Microstrategy World 2011
  
HDFS - Design

•  Based on Google’s GFS
•  Files are stored as blocks (64MB default size)
•  Configurable data replication (3x, Rack Aware)
•  Fault Tolerant, Expects HW failures
•  HUGE files, Expects Streaming not Low Latency
•  Mostly WORM (write once, read many)
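
To make the block and replication figures concrete, here is a minimal sketch of the bookkeeping they imply. It uses only the two defaults quoted above (64MB blocks, 3x replication); the function names are illustrative and are not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted above
REPLICATION = 3      # default replication factor quoted above


def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb: float, replication: int = REPLICATION) -> float:
    """Total cluster storage consumed once every block is replicated."""
    return file_size_mb * replication


if __name__ == "__main__":
    size_mb = 1024  # a 1 GB file
    print(block_count(size_mb), "blocks,", raw_storage_mb(size_mb), "MB raw")
```

A 1 GB file therefore becomes 16 blocks and occupies 3 GB of raw disk across the cluster, which is the price paid for surviving hardware failures.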
  
HDFS - Architecture

[Diagram: a client talks to the Namenode (NN) and to Datanode 1, Datanode 2 ... Datanode N]

1.  Client asks NN for file
2.  NN returns DNs that host it
3.  Client asks DN for data

Namenode - Master
•  Filesystem metadata
•  Controls read/write to files
•  Manages block replication
•  Applies transaction log on startup

Datanode - Slaves
•  Read / write blocks to/from clients
•  Replicate blocks at master’s request
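
The three-step read path can be sketched as a toy, single-process model. The classes below are hypothetical stand-ins, not Hadoop classes: the NameNode object holds only metadata (which nodes host which block), the DataNode objects hold only block bytes, and the client stitches the two together.

```python
class NameNode:
    """Holds metadata only: path -> ordered blocks -> datanodes hosting each block."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [datanode names])

    def add_file(self, path, blocks):
        self.block_map[path] = blocks

    def locate(self, path):
        return self.block_map[path]


class DataNode:
    """Stores block contents; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """Client read path: ask the NameNode where blocks live, then fetch from DataNodes."""
    data = b""
    for block_id, hosts in namenode.locate(path):
        # Simplification: always take the first listed replica.
        data += datanodes[hosts[0]].read_block(block_id)
    return data


if __name__ == "__main__":
    dn1, dn2 = DataNode("dn1"), DataNode("dn2")
    dn1.blocks["b1"] = b"hello "
    dn2.blocks["b2"] = b"world"
    nn = NameNode()
    nn.add_file("/demo.txt", [("b1", ["dn1"]), ("b2", ["dn2"])])
    print(read_file("/demo.txt", nn, {"dn1": dn1, "dn2": dn2}))
```

Note that file data never passes through the NameNode in this model; it only answers the lookup, which is what keeps the master lightweight.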
  
	
  
HDFS – Fault tolerance

•  DataNode
    ▪  Uses CRC to detect corruption
    ▪  Data is replicated on other nodes (3x)
•  NameNode
    ▪  Checkpoint NameNode
    ▪  Backup NameNode
    ▪  Failover is manual
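
As an illustration of checksum-based corruption detection, the sketch below records a CRC32 when a block is written and re-checks it on read. It uses Python's `zlib.crc32` as a stand-in; how HDFS lays out and stores its own checksums is not covered here, and the function names are hypothetical.

```python
import zlib
from typing import Tuple


def store_block(data: bytes) -> Tuple[bytes, int]:
    """Return the block together with the CRC32 computed at write time."""
    return data, zlib.crc32(data)


def verify_block(data: bytes, expected_crc: int) -> bool:
    """True if the block still matches the checksum recorded when it was written."""
    return zlib.crc32(data) == expected_crc


if __name__ == "__main__":
    block, crc = store_block(b"payload")
    print(verify_block(block, crc))        # intact copy
    print(verify_block(b"pazload", crc))   # one corrupted byte
```

When a check like this fails, the replicas on other nodes are what make recovery possible: the reader can fall back to another copy of the same block.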
  
MapReduce - Design

•  Based on Google’s MR paper
•  Borrows from functional programming
•  Simpler programming model
    ▪  map (in_key, in_value) -> (out_key, intermediate_value) list
    ▪  reduce (out_key, intermediate_value list) -> out_value list
•  No user synchronization and coordination

Input -> Map -> Reduce -> Output
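
The two signatures above are enough to build a tiny single-process version of the pipeline. This is a sketch of the programming model only, not of Hadoop itself: `run_mapreduce`, `wc_map` and `wc_reduce` are hypothetical names, and the grouping step in the middle stands in for the shuffle that the framework performs between map and reduce.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Input -> Map -> (group by key) -> Reduce -> Output, all in one process."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in mapper(in_key, in_value):
            groups[out_key].append(value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


def wc_map(_line_no, line):
    """map(in_key, in_value) -> list of (out_key, intermediate_value)."""
    for word in line.split():
        yield word.lower(), 1


def wc_reduce(word, counts):
    """reduce(out_key, intermediate_value list) -> list of out_value."""
    yield word, sum(counts)


if __name__ == "__main__":
    lines = [(0, "big data"), (1, "Big elephants")]
    print(run_mapreduce(lines, wc_map, wc_reduce))
```

The user code (`wc_map`, `wc_reduce`) contains no locking or messaging at all, which is the point of the "no user synchronization" bullet.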
  
MapReduce - Architecture

[Diagram: a client submits jobs through the Jobs API to the JobTracker (JT), which drives TaskTracker 1, TaskTracker 2 ... TaskTracker N]

Client launches a job
     - Configuration
     - Mapper
     - Reducer
     - Input
     - Output

JobTracker - Master
•  Accepts MR jobs submitted by clients
•  Assigns Map and Reduce tasks to TaskTrackers, data locality aware
•  Monitors tasks and TaskTracker status, re-executes tasks upon failure
•  Speculative execution

TaskTracker - Slaves
•  Run Map and Reduce tasks received from JobTracker
•  Manage storage and transmission of intermediate output
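
"Data locality aware" assignment can be illustrated with a greedy toy scheduler. This is only a sketch under simplifying assumptions (one map task per block, a fixed slot count per node, no rack awareness, no retries); the function `assign_tasks` is hypothetical and does not reflect the JobTracker's actual algorithm.

```python
def assign_tasks(block_hosts, free_slots):
    """Greedy, locality-aware assignment of one map task per input block.

    block_hosts: block id -> list of nodes holding a replica
    free_slots:  node -> number of free task slots (mutated as slots are used)
    Returns block id -> (chosen node, True if the node holds the block locally).
    """
    assignment = {}
    for block, hosts in block_hosts.items():
        # Prefer a node that already stores the block (compute moves to data).
        local = [h for h in hosts if free_slots.get(h, 0) > 0]
        candidates = local or [n for n, s in free_slots.items() if s > 0]
        if not candidates:
            break  # no capacity left; remaining blocks wait for a later round
        node = candidates[0]
        free_slots[node] -= 1
        assignment[block] = (node, node in hosts)
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]}
    slots = {"n1": 1, "n2": 1, "n3": 0, "n4": 1}
    for block, (node, is_local) in assign_tasks(blocks, slots).items():
        print(block, "->", node, "(local)" if is_local else "(remote read)")
```

Only when no replica holder has a free slot does a task fall back to a remote node, which then has to pull the block over the network.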
  
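Hadoop - Core Architecture

[Diagram: the Jobs API feeds the JobTracker, which drives TaskTracker 1, TaskTracker 2 ... TaskTracker N; each TaskTracker is paired with DataNode 1, DataNode 2 ... DataNode N, and the DataNodes belong to HDFS under the NameNode]

Mini OS
   •  File system
   •  Scheduler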
  
MapReduce – Head First Style

http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
  
MapReduce – Mapper Types

One-to-One
    map(k, v) = emit (k, transform(v))

Exploder
    map(k, v) = foreach p in v: emit (k, p)

Filter
    map(k, v) = if cond(v) then emit (k, v)
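
The three mapper shapes translate almost word for word into Python generators, where `yield` plays the role of `emit`. The default `transform` and `cond` arguments below are arbitrary placeholders chosen for illustration.

```python
def one_to_one(k, v, transform=str.upper):
    """Emit exactly one record per input record."""
    yield k, transform(v)


def exploder(k, v):
    """Emit one record per element of the input value."""
    for p in v:
        yield k, p


def filter_mapper(k, v, cond=lambda value: value > 0):
    """Emit the record unchanged only when the condition holds."""
    if cond(v):
        yield k, v


if __name__ == "__main__":
    print(list(one_to_one("a", "x")))
    print(list(exploder("a", [1, 2])))
    print(list(filter_mapper("a", -1)), list(filter_mapper("a", 3)))
```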
  
MapReduce – Reducer Types

Sum Reducer
    reduce(k, vals) =
        sum = 0
        foreach v in vals: sum += v
        emit (k, sum)
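
The same pseudocode as a runnable Python generator, offered as a direct transcription rather than anything Hadoop-specific; the name `sum_reducer` is illustrative.

```python
def sum_reducer(k, vals):
    """Add up every value grouped under key k and emit a single total."""
    total = 0
    for v in vals:
        total += v
    yield k, total


if __name__ == "__main__":
    print(list(sum_reducer("hits", [1, 2, 3])))
```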
  
     	
  
MapReduce – High level pipeline

[Diagram: mappers emit records keyed K1 and K2; records are grouped by key (K1, K2) before the reduce step]
  
MapReduce – Detailed pipeline

Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
  
MapReduce – Combiner Phase

•  Optional
•  Runs on mapper nodes after map phase
•  “Mini-reduce,” only on local map output
•  Used to save bandwidth before sending data to full reducer
•  The Reducer can be the Combiner if
    1.  Output key, values are the same as input key, values
    2.  Commutative and Associative (SUM, MAX ok but AVG not)

Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
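
Why SUM is safe as a combiner while AVG is not is easiest to see with numbers. The sketch below uses made-up values for a single key spread over two mapper nodes; it also shows a common workaround, offered here as a suggestion rather than something from the slides: have the combiner emit (sum, count) pairs and divide only in the final reducer.

```python
def mean(values):
    return sum(values) / len(values)


# Values for one key, as produced on two different mapper nodes.
node_a = [1, 2, 3]
node_b = [10]

# SUM: combining locally, then reducing, matches reducing everything at once.
sum_with_combiner = sum([sum(node_a), sum(node_b)])
sum_without = sum(node_a + node_b)

# AVG: averaging the local averages weights each node equally, which is wrong.
avg_with_naive_combiner = mean([mean(node_a), mean(node_b)])
avg_without = mean(node_a + node_b)

# Workaround: combine into (sum, count) pairs and divide only at the end.
partials = [(sum(node_a), len(node_a)), (sum(node_b), len(node_b))]
avg_from_partials = sum(s for s, _ in partials) / sum(c for _, c in partials)

print(sum_with_combiner, sum_without)          # 16 16
print(avg_with_naive_combiner, avg_without)    # 6.0 4.0
print(avg_from_partials)                       # 4.0
```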
  
Installation

1.  Download & configure single-node cluster
    hadoop.apache.org/common/releases.html
2.  Download a demo VM
      Cloudera
      Hortonworks
3.  Use a hosted environment (Amazon’s EMR, Azure)
  
Installation – Platform Notes

Production
    Linux – Official

Development
    Linux
    OSX
    Windows via Cygwin
    *Nix
  
MapReduce – Client Languages

Java, Any JVM Languages - Native
    hadoop jar jar_path main_class input_path output_path

C++ - Pipes framework – Socket IO
    hadoop pipes -input path_in -output path_out -program exec_program

Any – Streaming – Stdin / Stdout
    hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out

Pig Latin, Hive HQL, C via JNI
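
For the Streaming option, any program that reads lines on stdin and writes tab-separated key/value lines on stdout can act as `map_prog` or `reduce_prog`. Below is a minimal word count sketch in Python. The file name `wordcount.py` and the `map` / `reduce` command-line argument are hypothetical conventions chosen for this example; the reducer relies on its input arriving sorted by key, so equal words are adjacent.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: input is sorted by key, so counts for a word are contiguous."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical convention: `wordcount.py map` or `wordcount.py reduce`.
    step = {"map": map_lines, "reduce": reduce_lines}[sys.argv[1]]
    for out in step(sys.stdin):
        print(out)
```

The same script can be checked locally without a cluster by piping a text file through the map step, `sort`, and then the reduce step, which mimics what the framework does between the two phases.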
  
MapReduce – Client Anatomy

•  Main Program (aka Driver)
      Configures the Job
      Initiates the Job
•  Input Location
•  Mapper
•  Combiner (optional)
•  Reducer
•  Output Location
  
MapReduce – Word Count Example
MapReduce – C# Mapper
MapReduce – C# Reducer
MapReduce – Java Mapper
MapReduce – Java Reducer
MapReduce – JavaScript Mapper
MapReduce – JavaScript Reducer
  
Summary

Hadoop is an economical, scalable, distributed data processing system which enables data:

     ✓  Consolidation (Structured or Not)
     ✓  Query Flexibility (Any Language)
     ✓  Agility (Evolving Schemas)
  
Questions?
  
References

Hadoop at Yahoo!, by Y! Developer Network
MapReduce in Simple Terms, by Saliya Ekanayake
Hadoop Architecture, by Phillipe Julio
10 Hadoop-able Problems, by Cloudera
Hadoop, An Industry Perspective, by Amr Awadallah

Anatomy of a MapReduce Job Run by Tom White

MapReduceJobs in Hadoop

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your QueriesExploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Introduction to Hadoop

  • 1. Hadoop – Taming Big Data
      Jax ArcSig, June 2012
      Ovidiu Dimulescu
  • 2. About @odimulescu
      •  Working on the Web since 1997
      •  Likes stuff well done
      •  Into engineering cultures and all-around automation
      •  Speaker at local user groups
      •  Organizer for the local Mobile User Group jaxmug.com
  • 3. Agenda
      •  Introduction
      •  Use cases
      •  Architecture
      •  MapReduce Examples
      •  Q&A
  • 4. What is Hadoop?
      •  Apache Hadoop is an open source Java software framework for running data-intensive applications on large clusters of commodity hardware
      •  Created by Doug Cutting (Lucene & Nutch creator)
      •  Named after Doug’s son’s toy elephant
  • 5. What is it solving, and how?
      •  Processing diverse large datasets in practical time at low cost
      •  Consolidates data in a distributed file system
      •  Moves computation to data rather than data to computation
      •  Simpler programming model
  • 6. Why does it matter?
      •  Volume, Velocity, Variety and Value
      •  Datasets do not fit on local HDDs, let alone RAM
      •  Data grows at a tremendous pace
      •  Data is heterogeneous
      •  Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.)
      •  Scaling up has a ceiling (physical, technical, etc.)
  • 7. Why does it matter?
      Data types (pie chart; slices labelled 20% and 80%; legend: Complex, Structured)
      Complex Data: images, video, logs, documents, call records, sensor data, mail archives
      Structured Data: user profiles, CRM, HR records
      * Chart source: IDC White Paper
  • 8. Why does it matter?
      •  Need to process a 10TB dataset
      •  Assume sustained transfer of 75MB/s
      •  On 1 node: scanning data ~ 2 days
      •  On a 10-node cluster: scanning data ~ 5 hrs
      •  Low $/TB for commodity drives
      •  Low-end servers are multicore capable
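The scan-time estimates on this slide come down to one division. The sketch below is a rough illustration only: `scan_hours` is a hypothetical helper (not part of Hadoop), it assumes decimal units (1 TB = 1,000,000 MB), and it assumes the data is split evenly across nodes with no coordination overhead.

```python
def scan_hours(dataset_tb: float, mb_per_s: float, nodes: int = 1) -> float:
    """Hours needed to read the whole dataset once, split evenly across nodes."""
    total_mb = dataset_tb * 1_000_000        # assumption: decimal TB -> MB
    seconds = total_mb / (mb_per_s * nodes)  # every node scans its share in parallel
    return seconds / 3600


single = scan_hours(10, 75)             # one node
cluster = scan_hours(10, 75, nodes=10)  # ten nodes
print(f"1 node:   {single:.1f} h ({single / 24:.1f} days)")
print(f"10 nodes: {cluster:.1f} h")
```

Under these idealised assumptions the single node needs about 37 hours and the 10-node cluster about 3.7 hours. The slide's rounder figures (~2 days, ~5 hrs) are somewhat higher, presumably leaving room for real-world overhead.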
  • 9. Use Cases
      •  ETL - Extract Transform Load
      •  Recommendation Engines
      •  Customer Churn Analysis
      •  Ad Targeting
      •  Data “sandbox”
  • 10. Use Cases - Typical ETL
      Diagram labels: Data Warehouse, BI Applications, Live DB, ETL 1, ETL 2, Reporting DB, Logs
  • 11. Use Cases - Hadoop ETL
      Diagram labels: Data Warehouse, BI Applications, Live DB, Data Loading, Data Loading, Reporting DB, Hadoop, Logs
  • 12. Use Cases – Analysis methods
      •  Pattern recognition
      •  Index building
      •  Text mining
      •  Collaborative filtering
      •  Prediction models
      •  Sentiment analysis
      •  Graph creation and traversal
  • 15. Why use Hadoop?
      •  Makes it practical to do things that previously were not
          ✓  Shorter execution time
          ✓  Costs less
          ✓  Simpler programming model
      •  Open system with greater flexibility
      •  Large and growing ecosystem
  • 16. Hadoop – Silver bullet?
      •  Not a database replacement
      •  Not a data warehouse (complements it)
      •  Not for interactive reporting
      •  Not a general purpose storage mechanism
      •  Not for problems that are not parallelizable in a shared-nothing fashion
  • 17. Architecture – Design Axioms
      •  System Shall Manage and Heal Itself
      •  Performance Shall Scale Linearly
      •  Compute Should Move to Data
      •  Simple Core, Modular and Extensible
  • 18. Architecture – Core Components
      HDFS
          Distributed filesystem designed for low cost storage and high bandwidth access across the cluster.
      Map-Reduce
          Programming model for processing and generating large data sets.
  • 19. Architecture – Official Extensions
      Management: ZooKeeper, Chukwa
      Data Access: Pig (Data Flow), Hive (SQL), Avro
      Data Processing: MapReduce Framework
      Storage: HDFS, HBase
  • 20. Architecture – CDH Distribution
      1.  CDH – Cloudera’s Distribution of Hadoop
      2.  Image credit - Cloudera presentation @ Microstrategy World 2011
  • 21. HDFS - Design
      •  Based on Google’s GFS
      •  Files are stored as blocks (64MB default size)
      •  Configurable data replication (3x, Rack Aware)
      •  Fault Tolerant, Expects HW failures
      •  HUGE files, Expects Streaming not Low Latency
      •  Mostly WORM (write once, read many)
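To make the block and replication figures on this slide concrete, here is a small back-of-the-envelope sketch. `hdfs_footprint` is a hypothetical helper, not an HDFS API. It uses the 64MB block size and 3x replication quoted above, and it assumes the final, partly filled block only occupies the bytes it actually holds.

```python
import math

BLOCK_MB = 64     # default block size quoted on the slide
REPLICATION = 3   # default replication factor quoted on the slide


def hdfs_footprint(file_mb, block_mb=BLOCK_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored cluster-wide) for one file."""
    blocks = math.ceil(file_mb / block_mb)  # the last block may be partly filled
    replicas = blocks * replication         # each block is stored `replication` times
    raw_mb = file_mb * replication          # assumption: partial block is not padded
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(f"{blocks} blocks, {replicas} block replicas, {raw_mb} MB of raw storage")
```

A 1000MB file therefore splits into 16 blocks and is stored as 48 block replicas spread over the datanodes, which is also why large files suit this design better than many small ones.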
  • 22. HDFS - Architecture
      Diagram: Namenode (NN) and Datanodes 1, 2, ... N
          Client asks NN for file
          NN returns DNs that host it
          Client asks DN for data
      Namenode - Master
          •  Filesystem metadata
          •  Controls read/write to files
          •  Manages block replication
          •  Applies transaction log on startup
      Datanode - Slaves
          •  Read / write blocks to/from clients
          •  Replicate blocks at master’s request
  • 23. HDFS – Fault tolerance
      •  DataNode
          §  Uses CRC checksums to detect corruption
          §  Data is replicated on other nodes (3x)
      •  NameNode
          §  Checkpoint NameNode
          §  Backup NameNode
          §  Failover is manual
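The CRC point on the DataNode side can be illustrated with a few lines of standard-library Python. This is only a sketch of the idea (store a checksum per chunk, recompute on read, compare); it is not HDFS code. The 512-byte chunk size and the helper names `checksums` and `corrupt_chunks` are assumptions made for the example.

```python
import zlib

CHUNK = 512  # assumption: number of bytes covered by each checksum


def checksums(data, chunk=CHUNK):
    """CRC-32 of every fixed-size chunk of the data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def corrupt_chunks(data, expected, chunk=CHUNK):
    """Indexes of chunks whose recomputed CRC no longer matches the stored one."""
    actual = checksums(data, chunk)
    return [i for i, (got, want) in enumerate(zip(actual, expected)) if got != want]


block = bytes(range(256)) * 8   # 2048 bytes -> 4 chunks
stored = checksums(block)       # kept alongside the data at write time

damaged = bytearray(block)
damaged[600] ^= 0xFF            # flip the bits of one byte inside chunk 1
print(corrupt_chunks(bytes(damaged), stored))
```

A checksum only detects the damage. Recovery relies on the second bullet: a reader that sees a mismatch can fetch the same block from one of the other replicas.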
  • 24. MapReduce - Design
      •  Based on Google’s MR paper
      •  Borrows from functional programming
      •  Simpler programming model
          §  map (in_key, in_value) -> (out_key, intermediate_value) list
          §  reduce (out_key, intermediate_value list) -> out_value list
      •  No user synchronization and coordination
      Input -> Map -> Reduce -> Output
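The two signatures above can be exercised with a toy, single-process stand-in for the framework. This is a minimal sketch, not how Hadoop is implemented: `run_mapreduce`, the year/temperature records, and both user functions are made up for illustration, and the in-memory grouping replaces the distributed shuffle and sort.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy pipeline: Input -> Map -> (group by key) -> Reduce -> Output."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        # map(in_key, in_value) -> list of (out_key, intermediate_value)
        for out_key, intermediate in mapper(in_key, in_value):
            groups[out_key].append(intermediate)
    # reduce(out_key, intermediate_value list) -> out_value list
    return {key: list(reducer(key, groups[key])) for key in sorted(groups)}


def year_temp_mapper(_offset, line):
    year, temp = line.split(",")
    yield year, int(temp)


def max_reducer(_year, temps):
    yield max(temps)


records = [(0, "2011,31"), (1, "2012,35"), (2, "2011,28"), (3, "2012,40")]
result = run_mapreduce(records, year_temp_mapper, max_reducer)
print(result)
```

The user code consists only of the two small functions. Grouping values by key, which is the step that needs coordination, lives entirely in the framework, and that is the point of the "no user synchronization" bullet.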
  • 25. MapReduce - Architecture
      Diagram: Client launches a job (Configuration, Mapper, Reducer, Input, Output) through the Jobs API; JobTracker (JT); TaskTrackers 1, 2, ... N
      JobTracker - Master
          •  Accepts MR jobs submitted by clients
          •  Assigns Map and Reduce tasks to TaskTrackers, data locality aware
          •  Monitors tasks and TaskTracker status, re-executes tasks upon failure
          •  Speculative execution
      TaskTracker - Slaves
          •  Run Map and Reduce tasks received from JobTracker
          •  Manage storage and transmission of intermediate output
  • 26. Hadoop - Core Architecture
      Diagram labels: JobTracker, Jobs API, TaskTrackers 1, 2, ... N, DataNodes 1, 2, ... N, HDFS, NameNode
      Mini OS
          •  File system
          •  Scheduler
  • 27. MapReduce – Head First Style
      http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
  • 28. MapReduce – Mapper Types
      One-to-One
          map(k, v) = emit (k, transform(v))
      Exploder
          map(k, v) = foreach p in v: emit (k, p)
      Filter
          map(k, v) = if cond(v) then emit (k, v)
  • 29. MapReduce – Reducer Types
      Sum Reducer
          reduce(k, vals) =
              sum = 0
              foreach v in vals: sum += v
              emit (k, sum)
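The mapper and reducer types from the last two slides translate almost line for line into Python generators, where `yield` plays the role of `emit`. This is a sketch for illustration: the function names, the default `transform` and the default `cond` are assumptions, not part of any Hadoop API.

```python
def one_to_one(k, v, transform=str.lower):
    """One-to-One: exactly one output record per input record."""
    yield k, transform(v)


def exploder(k, v):
    """Exploder: one output record for each part of the input value."""
    for p in v:
        yield k, p


def keep_if(k, v, cond=lambda value: value > 0):
    """Filter: pass the record through only when the condition holds."""
    if cond(v):
        yield k, v


def sum_reducer(k, vals):
    """Sum Reducer: add up every value seen for the key."""
    total = 0
    for v in vals:
        total += v
    yield k, total


print(list(one_to_one("doc1", "HELLO")))
print(list(exploder("doc1", ["a", "b"])))
print(list(keep_if("sensor1", -5)))
print(list(sum_reducer("the", [1, 1, 1])))
```

Writing them as generators keeps the "zero, one, or many outputs per input" behaviour of each type visible in the code itself.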
  • 30. MapReduce – High level pipeline
      Diagram: keys K1 and K2 flowing through the pipeline
  • 31. MapReduce – Detailed pipeline
      Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
  • 32. MapReduce – Combiner Phase
      •  Optional
      •  Runs on mapper nodes after map phase
      •  “Mini-reduce,” only on local map output
      •  Used to save bandwidth before sending data to full reducer
      •  The Reducer can be used as the Combiner if
          1.  Its output key and value types are the same as its input key and value types
          2.  Its operation is commutative and associative (SUM, MAX ok but AVG not)
      Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
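Why SUM and MAX qualify as combiners while AVG does not is easiest to see with numbers. The sketch below simulates two mappers, each holding part of one key's values. `combine_then_reduce`, the sample data and the split points are assumptions made for the example.

```python
def combine_then_reduce(values, fn, splits):
    """Apply fn to each mapper-local slice (combiner), then to the partial results (reducer)."""
    partials = [fn(values[start:end]) for start, end in splits]
    return fn(partials)


def mean(xs):
    return sum(xs) / len(xs)


values = [1, 2, 3, 4, 100]
splits = [(0, 2), (2, 5)]  # mapper A holds [1, 2], mapper B holds [3, 4, 100]

print(combine_then_reduce(values, sum, splits), sum(values))    # same result
print(combine_then_reduce(values, max, splits), max(values))    # same result
print(combine_then_reduce(values, mean, splits), mean(values))  # results differ


def add_pairs(pairs):
    """Component-wise sum of (sum, count) pairs; safe to use as a combiner."""
    return sum(s for s, _ in pairs), sum(c for _, c in pairs)


partials = [add_pairs([(v, 1) for v in values[start:end]]) for start, end in splits]
total, count = add_pairs(partials)
print(total / count)  # the true average again
```

A common workaround for averages, sketched in the second half, is to carry (sum, count) pairs through the combiner and divide only in the final reducer, since adding pairs component-wise is commutative and associative.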
  • 33. Installation
      1.  Download & configure single-node cluster
          hadoop.apache.org/common/releases.html
      2.  Download a demo VM
          Cloudera
          Hortonworks
      3.  Use a hosted environment (Amazon’s EMR, Azure)
  • 34. Installation – Platform Notes
      Production
          Linux – Official
      Development
          Linux
          OSX
          Windows via Cygwin
          *Nix
  • 35. MapReduce – Client Languages
      Java, any JVM language (native)
          hadoop jar jar_path main_class input_path output_path
      C++ (Pipes framework, socket IO)
          hadoop pipes -input path_in -output path_out -program exec_program
      Any language (Streaming, stdin / stdout)
          hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
      Pig Latin, Hive HQL, C via JNI
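For the Streaming option, the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated `key<TAB>value` lines on stdout, and the reducer receives its input sorted by key. The word-count sketch below keeps both roles in one file so it can be tried locally. The function names and the `role` argument of `main` are assumptions for illustration, and the local `sorted()` call stands in for the framework's sort between the two phases.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(role, stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a real map_prog / reduce_prog script would call."""
    step = map_lines if role == "map" else reduce_lines
    for out in step(stdin):
        stdout.write(out + "\n")


# Local dry run without a cluster.
sample = ["the quick fox\n", "The lazy dog\n"]
for line in reduce_lines(sorted(map_lines(sample))):
    print(line)
```

On a cluster, two small scripts calling `main("map")` and `main("reduce")` would take the places of `map_prog` and `reduce_prog` in the streaming command on the slide.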
  • 36. MapReduce – Client Anatomy
      •  Main Program (aka Driver)
          Configures the Job
          Initiates the Job
      •  Input Location
      •  Mapper
      •  Combiner (optional)
      •  Reducer
      •  Output Location
  • 37. MapReduce – Word Count Example
  • 38. MapReduce – C# Mapper
  • 39. MapReduce – C# Reducer
  • 40. MapReduce – Java Mapper
  • 41. MapReduce – Java Reducer
  • 44. Summary
      Hadoop is an economical, scalable distributed data processing system which enables data:
          ✓  Consolidation (Structured or Not)
          ✓  Query Flexibility (Any Language)
          ✓  Agility (Evolving Schemas)
  • 46. References
      Hadoop at Yahoo!, by Y! Developer Network
      MapReduce in Simple Terms, by Saliya Ekanayake
      Hadoop Architecture, by Phillipe Julio
      10 Hadoop-able Problems, by Cloudera
      Hadoop, An Industry Perspective, by Amr Awadallah
      Anatomy of a MapReduce Job Run, by Tom White
      MapReduceJobs in Hadoop