Map Reduce An Introduction
How Map Reduce evolved
    Presentation Transcript

    • Nagarjuna K, nagarjuna@outlook.com
    • Agenda:
      - Understanding MapReduce
      - Map Reduce: an introduction
      - Word count: default
      - Word count: custom
    • MapReduce is a programming model to process large datasets.
      - Supported languages for MR: Java, Ruby, Python, C++
      - MapReduce programs are inherently parallel: more data → more machines to analyze, with no need to change anything in the code.
    • Start with the WORDCOUNT example: "Do as I say, not as I do"

      Word | Count
      as   | 2
      do   | 2
      i    | 2
      not  | 1
      say  | 1
    • Single-machine pseudocode:

      define wordCount as Map<String, long>;
      for each document in documentSet {
          T = tokenize(document);
          for each token in T {
              wordCount[token]++;
          }
      }
      display(wordCount);

      This works only as long as the number of documents to process is not very large.
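    The pseudocode above translates directly into plain Java; a minimal sketch (class and method names are my own, tokenization here simply lowercases and splits on non-letters):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // Count word occurrences in a single document, as in the slide's pseudocode.
    public static Map<String, Long> count(String document) {
        Map<String, Long> wordCount = new HashMap<>();
        // tokenize(document): lowercase and split on anything that is not a letter
        for (String token : document.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;   // skip empty leading token
            wordCount.merge(token, 1L, Long::sum);  // wordCount[token]++
        }
        return wordCount;
    }

    public static void main(String[] args) {
        System.out.println(count("Do as I say, not as I do"));
    }
}
```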
    • Spam filter:
      - Millions of emails
      - Word count for analysis
      - Working from a single computer is time consuming
      - Rewrite the program to count from multiple machines
    • How do we attain parallel computing?
      1. All the machines compute a fraction of the documents
      2. Combine the results from all the machines
    • STAGE 1:

      define wordCount as Map<String, long>;
      for each document in documentSubset {
          T = tokenize(document);
          for each token in T {
              wordCount[token]++;
          }
      }
    • STAGE 2:

      define totalWordCount as Multiset;
      for each wordCount received from firstPhase {
          multisetAdd(totalWordCount, wordCount);
      }
      display(totalWordCount);
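    The two stages can be simulated in plain Java within one process: stage 1 counts words inside each document subset, and stage 2 merges the partial counts received from stage 1 (a single-machine sketch; class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoStageWordCount {
    // Stage 1: each "machine" counts words in its own subset of documents.
    public static Map<String, Long> stage1(List<String> documentSubset) {
        Map<String, Long> wordCount = new HashMap<>();
        for (String document : documentSubset) {
            for (String token : document.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty()) wordCount.merge(token, 1L, Long::sum);
            }
        }
        return wordCount;
    }

    // Stage 2: one machine merges the partial counts received from stage 1.
    public static Map<String, Long> stage2(List<Map<String, Long>> partialCounts) {
        Map<String, Long> totalWordCount = new HashMap<>();
        for (Map<String, Long> partial : partialCounts) {
            partial.forEach((word, n) -> totalWordCount.merge(word, n, Long::sum));
        }
        return totalWordCount;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(stage1(List.of("do as I say")));   // subset on machine 1
        partials.add(stage1(List.of("not as I do")));   // subset on machine 2
        System.out.println(stage2(partials));
    }
}
```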
    • [Diagram: a master node distributes the documents across machines Comp-1 through Comp-4]
    • Problems with STAGE 1:
      - Document segregation has to be well defined
      - Bottleneck in network transfer: the processing is data-intensive, not computation-intensive, so it is better to store the files on the processing machines
      - BIGGEST FLAW: storing the words and counts in memory; a disk-based hash-table implementation is needed
    • Problems with STAGE 2:
      - Phase 2 has only one machine, a bottleneck, even though phase 1 is highly distributed
      - Make phase 2 distributed as well; this needs changes in phase 1
      - Partition the phase-1 output (say, based on the first character of the word), so we have 26 machines in phase 2
      - The single disk-based hash table becomes 26 disk-based hash tables: wordcount-a, wordcount-b, wordcount-c, ...
    • [Diagram: the master feeds documents to machines Comp-1 through Comp-4, whose partial counts for words starting with A, B, C, D, E, ... are routed to machines Comp-10 through Comp-40]
    • After phase 1, from Comp-1:
      - WordCount-A → Comp-10
      - WordCount-B → Comp-20
      - ...
      Each machine in phase 1 shuffles its output to the different machines in phase 2.
    • This is getting complicated:
      - Store files where they are being processed
      - Write a disk-based hash table, obviating RAM limitations
      - Partition the phase-1 output
      - Shuffle the phase-1 output and send it to the appropriate reducer
    • This is already a lot of work for word count, and we haven't even touched fault tolerance: what if Comp-1 or Comp-10 fails? Hence the need for a framework to take care of all these things, so that we concentrate only on the business logic.
    • [Diagram: documents in HDFS flow from the master to MAPPER machines Comp-1 through Comp-4; their interim output is partitioned and shuffled to REDUCER machines Comp-10 through Comp-40]
    • ¡  Mapper  ¡  Reducer    Mapper  filters  and  transforms  the  input    Reducer  collects  that  and  aggregate  on  that.    Extensive  research  is  done  two  arrive  at  two  phase  strategy     nagarjuna@outlook.com  
    • Mapper, Reducer, Partitioner and Shuffling work together → a common structure for data processing:

                Input            Output
      Mapper    <K1, V1>         list<K2, V2>
      Reducer   <K2, list(V2)>   list<K3, V3>
    • For word count:
      - Mapper input: <key, words_per_line>; output: <word, 1>
      - Reducer input: <word, list(1)>; output: <word, count(list(1))>
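    The word-count contract above can be simulated in plain Java without Hadoop on the classpath: map emits <word, 1> pairs, shuffle groups them by word into <word, list(1)>, and reduce counts the list (a sketch of the contract, not Hadoop's actual API; names are my own):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceContract {
    // Mapper: <lineNo, line> -> list of <word, 1>
    public static List<SimpleEntry<String, Long>> map(long lineNo, String line) {
        List<SimpleEntry<String, Long>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) out.add(new SimpleEntry<>(token, 1L));
        }
        return out;
    }

    // Shuffle: group mapper output by key -> <word, list(1)>
    public static Map<String, List<Long>> shuffle(List<SimpleEntry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new HashMap<>();
        for (SimpleEntry<String, Long> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reducer: <word, list(1)> -> count(list(1))
    public static long reduce(String word, List<Long> ones) {
        return ones.stream().mapToLong(Long::longValue).sum();
    }
}
```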
    • As said, don't store the data in memory:
      - Keys and values regularly have to be written to disk, so they must be serialized.
      - Hadoop provides its own serialization mechanism.
      - Any class used as a key or value has to implement the Writable interface.
    • Java types and their Hadoop serialized counterparts:

      Java type   Hadoop serialized type
      String      Text
      Integer     IntWritable
      Long        LongWritable
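    Hadoop's Writable contract is built on java.io.DataOutput / DataInput (a Writable implements write(DataOutput) and readFields(DataInput)). A plain-Java sketch of the idea, round-tripping a <word, count> pair through a byte stream without Hadoop on the classpath (the exact byte layout here differs from Hadoop's Text/LongWritable encoding):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SerializationSketch {
    // Serialize a <word, count> pair the way a Writable would: via DataOutput.
    public static byte[] write(String word, long count) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(word);   // roughly what Text does for a String
            out.writeLong(count); // roughly what LongWritable does for a long
        }
        return bytes.toByteArray();
    }

    // Deserialize, mirroring Writable.readFields(DataInput).
    public static Object[] read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new Object[] { in.readUTF(), in.readLong() };
        }
    }
}
```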
    • Let's try to execute the following commands:

      hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount
      hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output>

      What does this code do?
    • Switch to Eclipse.