Map Reduce An Introduction

How Map Reduce evolved

  1. Nagarjuna K, nagarjuna@outlook.com
  2. • Understanding MapReduce
     • MapReduce: an introduction
     • Word count: default
     • Word count: custom
  3. • A programming model for processing large datasets
     • Supported languages for MR
       - Java
       - Ruby
       - Python
       - C++
     • MapReduce programs are inherently parallel
       - More data → more machines to analyze it
       - No need to change anything in the code
  4. • Start with the WORDCOUNT example
       - Input: "Do as I say, not as I do"

         Word   Count
         As     2
         Do     2
         I      2
         Not    1
         Say    1
  5. define wordCount as Map<String, long>;
     for each document in documentSet {
         T = tokenize(document);
         for each token in T {
             wordCount[token]++;
         }
     }
     display(wordCount);

     • This works only as long as the number of documents to process is not very large
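The pseudocode above can be sketched in Python as follows. This is a minimal single-machine version; the `tokenize` rule (lowercase, letters and apostrophes only) is an assumption, since the slides do not define tokenization.

```python
import re
from collections import Counter


def tokenize(document):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", document.lower())


def word_count(document_set):
    """Count every token across all documents, entirely in memory."""
    counts = Counter()
    for document in document_set:
        counts.update(tokenize(document))
    return counts


if __name__ == "__main__":
    counts = word_count(["Do as I say, not as I do"])
    for word, count in sorted(counts.items()):
        print(word, count)
```

Everything lives in one `Counter` in RAM on one machine, which is exactly the limitation the next slides work around.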
  6. • Spam filter
       - Millions of emails
       - Word count for analysis
     • Working from a single computer is time consuming
     • Rewrite the program to count across multiple machines
  7. • How do we attain parallel computing?
       1. Each machine processes a fraction of the documents
       2. Combine the results from all the machines
  8. STAGE 1
     define wordCount as Map<String, long>;
     for each document in documentSubset {
         T = tokenize(document);
         for each token in T {
             wordCount[token]++;
         }
     }
  9. STAGE 2
     define totalWordCount as Multiset;
     for each wordCount received from firstPhase {
         multisetAdd(totalWordCount, wordCount);
     }
     display(totalWordCount);
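A rough Python sketch of the two stages, run here in a single process for illustration. The round-robin `split_documents` helper and the tokenizing regex are assumptions; the slides only say that each machine gets a subset of the documents.

```python
import re
from collections import Counter


def count_subset(document_subset):
    """Stage 1: one machine counts the words in its share of the documents."""
    counts = Counter()
    for document in document_subset:
        counts.update(re.findall(r"[a-z']+", document.lower()))
    return counts


def merge_counts(partial_counts):
    """Stage 2: a single machine adds up all the partial results."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def split_documents(documents, machines):
    """Round-robin split (hypothetical, deliberately naive segregation rule)."""
    return [documents[i::machines] for i in range(machines)]


if __name__ == "__main__":
    documents = ["Do as I say", "not as I do", "say what you do"]
    subsets = split_documents(documents, machines=2)
    # On a real cluster each call below would run on a different machine.
    partials = [count_subset(subset) for subset in subsets]
    print(merge_counts(partials).most_common(3))
```

Stage 1 scales with the number of machines, while `merge_counts` still runs in one place.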
  10. [Diagram: Master, Documents, and the worker machines Comp-1 to Comp-4]
  11. Problems: STAGE 1
      • Splitting the documents among machines has to be well defined
      • Bottleneck in network transfer
        - The processing is data intensive
        - It is not computationally intensive
        - So it is better to store the files on the processing machines
      • BIGGEST FLAW
        - The words and counts are stored in memory
        - A disk-based hash-table implementation is needed
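One way to picture a disk-based hash table is Python's standard `shelve` module, which stores a dictionary in a file. This is only an illustration of the idea under that assumption, not what Hadoop does internally; the function names and file name are made up for the example.

```python
import os
import shelve
import tempfile


def count_to_disk(tokens, path):
    """Accumulate counts in a disk-backed table instead of an in-memory dict."""
    with shelve.open(path) as table:
        for token in tokens:
            table[token] = table.get(token, 0) + 1


def read_counts(path):
    """Load the whole table back; only sensible for small demo data."""
    with shelve.open(path) as table:
        return dict(table)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "wordcount")
        count_to_disk("do as i say not as i do".split(), path)
        print(read_counts(path))
```

Each update goes through the file, so the table can grow past available RAM at the cost of slower access.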
  12. Problems: STAGE 2
      • Phase 2 has only one machine
        - Bottleneck
        - Phase 1 is highly distributed, though
      • Make phase 2 distributed as well
        - Needs changes in phase 1
      • Partition the phase-1 output (say, based on the first character of the word)
        - We then have 26 machines in phase 2
        - The single disk-based hash table becomes 26 disk-based hash tables
        - wordcount-a, wordcount-b, wordcount-c, ...
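  13. [Diagram: Master and Documents feed the phase-1 machines Comp-1 to Comp-4. Each phase-1 machine produces per-letter partial counts (for example A=1, B=2, C=4, D=5, E=10 on one machine and A=10, B=20, C=40, D=5, E=9 on another), which are sent to the phase-2 machines Comp-10, Comp-20, Comp-30 and Comp-40]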
  14. • After phase 1
        - From comp-1
          - WordCount-A → comp-10
          - WordCount-B → comp-20
          - ...
      • Each machine in phase 1 shuffles its output to different machines in phase 2
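The partition-and-shuffle step can be sketched like this. Partitioning by first character follows the slides; the function names and the in-process "inbox" standing in for network transfer are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def partition_key(word):
    """Choose the partition from the first character of the word."""
    return word[0].lower()


def partition_counts(word_counts):
    """Split one phase-1 machine's counts into per-letter buckets."""
    buckets = defaultdict(Counter)
    for word, count in word_counts.items():
        buckets[partition_key(word)][word] += count
    return buckets


def shuffle(all_machine_buckets):
    """Group buckets by partition: what each phase-2 machine receives."""
    inbox = defaultdict(list)
    for buckets in all_machine_buckets:
        for key, counts in buckets.items():
            inbox[key].append(counts)
    return inbox


def reduce_partition(received):
    """Phase 2: one machine sums the buckets for its own partition."""
    total = Counter()
    for counts in received:
        total.update(counts)
    return total


if __name__ == "__main__":
    comp1 = {"as": 1, "do": 1, "data": 4}   # phase-1 output of comp-1
    comp2 = {"as": 1, "disk": 2}            # phase-1 output of comp-2
    inbox = shuffle([partition_counts(comp1), partition_counts(comp2)])
    for letter in sorted(inbox):
        print(letter, dict(reduce_partition(inbox[letter])))
```

All counts for a given word end up on the same phase-2 machine, so each partition can be reduced independently of the others.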
  15. • This is getting complicated
        - Store the files where they are being processed
        - Write a disk-based hash table to get around RAM limitations
        - Partition the phase-1 output
        - Shuffle the phase-1 output and send it to the appropriate reducer
  16. • This is a lot of work for a word count
      • We haven't even touched fault tolerance
        - What if comp-1 or comp-10 fails?
      • So we need a framework that takes care of all these things
        - We concentrate only on the business logic
  17. [Diagram: Documents stored in HDFS are read by the MAPPER machines Comp-1 to Comp-4. Their interim output (per-key counts such as A=1, B=2, C=4, D=5, E=10) goes through partitioning and shuffling to the REDUCER machines Comp-10, Comp-20, Comp-30 and Comp-40. The Master is shown alongside]
  18. • Mapper
      • Reducer
      • The mapper filters and transforms the input
      • The reducer collects the mapper output and aggregates it
      • Extensive research went into arriving at this two-phase strategy
  19. • Mapper, Reducer, Partitioner, Shuffling
        - Work together → a common structure for data processing

                  Input            Output
        Mapper    <K1, V1>         List<K2, V2>
        Reducer   <K2, list(V2)>   List<K3, V3>
  20. • Mapper
        - Input: <key, words_per_line>
        - Output: <word, 1>
      • Reducer
        - Input: <word, list(1)>
        - Output: <word, count(list(1))>
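The mapper and reducer signatures for word count can be simulated in plain Python. This is a sketch, not Hadoop code: `run_job` is a hypothetical stand-in for the framework's grouping step, and using the line number as the mapper key is an assumption.

```python
from collections import defaultdict


def mapper(key, line):
    """<K1, V1> = <line number, line text>  ->  list of <word, 1>."""
    return [(word.lower(), 1) for word in line.split()]


def reducer(word, ones):
    """<K2, list(V2)> = <word, [1, 1, ...]>  ->  list of <word, count>."""
    return [(word, sum(ones))]


def run_job(lines):
    """Stand-in for the framework: map, group values by key, then reduce."""
    grouped = defaultdict(list)
    for line_number, line in enumerate(lines):
        for word, one in mapper(line_number, line):
            grouped[word].append(one)
    output = []
    for word in sorted(grouped):
        output.extend(reducer(word, grouped[word]))
    return output


if __name__ == "__main__":
    for word, count in run_job(["do as I say", "not as I do"]):
        print(word, count)
```

Only `mapper` and `reducer` carry the word-count logic; everything inside `run_job` is the part a framework would provide.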
  21. • As said, don't keep the data in memory
        - So keys and values regularly have to be written to disk
        - They must be serialized
        - Hadoop provides its own serialization mechanism
        - Any class used as a key or value has to implement the Writable interface
  22. Java type   Hadoop serialized type
      String      Text
      Integer     IntWritable
      Long        LongWritable
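To make the idea of serializable key and value types concrete, here is a toy Python analogy. The class names `IntBox` and `TextBox` are hypothetical, and the byte layout (4-byte big-endian integers, length-prefixed UTF-8) is chosen for illustration; it should not be read as Hadoop's actual wire format.

```python
import io
import struct


class IntBox:
    """Toy counterpart of an integer value type: 4-byte big-endian int."""

    def __init__(self, value=0):
        self.value = value

    def write(self, out):
        out.write(struct.pack(">i", self.value))

    def read_fields(self, src):
        (self.value,) = struct.unpack(">i", src.read(4))


class TextBox:
    """Toy counterpart of a string key type: length-prefixed UTF-8 bytes."""

    def __init__(self, value=""):
        self.value = value

    def write(self, out):
        data = self.value.encode("utf-8")
        out.write(struct.pack(">i", len(data)))
        out.write(data)

    def read_fields(self, src):
        (length,) = struct.unpack(">i", src.read(4))
        self.value = src.read(length).decode("utf-8")


if __name__ == "__main__":
    buffer = io.BytesIO()
    TextBox("hadoop").write(buffer)   # key
    IntBox(42).write(buffer)          # value
    buffer.seek(0)
    key, value = TextBox(), IntBox()
    key.read_fields(buffer)
    value.read_fields(buffer)
    print(key.value, value.value)
```

Each type knows how to write itself to a byte stream and read itself back, which is what lets a framework spill keys and values to disk or send them over the network.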
  23. • Let's try to execute the following commands
        - hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount
        - hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output>
      • What does this code do?
  24. • Switch to Eclipse
