Processing Big Data (Chapter 3, SC 11 Tutorial)

This is Chapter 3 of a tutorial that I gave at SC 11 on November 14, 2011.
An Introduction to Data Intensive Computing
Chapter 3: Processing Big Data

Robert Grossman, University of Chicago & Open Data Group
Collin Bennett, Open Data Group
November 14, 2011

1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines & message queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)

Section 3.1 Processing Big Data Using Utility and Data Clouds

A Google production rack of servers from about 1999.

• How do you do analytics over commodity disks and processors?
• How do you improve the efficiency of programmers?
Serial & SMP Algorithms

[Diagram: a serial algorithm runs a single task against a local disk; a symmetric multiprocessing (SMP) algorithm runs several tasks sharing a local disk and memory.]

Pleasantly (= Embarrassingly) Parallel

[Diagram: many independent tasks, each with its own local disk, coordinated with MPI.]
• Need to partition data, start tasks, collect results.
• Often tasks are organized into a DAG.
How Do You Program A Data Center?

The Google Data Stack
• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)

Google's Large Data Cloud
• Applications
• Compute services: Google's MapReduce
• Data services: Google's BigTable
• Storage services: Google File System (GFS)
Google's early data stack, circa 2000.

Hadoop's Large Data Cloud (Open Source)
• Applications
• Compute services: Hadoop's MapReduce
• Data services: NoSQL, e.g. HBase
• Storage services: Hadoop Distributed File System (HDFS)

A very nice recent book by Barroso and Hölzle.
The Amazon Data Stack

"Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados." — SOSP '07

Amazon Style Data Cloud

[Diagram: a load balancer in front of racks of EC2 instances; the Simple Queue Service and SimpleDB (SDB) coordinate the instances, with S3 providing the storage services.]
Open Source Versions
• Eucalyptus
  – Ability to launch VMs
  – S3-like storage
• OpenStack
  – Ability to launch VMs
  – S3-like storage (Swift)
• Cassandra
  – Key-value store like S3
  – Columns like BigTable
• Many other open source Amazon-style services available.

Some Programming Models for Data Centers
• Operations over a data center of disks
  – MapReduce ("string-based" scans of data)
  – User-defined functions (UDFs) over the data center
  – Launch VMs that all have access to highly scalable and available disk-based data
  – SQL and NoSQL over the data center
• Operations over a data center of memory
  – Grep over distributed memory
  – UDFs over distributed memory
  – Launch VMs that all have access to highly scalable and available memory-based data
  – SQL and NoSQL over distributed memory
Section 3.2 Processing Data By Scaling Out Virtual Machines

Processing Big Data Pattern 1: Launch Independent Virtual Machines and Task Them with a Messaging Service

Task with Messaging Service & Use S3 (Variant 1)
[Diagram: a control VM launches and tasks worker VMs through a messaging service (AWS SQS, an AMQP service, etc.); the worker VMs read and write S3.]

Task with Messaging Service & Use NoSQL DB (Variant 2)
[Diagram: the same control VM, messaging service, and worker VMs, with AWS SimpleDB as the shared store.]

Task with Messaging Service & Use Clustered FS (Variant 3)
[Diagram: the same control VM, messaging service, and worker VMs, with GlusterFS as the shared store.]
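The control-VM/worker pattern shared by the three variants above can be sketched in a few lines. This is a minimal single-process sketch, not production code: an in-process queue stands in for the messaging service (SQS, AMQP, etc.), threads stand in for worker VMs, and a lock-guarded dict stands in for the shared store (S3, SimpleDB, or a clustered file system).

```python
# Minimal sketch of Pattern 1: a "control VM" enqueues tasks on a messaging
# service and "worker VMs" consume them, writing results to a shared store.
# queue.Queue stands in for SQS/AMQP, threads for worker VMs, and a dict
# guarded by a lock for S3/SimpleDB/a clustered file system.
import queue
import threading

tasks = queue.Queue()
store = {}
store_lock = threading.Lock()

def worker():
    while True:
        task = tasks.get()
        if task is None:          # sentinel: control node says "shut down"
            break
        name, payload = task
        result = sum(payload)     # placeholder for real per-task work
        with store_lock:
            store[name] = result
        tasks.task_done()

# Control node: launch workers, enqueue tasks, then send shutdown sentinels.
workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for i in range(6):
    tasks.put(("task-%d" % i, [i, i + 1]))
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()
# store now holds one result per task, e.g. store["task-0"] == 1
```

The design point the variants share: workers are stateless and interchangeable, so the only coordination is the queue and the shared store, and adding capacity means launching more worker VMs.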
Section 3.3 MapReduce

Google 2004 technical report.

Core Concepts
• Data are (key, value) pairs and that's it
• Partition data over commodity nodes filling racks in a data center.
• Software handles failures, restarts, etc. This is the hard part.
• Basic examples:
  – Word count
  – Inverted index
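The "partition data over commodity nodes" step above can be sketched directly. This is a hypothetical stand-in for illustration, not Hadoop's actual partitioner: pairs are assigned to nodes by a stable hash of the key, so identical keys always land on the same node.

```python
# Sketch: partition (key, value) pairs across nodes by hashing the key,
# in the spirit of MapReduce's default hash partitioner. Illustration only.
import hashlib

def partition(pairs, num_nodes):
    """Assign each (key, value) pair to a node by a stable hash of the key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in pairs:
        # A stable hash (unlike Python's built-in hash, which is salted per run).
        h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        nodes[h % num_nodes].append((key, value))
    return nodes

nodes = partition([("best", 1), ("worst", 1), ("best", 1)], num_nodes=4)
# Identical keys always land on the same node, so counts can be merged locally.
```

Because the assignment depends only on the key, the reduce step for any given key can run entirely on one node.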
Processing Big Data Pattern 2: MapReduce

[Diagram: map tasks, managed by task trackers, read from HDFS and write intermediate (key, value) pairs to local disk; the shuffle & sort phase moves the pairs to reduce tasks, which write their output back to HDFS.]

Example: Word Count & Inverted Index
• How do you count the words in a million books?
  – (best, 7)
• Inverted index:
  – (best; page 1, page 82, …)
  – (worst; page 1, page 12, …)
Cover of serial Vol. V, 1859, London.

• Assume you have a cluster of 50 computers, each with an attached local disk and half full of web pages.
• What is a simple parallel programming framework that would support the computation of word counts and inverted indices?
Basic Pattern: Strings
1. Extract words from web pages in parallel.
2. Hash and sort words.
3. Count (or construct inverted index) in parallel.

What about data records?
Strings:  1. Extract words from web pages in parallel.  2. Hash and sort words.  3. Count (or construct inverted index) in parallel.
Records:  1. Extract binned field values from data records in parallel.  2. Hash and sort binned field values.  3. Count (or construct inverted index) in parallel.
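The record-oriented row of the pattern can be sketched as follows. The records and the `age` field are hypothetical illustrations (not from the tutorial's data); the point is that "extract a binned field value" plays exactly the role "extract a word" plays for strings.

```python
# Sketch of steps 1-3 for data records: extract a binned field value from
# each record, then group and count identical bins. The records and the
# "age" field are made up for illustration.
from collections import Counter

def extract_binned(record, field="age", bin_width=10):
    """Step 1: map a record to a binned field value like 'age:30-39'."""
    v = record[field]
    lo = (v // bin_width) * bin_width
    return "%s:%d-%d" % (field, lo, lo + bin_width - 1)

records = [{"age": 34}, {"age": 37}, {"age": 52}]
# Steps 2-3: hashing groups identical binned values (Counter is hash-based),
# and each bin is counted independently, so the counting parallelizes.
counts = Counter(extract_binned(r) for r in records)
```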
Map-Reduce Example
• Input is files with one document per record
• User specifies map function
  – key = document URL
  – value = document contents
Input of map: ("doc cdickens two cities", "it was the best of times")
Output of map: ("it", 1), ("was", 1), ("the", 1), ("best", 1)

Example (cont'd)
• MapReduce library gathers together all pairs with the same key value (shuffle/sort phase)
• The user-defined reduce function combines all the values associated with the same key
Input of reduce: key = "it", values = 1, 1; key = "was", values = 1, 1; key = "best", values = 1; key = "worst", values = 1
Output of reduce: ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
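The two slides above can be simulated end to end in plain Python. This is a single-process sketch of map, shuffle/sort, and reduce, not the distributed implementation; the document text is made up in the spirit of the slide's example.

```python
# Single-process sketch of MapReduce word count: map emits (word, 1) pairs,
# the shuffle/sort phase groups pairs by key, and reduce sums the values.
from itertools import groupby
from operator import itemgetter

def map_fn(url, contents):
    for word in contents.split():
        yield (word, 1)

def shuffle_sort(pairs):
    # groupby needs sorted input; sorting by key is exactly the sort phase.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_fn(word, values):
    return (word, sum(values))

docs = {"doc cdickens two cities":
        "it was the best of times it was the worst of times"}
pairs = [p for url, text in docs.items() for p in map_fn(url, text)]
result = dict(reduce_fn(w, (v for _, v in group))
              for w, group in shuffle_sort(pairs))
# result maps each word to its count, e.g. result["it"] == 2
```

The structure mirrors the real system: only `map_fn` and `reduce_fn` are user code; the sort/group step is what the framework provides.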
Why Is Word Count Important?
• It is one of the most important examples for the type of text processing often done with MapReduce.
• There is an important mapping (inversion):
    document <-----> data record
    words <-----> (field, value) pairs
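The inverted-index side of this mapping uses the same pattern with one change: map emits (word, page) pairs instead of (word, 1), and reduce collects the page list instead of summing. A minimal sketch, with page contents made up for illustration:

```python
# Sketch: inverted index with the same map/group/reduce pattern as word count.
# Map emits (word, page_id); reduce collects the sorted page list per word.
# The page contents are hypothetical.
from collections import defaultdict

pages = {1: "best of times", 12: "worst of times", 82: "the best"}

index = defaultdict(set)
for page_id, text in pages.items():      # map phase
    for word in text.split():
        index[word].add(page_id)         # grouping by word = shuffle phase

inverted = {w: sorted(ids) for w, ids in index.items()}  # reduce phase
# inverted["best"] -> [1, 82]; inverted["worst"] -> [12]
```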
                Pleasantly Parallel        MapReduce
Data structure  Arbitrary                  (key, value) pairs
Functions       Arbitrary                  Map & Reduce
Middleware      MPI (message passing)      Hadoop
Ease of use     Difficult                  Medium
Scope           Wide                       Narrow
Challenge       Getting something working  Moving to MapReduce
Common MapReduce Design Patterns
• Word count
• Inversion – inverted index
• Computing simple statistics
• Computing windowed statistics
• Sparse matrix (document-term, data record-FieldBinValue, …)
• Site-entity statistics
• PageRank
• Partitioned and ensemble models
• EM
Section 3.4 User Defined Functions over DFS

sector.sf.net

Processing Big Data Pattern 3: User Defined Functions over Distributed File Systems

Sector/Sphere
• Sector/Sphere is a platform for data intensive computing.

Idea 1: Apply User Defined Functions (UDF) to Files in a Distributed File System
[Diagram: map/shuffle and reduce stages, each implemented as a UDF.]
This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System.
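Idea 1 can be sketched locally: apply a user-defined function independently to each file of a partitioned dataset, in parallel. This is an illustration of the idea only, not Sector/Sphere's actual API; a thread pool stands in for the slave nodes, in-memory strings stand in for files, and the UDF is a simple word counter.

```python
# Sketch of Idea 1: apply a user-defined function (UDF) independently to each
# file-like partition of a dataset, in parallel. A thread pool stands in for
# the slave nodes of the distributed file system. Illustrative only.
from concurrent.futures import ThreadPoolExecutor

partitions = {                  # file name -> file contents (in-memory stand-in)
    "part-0": "it was the best of times",
    "part-1": "it was the worst of times",
}

def udf(item):
    """User-defined function applied to one file: returns (name, word count)."""
    name, contents = item
    return name, len(contents.split())

with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(udf, partitions.items()))
# results == {"part-0": 6, "part-1": 6}
```

The generalization over MapReduce is visible here: the UDF can be any function of a whole file, not just a map over (key, value) pairs.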
Idea 2: Add Security From the Start
• The security server maintains information about users and slaves.
• User access control: password and client IP address.
• File level access control.
• Messages are encrypted over SSL. A certificate is used for authentication.
• Sector is a good basis for HIPAA compliant applications.
[Diagram: the client and master server each talk to the security server over SSL; data flows directly between the client and the slaves.]

Idea 3: Extend the Stack to Include Network Transport Services
• Google, Hadoop: compute services / data services / storage services.
• Sector: compute services / data services / storage services / routing & transport services.
Section 3.5 Computing With Streams: Warming Up With Means and Variances

Warm Up: Partitioned Means
Step 1. Compute local (Σ xi, Σ xi², ni) in parallel for each partition.
Step 2. Compute the global mean and variance from these tuples.
• Means and variances cannot be computed naively when the data is in distributed partitions.

Trivial Observation 1
If si = Σ xi is the i'th local sum, then the global mean = Σ si / Σ ni.
• If only the local means for each partition are passed (without the corresponding counts), then there is not enough information to compute the global mean.
• The same trick works for variance, but you need to pass the triples (Σ xi, Σ xi², ni).
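Both observations can be checked directly: compute the triple (Σ xi, Σ xi², ni) on each partition, then combine the triples into a global mean and (population) variance. A minimal sketch with small made-up partitions:

```python
# Sketch of the partitioned means/variances warm-up: each partition
# contributes the triple (sum, sum of squares, count); the global mean and
# population variance are recovered exactly from the triples.
def local_triple(xs):
    return (sum(xs), sum(x * x for x in xs), len(xs))

def combine(triples):
    s = sum(t[0] for t in triples)    # global sum of xi
    ss = sum(t[1] for t in triples)   # global sum of xi^2
    n = sum(t[2] for t in triples)    # global count
    mean = s / n
    variance = ss / n - mean * mean   # E[x^2] - (E[x])^2
    return mean, variance

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, var = combine([local_triple(p) for p in partitions])
# mean == 3.5, the same as the mean of the pooled data [1, 2, 3, 4, 5, 6]
```

Note that only three numbers per partition cross the network, regardless of partition size, which is exactly the point of Observation 2 below.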
Trivial Observation 2
• To reduce the data passed over the network, combine appropriate statistics as early as possible.
• Consider the average. Recall that with MapReduce there are 4 steps (Map, Shuffle, Sort and Reduce), and Reduce pulls data from the local disk on which Map ran.
• A Combine step in MapReduce combines local data before it is pulled for the Reduce step.
• There are built-in combiners for counts, means, etc.
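A combiner for averages can be sketched as below (a stand-alone illustration, not Hadoop's combiner API). Averaging averages directly would be wrong when partitions differ in size, but (sum, count) pairs merge associatively, so the same function serves as both combiner and reducer without changing the answer.

```python
# Sketch of a Combine step for averages: instead of shipping every value to
# the reducer, each map node pre-combines its local values into a single
# (sum, count) pair. Pairs merge associatively, so the combiner shrinks
# network traffic without changing the result.
def combine_pairs(pairs):
    """Merge (sum, count) pairs -- usable as both combiner and reducer."""
    total = sum(p[0] for p in pairs)
    count = sum(p[1] for p in pairs)
    return (total, count)

# Two map nodes each pre-combine their local values for one key.
node_a = combine_pairs([(v, 1) for v in [1.0, 2.0, 3.0]])   # -> (6.0, 3)
node_b = combine_pairs([(v, 1) for v in [4.0, 5.0]])        # -> (9.0, 2)

# The reducer merges one pair per node instead of one pair per value.
s, n = combine_pairs([node_a, node_b])
average = s / n   # 3.0, identical to averaging all five values directly
```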
Section 3.6 Hadoop Streams

Processing Big Data Pattern 4: Streams over Distributed File Systems

Hadoop Streams
• In addition to the Java API, Hadoop offers
  – A streaming interface for any language that supports reading and writing to standard in and out
  – Pipes for C++
• Why would I want to use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to
  – C++ libraries like Boost and the GNU Scientific Library (GSL)
  – R modules

Pros and Cons
• Java
  + Best documented
  + Largest community
  – More LOC per MR job
• Python
  + Efficient memory handling
  + Programmers can be very efficient
  – Limited logging / debugging
• R
  + Vast collection of statistical algorithms
  – Poor error handling and memory handling
  – Less familiar to developers
Word Count Python Mapper

import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, separator, total_count)

if __name__ == "__main__":
    main()
MalStone Benchmark

                           MalStone A    MalStone B
Hadoop MapReduce           455m 13s      840m 50s
Hadoop Streams (Python)    87m 29s       142m 32s
C++ implemented UDFs       33m 40s       43m 44s

Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. Data consisted of 20 nodes with 500 million 100-byte records per node.
Word Count R Mapper

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# helper assumed by the mapper: split a line on whitespace
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)
Word Count R Reducer

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count

Word Count R Reducer (cont'd)

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)

for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
Word Count Java Mapper

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Word Count Java Reducer

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Code Comparison – Word Count Mapper

[Side-by-side comparison of the Python, Java, and R word count mappers from the preceding slides.]

Code Comparison – Word Count Reducer

[Side-by-side comparison of the Python, Java, and R word count reducers from the preceding slides.]
Questions?

For the most current version of these notes, see rgrossman.com